# DB Observability and Resilience

Sources:
- https://docs.sqlalchemy.org/en/21/core/pooling.html
- https://docs.sqlalchemy.org/en/21/core/engines.html
- https://docs.sqlalchemy.org/en/21/core/events.html
- https://fastapi.tiangolo.com/advanced/events/

Status: adopted
Decision level: mandatory
Applies to: api-runtime, workers, tests
Last reviewed: 2026-06-17

---

## Purpose

Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.

Goals:

- detect and recover from stale/disconnected connections,
- expose useful diagnostics for pool/engine behavior,
- make readiness/liveness signals meaningful.

---

## Scope and Non-Goals

- In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
- Out of scope: full APM stack design and vendor-specific monitoring platform setup.

---

## Rules

- Enable a connection liveness strategy (`pool_pre_ping=True`) for long-running services.
- Keep DB health checks out of liveness; include dependency checks in readiness.
- Centralize engine options and logging configuration.
- Avoid noisy SQL debug logging in production defaults.
- Treat disconnect handling as a first-class test scenario.

---

## Recommended Baseline

```python
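from sqlalchemy.ext.asyncio import create_async_engine

# `settings` is the application's configuration object, defined elsewhere.
engine = create_async_engine(
    settings.database_url,
    pool_pre_ping=True,  # test each connection on checkout
    # Tune only from measured behavior:
    # pool_size=10,
    # max_overflow=20,
    # pool_timeout=30,
    # pool_recycle=1800,
)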
```

Operational guidance:

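- Set `pool_pre_ping=True` so the pool tests each connection on checkout and replaces stale ones before the application uses them.
- Introduce `pool_recycle` (seconds) when the backend or network closes idle connections; keep it below that idle timeout.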
- Use structured app logs with request correlation and error context.
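
A minimal sketch of centralized logging configuration, assuming the standard `logging` module is already wired to the app's handlers. `sqlalchemy.engine` and `sqlalchemy.pool` are SQLAlchemy's own logger names; the helper name `configure_db_logging` and its flags are hypothetical.

```python
import logging

# Logger names used by SQLAlchemy itself.
SQL_LOGGER = "sqlalchemy.engine"  # INFO logs SQL statements, DEBUG adds result rows
POOL_LOGGER = "sqlalchemy.pool"   # INFO logs invalidate/recycle, DEBUG adds checkouts/checkins


def configure_db_logging(debug_sql: bool = False, debug_pool: bool = False) -> None:
    """Set SQLAlchemy logger levels in one place (hypothetical helper).

    Production defaults stay at WARNING so SQL text and pool chatter
    are only emitted when explicitly enabled.
    """
    sql_level = logging.INFO if debug_sql else logging.WARNING
    pool_level = logging.DEBUG if debug_pool else logging.WARNING
    logging.getLogger(SQL_LOGGER).setLevel(sql_level)
    logging.getLogger(POOL_LOGGER).setLevel(pool_level)
```

Calling this once at startup, next to engine creation, keeps the quiet default explicit and makes temporary debugging a configuration change rather than a code change.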

---

## Health Endpoint Policy

- `/healthz`: process is alive; no DB call required.
- `/readyz`: application can currently serve traffic; include DB connectivity verification.

Readiness checks should be lightweight and bounded (timeouts), not heavy diagnostic queries.
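
One way to keep a readiness check bounded is to run a cheap ping under a hard deadline. The sketch below is framework-agnostic: `check_db_ready`, its `ping` argument, and the one-second default are assumptions for illustration, not part of FastAPI or SQLAlchemy.

```python
import asyncio
from typing import Awaitable, Callable


async def check_db_ready(
    ping: Callable[[], Awaitable[None]],
    timeout_s: float = 1.0,
) -> tuple[bool, str]:
    """Run a cheap DB ping under a deadline and return (ready, reason)."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The DB did not answer in time; report not ready instead of hanging.
        return False, "timeout"
    except Exception as exc:
        # Any driver or pool error means the app cannot serve DB traffic now.
        return False, type(exc).__name__
    return True, "ok"
```

In a service following this policy, `ping` would typically open a connection from the shared engine and execute `SELECT 1`, and the `/readyz` handler would return HTTP 503 whenever `ready` is false. `/healthz` would not call this function at all.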

---

## Failure Handling Guidance

- Handle transient disconnects by letting the pool invalidate the broken connection and reconnect on the next checkout.
- Keep one failed request from cascading into broad app instability.
- Log DB errors with enough context (operation, request correlation ID, error class) to debug them.
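
The guidance above can be sketched as a bounded retry around a unit of work. This assumes the operation is idempotent and acquires a fresh connection or session on each attempt. SQLAlchemy's `DBAPIError` carries a `connection_invalidated` attribute, which is read here by duck typing; `run_with_reconnect`, `is_disconnect`, and the retry and backoff defaults are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def is_disconnect(exc: BaseException) -> bool:
    """True when the error reports that its connection was invalidated.

    Exceptions without a `connection_invalidated` attribute are treated
    as non-transient and are never retried.
    """
    return bool(getattr(exc, "connection_invalidated", False))


async def run_with_reconnect(
    operation: Callable[[], Awaitable[T]],
    retries: int = 1,
    backoff_s: float = 0.05,
) -> T:
    """Retry an idempotent unit of work after a detected disconnect."""
    attempt = 0
    while True:
        try:
            return await operation()
        except Exception as exc:
            # Give up on non-disconnect errors or once retries are exhausted.
            if attempt >= retries or not is_disconnect(exc):
                raise
            attempt += 1
            await asyncio.sleep(backoff_s * attempt)
```

Keeping the retry count small limits the blast radius: a single failing request is retried once and then fails fast, rather than piling load onto a database that is already struggling.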

---

## Anti-Patterns

- No readiness check for DB-dependent services.
- Permanent debug SQL echo in production.
- Per-handler ad hoc pool settings.
- Assuming disconnect events are too rare to test.

---

## Operational Checks

- Engine creation is centralized and configured once.
- Liveness/readiness behavior is documented and validated.
- Pool settings are explicit, versioned, and reviewed.
- DB-related errors produce actionable logs.

---

## Testing Checks

- Readiness endpoint test covers healthy and unhealthy DB states.
- Integration test simulates disconnect/reconnect behavior.
- Load/concurrency tests validate pool behavior under stress.

---

## Migration Notes

- Start with resilient defaults (`pool_pre_ping=True`) and a simple health policy.
- Add deeper metrics/event hooks incrementally once baseline reliability is in place.
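
As a first incremental step toward pool metrics, the counters that SQLAlchemy's `QueuePool` already exposes (`size()`, `checkedin()`, `checkedout()`, `overflow()`) can be sampled into a flat dict. The helper name `pool_snapshot` and the key names are assumptions; the sketch takes any object with those four methods so it can be exercised without a database.

```python
from typing import Any


def pool_snapshot(pool: Any) -> dict[str, int]:
    """Collect QueuePool-style counters as a dict for structured logs or metrics."""
    return {
        "size": pool.size(),                # configured pool size
        "checked_in": pool.checkedin(),     # idle connections held by the pool
        "checked_out": pool.checkedout(),   # connections currently in use
        "overflow": pool.overflow(),        # current overflow counter
    }
```

A caller would pass the pool of the shared engine (for example `engine.pool`) and emit the result periodically or alongside DB error logs, which is usually enough to spot pool exhaustion before adding event hooks.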