# DB Observability and Resilience

Source:
- https://docs.sqlalchemy.org/en/21/core/pooling.html
- https://docs.sqlalchemy.org/en/21/core/engines.html
- https://docs.sqlalchemy.org/en/21/core/events.html
- https://fastapi.tiangolo.com/advanced/events/

Status: adopted
Decision level: mandatory
Applies to: api-runtime, workers, tests
Last reviewed: 2026-06-17

---

## Purpose

Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.

Goals:
- detect and recover from stale/disconnected connections,
- expose useful diagnostics for pool/engine behavior,
- make readiness/liveness signals meaningful.

---

## Scope and Non-Goals

- In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
- Out of scope: full APM stack design and vendor-specific monitoring platform setup.

---

## Rules

- Enable a connection liveness strategy (`pool_pre_ping=True`) for long-running services.
- Keep DB health checks out of liveness; include dependency checks in readiness.
- Centralize engine options and logging configuration.
- Avoid noisy SQL debug logging in production defaults.
- Treat disconnect handling as a first-class test scenario.

---

## Recommended Baseline

```python
from sqlalchemy.ext.asyncio import create_async_engine

# `settings` is the application's configuration object, defined elsewhere.
engine = create_async_engine(
    settings.database_url,
    pool_pre_ping=True,
    # Tune only from measured behavior:
    # pool_size=10,
    # max_overflow=20,
    # pool_timeout=30,
    # pool_recycle=1800,
)
```

Operational guidance:
- Set `pool_pre_ping=True` for stale-connection resilience: each connection is tested on checkout and replaced if it is no longer usable.
- Introduce `pool_recycle` where backend or network idle-timeout behavior warrants it.
- Use structured app logs with request correlation and error context.

---

## Health Endpoint Policy

- `/healthz`: process is alive; no DB call required.
- `/readyz`: application can currently serve traffic; include DB connectivity verification.

Readiness checks should be lightweight and bounded by a timeout, not heavy diagnostic queries.

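
The bounded readiness check described in the health endpoint policy can be sketched as a small framework-agnostic helper. This is a minimal sketch, not a prescribed implementation: `check_readiness`, its `ping` parameter, the 1-second default timeout, and the response body shape are all illustrative assumptions. The commented wiring at the bottom shows one plausible way to connect it to a FastAPI route and the centralized engine.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def check_readiness(
    ping: Callable[[], Awaitable[None]],
    timeout_s: float = 1.0,  # assumed default; tune from measured behavior
) -> Tuple[int, Dict[str, str]]:
    """Run a lightweight DB ping and map the outcome to (HTTP status, body)."""
    try:
        # Bound the check so a hung database cannot stall the probe.
        await asyncio.wait_for(ping(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, {"status": "unavailable", "reason": "db ping timed out"}
    except Exception as exc:
        # Report only the error type; details belong in logs, not the probe body.
        return 503, {"status": "unavailable", "reason": type(exc).__name__}
    return 200, {"status": "ready"}


# Hypothetical wiring (assumes the `engine` from the baseline and a FastAPI `app`):
#
#   from sqlalchemy import text
#
#   async def ping_db() -> None:
#       async with engine.connect() as conn:
#           await conn.execute(text("SELECT 1"))
#
#   @app.get("/readyz")
#   async def readyz(response: Response):
#       status, body = await check_readiness(ping_db)
#       response.status_code = status
#       return body
#
#   @app.get("/healthz")
#   async def healthz():
#       return {"status": "alive"}  # no DB call
```

Keeping the ping behind a callable lets the readiness test suite substitute a failing or slow stub to cover the unhealthy-DB case without a real database outage.
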

---

## Failure Handling Guidance

- Handle transient disconnects with pool invalidation/reconnect semantics.
- Keep one failed request from cascading into broad app instability.
- Capture and log contextual DB errors with enough metadata for debugging.

---

## Anti-Patterns

- No readiness check for DB-dependent services.
- Permanent debug SQL echo in production.
- Per-handler ad hoc pool settings.
- Assuming disconnect events are too rare to test.

---

## Operational Checks

- Engine creation is centralized and configured once.
- Liveness/readiness behavior is documented and validated.
- Pool settings are explicit, versioned, and reviewed.
- DB-related errors produce actionable logs.

---

## Testing Checks

- Readiness endpoint test covers healthy and unhealthy DB states.
- Integration test simulates disconnect/reconnect behavior.
- Load/concurrency tests validate pool behavior under stress.

---

## Migration Notes

- Start with resilient defaults (`pool_pre_ping`) and simple health policy.
- Add deeper metrics/event hooks incrementally once baseline reliability is in place.
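
To illustrate the guidance on contextual DB error logs and request correlation, here is a minimal sketch using only the standard library. Every name in it (`request_id_var`, `RequestContextFilter`, `JsonFormatter`, `log_db_error`, and the `db_operation` / `error_type` fields) is a hypothetical choice for illustration; a real service would likely set the request id in middleware and may prefer an existing structured-logging library.

```python
import contextvars
import json
import logging

# Hypothetical per-request correlation id; request middleware would set it.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestContextFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        # Optional DB context passed through `extra=` (assumed field names).
        for key in ("db_operation", "error_type"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def log_db_error(logger: logging.Logger, operation: str, exc: BaseException) -> None:
    """Log a DB failure with enough context to debug it later."""
    logger.error(
        "database operation failed: %s",
        exc,
        extra={"db_operation": operation, "error_type": type(exc).__name__},
    )
```

A handler configured with `handler.addFilter(RequestContextFilter())` and `handler.setFormatter(JsonFormatter())` would then emit one JSON line per DB error, carrying the request id that was active when the error occurred.
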