DB Observability and Resilience
!!! info "Primary sources"
    - SQLAlchemy pooling
    - SQLAlchemy engine configuration
    - SQLAlchemy events
    - FastAPI lifespan events

??? abstract "Decision metadata"
    - Status: adopted
    - Decision level: mandatory
    - Applies to: api-runtime, workers, tests
    - Last reviewed: 2026-06-17
Purpose
Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.
Goals:
- detect and recover from stale/disconnected connections,
- expose useful diagnostics for pool/engine behavior,
- make readiness/liveness signals meaningful.
Scope and Non-Goals
- In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
- Out of scope: full APM stack design and vendor-specific monitoring platform setup.
Rules
- Enable a connection liveness strategy (`pool_pre_ping=True`) for long-running services.
- Keep DB health checks out of liveness; include dependency checks in readiness.
- Centralize engine options and logging configuration.
- Avoid noisy SQL debug logging in production defaults.
- Treat disconnect handling as a first-class test scenario.
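One way to keep engine options centralized is a single immutable options object that every entry point reads from. The sketch below is illustrative only: `EngineOptions` and `as_kwargs` are hypothetical names, and the defaults simply mirror the rules above (pre-ping on, SQL echo off).

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class EngineOptions:
    """Hypothetical single source of truth for engine keyword arguments."""

    pool_pre_ping: bool = True  # liveness check on connection checkout
    echo: bool = False  # SQL echo stays off unless explicitly enabled
    # Left unset so the library defaults apply until measurements justify tuning.
    pool_size: Optional[int] = None
    max_overflow: Optional[int] = None
    pool_timeout: Optional[float] = None
    pool_recycle: Optional[int] = None

    def as_kwargs(self) -> dict[str, Any]:
        """Return only the options that were set, ready to pass as **kwargs."""
        return {key: value for key, value in asdict(self).items() if value is not None}
```

Under these assumptions, the engine would be created once with something like `create_async_engine(settings.database_url, **EngineOptions().as_kwargs())`, and request handlers would never pass pool settings of their own.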
Recommended Baseline
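```python
from sqlalchemy.ext.asyncio import create_async_engine

# `settings` is the application's configuration object, defined elsewhere.
engine = create_async_engine(
    settings.database_url,
    pool_pre_ping=True,  # test each connection on checkout; replace it if stale
    # Tune only from measured behavior:
    # pool_size=10,
    # max_overflow=20,
    # pool_timeout=30,
    # pool_recycle=1800,
)
```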
Operational guidance:
- `pool_pre_ping=True` for stale-connection resilience.
- Introduce `pool_recycle` where backend/network idle timeout behavior warrants it.
- Use structured app logs with request correlation and error context.
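The structured-logging guidance can be sketched with the standard library alone. This is a minimal illustration, not a prescribed setup: `request_id_var`, `RequestIdFilter`, and `JsonFormatter` are hypothetical names, and it assumes some middleware sets the request ID at the start of each request.

```python
import contextvars
import json
import logging

# Assumed to be set per request by middleware; "-" means "no request context".
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        # Include the exception type when the record carries error context.
        if record.exc_info and record.exc_info[0] is not None:
            payload["error_type"] = record.exc_info[0].__name__
        return json.dumps(payload)


def build_handler() -> logging.Handler:
    """Create a stream handler that emits correlated JSON log lines."""
    handler = logging.StreamHandler()
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(JsonFormatter())
    return handler
```

Because the request ID lives in a context variable, it follows each asyncio task, so DB errors logged deep in a call stack can still be tied back to the originating request.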
Health Endpoint Policy
- `/healthz`: process is alive; no DB call required.
- `/readyz`: application can currently serve traffic; include DB connectivity verification.
Readiness checks should be lightweight and bounded (timeouts), not heavy diagnostic queries.
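A bounded readiness check might look like the sketch below. The function names, the one-second default timeout, and the response bodies are assumptions made for illustration; the probe is injected so the timeout logic can be exercised without a database.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def check_readiness(
    probe: Callable[[], Awaitable[None]], timeout_s: float = 1.0
) -> tuple[int, dict[str, str]]:
    """Return (http_status, body) for a /readyz-style endpoint."""
    try:
        # Bound the probe so a hung database cannot hang the readiness check.
        await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, {"status": "unavailable", "reason": "db probe timed out"}
    except Exception as exc:
        # Report only the error type; details belong in logs, not the response.
        return 503, {"status": "unavailable", "reason": type(exc).__name__}
    return 200, {"status": "ready"}


def make_db_probe(engine: Any) -> Callable[[], Awaitable[None]]:
    """Build a lightweight probe that runs SELECT 1 on an async engine."""
    # Imported lazily so this sketch stays importable without SQLAlchemy.
    from sqlalchemy import text

    async def probe() -> None:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))

    return probe
```

A `/readyz` handler could call `check_readiness(make_db_probe(engine))` and return the resulting status code, while `/healthz` returns 200 without touching the database.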
Failure Handling Guidance
- Handle transient disconnects by letting the pool invalidate the broken connection and reconnect, rather than reusing it.
- Keep one failed request from cascading into broad app instability.
- Capture and log contextual DB errors with enough metadata for debugging.
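`pool_pre_ping` only catches connections that are already stale at checkout; a disconnect in the middle of an operation still surfaces as an error. A bounded retry for that case could be sketched as follows. `run_with_reconnect` and `is_invalidated_disconnect` are hypothetical helpers; the default check relies on the `connection_invalidated` attribute that SQLAlchemy sets on `DBAPIError` when it detects a disconnect, and the sketch assumes the wrapped operation is a whole unit of work that is safe to repeat.

```python
import asyncio
import logging
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("app.db")


def is_invalidated_disconnect(exc: BaseException) -> bool:
    """True when the error reports that its connection was invalidated."""
    return bool(getattr(exc, "connection_invalidated", False))


async def run_with_reconnect(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 2,
    is_transient: Callable[[BaseException], bool] = is_invalidated_disconnect,
) -> T:
    """Run `operation`, retrying only on transient disconnects."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception as exc:
            # Re-raise non-transient errors, and give up after the last attempt.
            if attempt == attempts or not is_transient(exc):
                raise
            logger.warning(
                "transient DB disconnect; retrying",
                extra={"attempt": attempt, "error_type": type(exc).__name__},
            )
    raise AssertionError("unreachable")
```

Keeping the attempt count small limits the blast radius: one failing request retries briefly and then fails with context, instead of piling retries onto a database that is already unhealthy.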
Anti-Patterns
- No readiness check for DB-dependent services.
- Permanent debug SQL echo in production.
- Per-handler ad hoc pool settings.
- Assuming disconnect events are too rare to test.
Operational Checks
- Engine creation is centralized and configured once.
- Liveness/readiness behavior is documented and validated.
- Pool settings are explicit, versioned, and reviewed.
- DB-related errors produce actionable logs.
Testing Checks
- Readiness endpoint test covers healthy and unhealthy DB states.
- Integration test simulates disconnect/reconnect behavior.
- Load/concurrency tests validate pool behavior under stress.
Migration Notes
- Start with resilient defaults (`pool_pre_ping`) and a simple health policy.
- Add deeper metrics/event hooks incrementally once baseline reliability is in place.
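When the time comes to add event hooks, a small counter attached to SQLAlchemy's pool events is one possible starting point. The sketch below is an assumption-laden illustration: `PoolStats` and `attach_pool_stats` are hypothetical names, the listener signatures follow SQLAlchemy's documented `connect`, `checkout`, `checkin`, and `invalidate` pool events, and `in_use` is only an approximation of checked-out connections.

```python
from collections import Counter
from typing import Any


class PoolStats:
    """Count pool events so they can be exported as metrics or logged."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_connect(self, dbapi_connection: Any, connection_record: Any) -> None:
        self.counts["connect"] += 1

    def on_checkout(
        self, dbapi_connection: Any, connection_record: Any, connection_proxy: Any
    ) -> None:
        self.counts["checkout"] += 1

    def on_checkin(self, dbapi_connection: Any, connection_record: Any) -> None:
        self.counts["checkin"] += 1

    def on_invalidate(
        self, dbapi_connection: Any, connection_record: Any, exception: Any
    ) -> None:
        self.counts["invalidate"] += 1

    @property
    def in_use(self) -> int:
        """Approximate number of connections currently checked out."""
        return self.counts["checkout"] - self.counts["checkin"]


def attach_pool_stats(async_engine: Any, stats: PoolStats) -> None:
    """Register the counters on an AsyncEngine's underlying sync engine."""
    # Imported lazily so this sketch stays importable without SQLAlchemy.
    from sqlalchemy import event

    target = async_engine.sync_engine  # events attach to the sync engine
    event.listen(target, "connect", stats.on_connect)
    event.listen(target, "checkout", stats.on_checkout)
    event.listen(target, "checkin", stats.on_checkin)
    event.listen(target, "invalidate", stats.on_invalidate)
```

A rising `invalidate` count alongside steady traffic is the kind of signal that would justify revisiting `pool_recycle` or the network idle timeouts.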