# DB Observability and Resilience

!!! info "Primary sources"

    - [SQLAlchemy pooling](https://docs.sqlalchemy.org/en/21/core/pooling.html)
    - [SQLAlchemy engine configuration](https://docs.sqlalchemy.org/en/21/core/engines.html)
    - [SQLAlchemy events](https://docs.sqlalchemy.org/en/21/core/events.html)
    - [FastAPI lifespan events](https://fastapi.tiangolo.com/advanced/events/)

??? abstract "Decision metadata"

    - Status: adopted
    - Decision level: mandatory
    - Applies to: api-runtime, workers, tests
    - Last reviewed: 2026-06-17

---

## Purpose

Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.

Goals:

- detect and recover from stale/disconnected connections,
- expose useful diagnostics for pool/engine behavior,
- make readiness/liveness signals meaningful.

---

## Scope and Non-Goals

- In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
- Out of scope: full APM stack design and vendor-specific monitoring platform setup.

---

## Rules

- Enable a connection liveness strategy (`pool_pre_ping=True`) for long-running services.
- Keep DB checks out of liveness probes; verify DB connectivity in readiness probes.
- Centralize engine options and logging configuration.
- Avoid noisy SQL debug logging in production defaults.
- Treat disconnect handling as a first-class test scenario.

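
One way to centralize the logging half of these rules is a single function called once at startup. This is a minimal sketch, not a prescribed implementation: `configure_db_logging` and its `debug_sql` flag are hypothetical names, while `sqlalchemy.engine` and `sqlalchemy.pool` are the logger names SQLAlchemy documents. Production defaults stay quiet, and SQL logging becomes an explicit opt-in.

```python
import logging


def configure_db_logging(debug_sql: bool = False) -> None:
    """Set SQLAlchemy logger levels in one place (hypothetical helper)."""
    # INFO on sqlalchemy.engine logs SQL statements; WARNING keeps production quiet.
    engine_level = logging.INFO if debug_sql else logging.WARNING
    # DEBUG on sqlalchemy.pool logs every checkout/checkin; WARNING hides routine pool chatter.
    pool_level = logging.DEBUG if debug_sql else logging.WARNING
    logging.getLogger("sqlalchemy.engine").setLevel(engine_level)
    logging.getLogger("sqlalchemy.pool").setLevel(pool_level)
```
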
---

## Recommended Baseline

```python
from sqlalchemy.ext.asyncio import create_async_engine

# `settings` is the application's settings object; create the engine once at startup.
engine = create_async_engine(
    settings.database_url,
    # Test each connection at checkout and transparently replace stale ones.
    pool_pre_ping=True,
    # Tune only from measured behavior:
    # pool_size=10,
    # max_overflow=20,
    # pool_timeout=30,
    # pool_recycle=1800,
)
```

Operational guidance:

- Set `pool_pre_ping=True` so stale connections are detected and replaced at checkout.
- Set `pool_recycle` below the shortest backend or network idle timeout when idle connections are being dropped.
- Use structured app logs with request correlation and error context.

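
The structured-logging point above could be sketched with the standard library alone. Everything named here (`request_id_var`, `RequestIdFilter`, `JsonFormatter`) is an illustrative assumption, not part of FastAPI or SQLAlchemy; the idea is that a middleware sets the request ID in a context variable and every log record, including DB error logs, picks it up.

```python
import contextvars
import json
import logging

# Hypothetical context variable; a request middleware would set it per request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        if record.exc_info:
            # Include the traceback so DB errors keep their context.
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```
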
---

## Health Endpoint Policy

- `/healthz`: process is alive; no DB call required.
- `/readyz`: application can currently serve traffic; include DB connectivity verification.

Readiness checks should be lightweight and bounded by a timeout, not heavy diagnostic queries.

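
A bounded readiness check might look like the sketch below. `check_readiness`, `ping`, and `timeout_s` are hypothetical names; the helper is framework-agnostic so it can be tested without a database. In a FastAPI + SQLAlchemy app, `ping` would typically open a connection from the shared engine and run `SELECT 1`, and the `/readyz` handler would return the resulting status code and body.

```python
import asyncio
from typing import Awaitable, Callable


async def check_readiness(
    ping: Callable[[], Awaitable[None]],
    timeout_s: float = 1.0,
) -> tuple[int, dict]:
    """Return (http_status, body) for a /readyz handler (hypothetical helper)."""
    try:
        # Bound the dependency check so a hung DB cannot hang the probe.
        await asyncio.wait_for(ping(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, {"status": "unavailable", "reason": "db ping timed out"}
    except Exception as exc:
        # Report only the error class; full details belong in logs.
        return 503, {"status": "unavailable", "reason": type(exc).__name__}
    return 200, {"status": "ready"}


# Possible wiring, assuming a module-level AsyncEngine named `engine`:
#
# async def ping() -> None:
#     async with engine.connect() as conn:
#         await conn.execute(text("SELECT 1"))
```
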
---

## Failure Handling Guidance

- Handle transient disconnects using the pool's invalidation and reconnect semantics.
- Keep one failed request from cascading into broad app instability.
- Log DB errors with enough context (operation, request ID, error class) to debug them.

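
Pre-ping only covers connections that are already stale at checkout; a disconnect in the middle of an operation still surfaces as an error. One possible way to absorb it is a small bounded retry, sketched below under stated assumptions: `run_with_reconnect` and `is_disconnect` are hypothetical helpers, the operation must be safe to repeat, and each attempt must acquire a fresh session or connection. The check relies on the `connection_invalidated` attribute that SQLAlchemy sets on `DBAPIError` when it detects a disconnect.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def is_disconnect(exc: BaseException) -> bool:
    """True when the error marks the connection as invalidated."""
    # Duck-typed so the helper does not need to import SQLAlchemy.
    return bool(getattr(exc, "connection_invalidated", False))


async def run_with_reconnect(
    op: Callable[[], Awaitable[T]],
    retries: int = 1,
    backoff_s: float = 0.05,
) -> T:
    """Run `op`, retrying a bounded number of times on disconnects only."""
    attempt = 0
    while True:
        try:
            return await op()
        except Exception as exc:
            # Non-disconnect errors and exhausted retries propagate unchanged.
            if attempt >= retries or not is_disconnect(exc):
                raise
            attempt += 1
            await asyncio.sleep(backoff_s * attempt)
```
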
---

## Anti-Patterns

- No readiness check for DB-dependent services.
- Permanent debug SQL echo in production.
- Per-handler ad hoc pool settings.
- Assuming disconnect events are too rare to test.

---

## Operational Checks

- Engine creation is centralized and configured once.
- Liveness/readiness behavior is documented and validated.
- Pool settings are explicit, versioned, and reviewed.
- DB-related errors produce actionable logs.

---

## Testing Checks

- Readiness endpoint test covers healthy and unhealthy DB states.
- Integration test simulates disconnect/reconnect behavior.
- Load/concurrency tests validate pool behavior under stress.

---

## Migration Notes

- Start with resilient defaults (`pool_pre_ping`) and simple health policy.
- Add deeper metrics/event hooks incrementally once baseline reliability is in place.
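
As an illustration of what a first event hook could look like, the sketch below counts pool checkouts and checkins with SQLAlchemy's `checkout` and `checkin` pool events. It is an assumption-laden example rather than a recommended metrics design: it uses a synchronous in-memory SQLite engine to stay self-contained, and `pool_stats` is a hypothetical stand-in for a real metrics client. With an async engine, the listeners would be attached to `engine.sync_engine` instead.

```python
from sqlalchemy import create_engine, event, text
from sqlalchemy.pool import QueuePool

# In-memory SQLite keeps the sketch self-contained; a real service uses its own URL.
engine = create_engine("sqlite://", poolclass=QueuePool)

# Hypothetical in-process counters; replace with a metrics client in practice.
pool_stats = {"checkouts_total": 0, "checked_out": 0}


@event.listens_for(engine, "checkout")
def _on_checkout(dbapi_connection, connection_record, connection_proxy):
    # Fired each time a connection is handed out by the pool.
    pool_stats["checkouts_total"] += 1
    pool_stats["checked_out"] += 1


@event.listens_for(engine, "checkin")
def _on_checkin(dbapi_connection, connection_record):
    # Fired each time a connection is returned to the pool.
    pool_stats["checked_out"] -= 1
```
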