move

2026-06-18 22:06:40 -05:00
parent 6c5fda9c3a
commit e78383be1f
24 changed files with 0 additions and 0 deletions
@@ -0,0 +1,113 @@
+# DB Observability and Resilience
+
+Sources:
+- https://docs.sqlalchemy.org/en/21/core/pooling.html
+- https://docs.sqlalchemy.org/en/21/core/engines.html
+- https://docs.sqlalchemy.org/en/21/core/events.html
+- https://fastapi.tiangolo.com/advanced/events/
+
+Status: adopted
+Decision level: mandatory
+Applies to: api-runtime, workers, tests
+Last reviewed: 2026-06-17
+
+---
+
+## Purpose
+
+Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.
+
+Goals:
+
+- detect and recover from stale/disconnected connections,
+- expose useful diagnostics for pool/engine behavior,
+- make readiness/liveness signals meaningful.
+
+---
+
+## Scope and Non-Goals
+
+- In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
+- Out of scope: full APM stack design and vendor-specific monitoring platform setup.
+
+---
+
+## Rules
+
+- Enable a connection liveness strategy (`pool_pre_ping=True`) for long-running services.
+- Keep DB health checks out of liveness; include dependency checks in readiness.
+- Centralize engine options and logging configuration.
+- Avoid noisy SQL debug logging in production defaults.
+- Treat disconnect handling as a first-class test scenario.
+
+---
+
+## Recommended Baseline
+
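+```python
+from sqlalchemy.ext.asyncio import create_async_engine
+
+# `settings` is the application's central config object
+# (assumed here to expose `database_url`).
+engine = create_async_engine(
+    settings.database_url,
+    pool_pre_ping=True,  # test each connection at checkout; replace it if stale
+    # Tune only from measured behavior:
+    # pool_size=10,
+    # max_overflow=20,
+    # pool_timeout=30,
+    # pool_recycle=1800,
+)
+```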
+
+Operational guidance:
+
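+- Set `pool_pre_ping=True` so stale connections are detected at checkout and replaced before use.
+- Set `pool_recycle` below the shortest idle timeout enforced by the database or any intermediary (proxy, load balancer, firewall) when such timeouts exist.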
+- Use structured app logs with request correlation and error context.
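+
+The logging guidance could be centralized along these lines. This is a minimal sketch using only the standard library; the helper name `configure_db_logging` and its `debug_sql` flag are hypothetical. `sqlalchemy.engine` and `sqlalchemy.pool` are the logger names SQLAlchemy uses for SQL and pool activity.
+
+```python
+import logging
+
+
+def configure_db_logging(debug_sql: bool = False) -> None:
+    """Set SQLAlchemy logger levels in one place (hypothetical helper)."""
+    # "sqlalchemy.engine" emits SQL statements at INFO.
+    # "sqlalchemy.pool" emits checkouts/checkins at DEBUG.
+    engine_level = logging.INFO if debug_sql else logging.WARNING
+    pool_level = logging.DEBUG if debug_sql else logging.WARNING
+    logging.getLogger("sqlalchemy.engine").setLevel(engine_level)
+    logging.getLogger("sqlalchemy.pool").setLevel(pool_level)
+
+
+# Production default: keep SQL and pool loggers quiet.
+configure_db_logging(debug_sql=False)
+```
+
+Calling this once at startup, next to engine creation, keeps verbose SQL output an explicit opt-in rather than a leftover `echo=True`.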
+
+---
+
+## Health Endpoint Policy
+
+- `/healthz`: process is alive; no DB call required.
+- `/readyz`: application can currently serve traffic; include DB connectivity verification.
+
+Readiness checks should be lightweight and time-bounded (for example, a trivial query under a short timeout), not heavy diagnostic queries.
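+
+One way to keep the readiness check bounded is sketched below. The helper `check_readiness`, its `ping` callable, and the response bodies are assumptions for illustration, not part of any library; the fake pings stand in for a real query so the sketch runs without a database.
+
+```python
+import asyncio
+from typing import Awaitable, Callable, Tuple
+
+
+async def check_readiness(
+    ping: Callable[[], Awaitable[None]],
+    timeout_s: float = 1.0,
+) -> Tuple[int, dict]:
+    """Return (HTTP status, body) for a /readyz handler (hypothetical helper)."""
+    try:
+        # Bound the check so a hung database cannot stall the probe.
+        await asyncio.wait_for(ping(), timeout=timeout_s)
+    except asyncio.TimeoutError:
+        return 503, {"status": "unavailable", "reason": "db timeout"}
+    except Exception as exc:
+        return 503, {"status": "unavailable", "reason": type(exc).__name__}
+    return 200, {"status": "ready"}
+
+
+# In the real app, `ping` would run a trivial query on the shared engine, e.g.:
+#     async def ping() -> None:
+#         async with engine.connect() as conn:
+#             await conn.execute(text("SELECT 1"))
+async def fake_ping_ok() -> None:
+    return None
+
+
+async def fake_ping_hung() -> None:
+    await asyncio.sleep(10)  # simulates a database that never answers
+
+
+print(asyncio.run(check_readiness(fake_ping_ok)))
+print(asyncio.run(check_readiness(fake_ping_hung, timeout_s=0.05)))
+```
+
+A `/readyz` route would call this helper and return the status code and body; `/healthz` would return 200 without touching it.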
+
+---
+
+## Failure Handling Guidance
+
+- Handle transient disconnects with pool invalidation/reconnect semantics.
+- Keep one failed request from cascading into broad app instability.
+- Log DB errors with enough context (operation, request correlation ID, exception type) to debug them.
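+
+The points above could be combined in a small retry wrapper. This is a sketch under stated assumptions: `run_with_reconnect`, the `is_disconnect` predicate, and the `app.db` logger name are hypothetical. With SQLAlchemy, the predicate could be `isinstance(exc, DBAPIError) and exc.connection_invalidated`; here a plain `ConnectionError` simulates the disconnect so the example runs standalone.
+
+```python
+import asyncio
+import logging
+from typing import Awaitable, Callable, TypeVar
+
+T = TypeVar("T")
+log = logging.getLogger("app.db")  # hypothetical logger name
+
+
+async def run_with_reconnect(
+    operation: Callable[[], Awaitable[T]],
+    is_disconnect: Callable[[Exception], bool],
+    retries: int = 1,
+) -> T:
+    """Re-run `operation` after a disconnect, up to `retries` extra attempts."""
+    attempt = 0
+    while True:
+        try:
+            return await operation()
+        except Exception as exc:
+            context = {"attempt": attempt, "error_type": type(exc).__name__}
+            if attempt >= retries or not is_disconnect(exc):
+                # Not retryable (or out of attempts): log context and
+                # re-raise so only this request fails.
+                log.error("db operation failed", extra=context)
+                raise
+            log.warning("db disconnect, retrying", extra=context)
+            attempt += 1
+
+
+calls = {"n": 0}
+
+
+async def flaky_query() -> str:
+    """Fails once with a simulated disconnect, then succeeds."""
+    calls["n"] += 1
+    if calls["n"] == 1:
+        raise ConnectionError("simulated dropped connection")
+    return "ok"
+
+
+result = asyncio.run(
+    run_with_reconnect(flaky_query, lambda exc: isinstance(exc, ConnectionError))
+)
+```
+
+The wrapped operation should open its own session or connection on each attempt and be safe to re-run as a whole unit of work; retrying midway through a transaction is not safe.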
+
+---
+
+## Anti-Patterns
+
+- No readiness check for DB-dependent services.
+- Permanent debug SQL echo in production.
+- Per-handler ad hoc pool settings.
+- Assuming disconnect events are too rare to test.
+
+---
+
+## Operational Checks
+
+- Engine creation is centralized and configured once.
+- Liveness/readiness behavior is documented and validated.
+- Pool settings are explicit, versioned, and reviewed.
+- DB-related errors produce actionable logs.
+
+---
+
+## Testing Checks
+
+- Readiness endpoint test covers healthy and unhealthy DB states.
+- Integration test simulates disconnect/reconnect behavior.
+- Load/concurrency tests validate pool behavior under stress.
+
+---
+
+## Migration Notes
+
+- Start with resilient defaults (`pool_pre_ping`) and simple health policy.
+- Add deeper metrics/event hooks incrementally once baseline reliability is in place.
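+
+As an illustration of the incremental event hooks mentioned above, pool checkouts and checkins can be counted with SQLAlchemy's `checkout` and `checkin` pool events. The helper `install_pool_counters` and the plain-dict counters are hypothetical. The demo uses a synchronous in-memory SQLite engine only so that it runs without a server; for an `AsyncEngine`, the listeners would be attached to `engine.sync_engine`.
+
+```python
+from sqlalchemy import create_engine, event, text
+from sqlalchemy.pool import QueuePool
+
+
+def install_pool_counters(sync_engine, counters: dict) -> None:
+    """Attach pool event listeners that count checkouts and checkins."""
+
+    @event.listens_for(sync_engine, "checkout")
+    def _on_checkout(dbapi_connection, connection_record, connection_proxy):
+        counters["checkout"] += 1
+
+    @event.listens_for(sync_engine, "checkin")
+    def _on_checkin(dbapi_connection, connection_record):
+        counters["checkin"] += 1
+
+
+counters = {"checkout": 0, "checkin": 0}
+
+# Demo engine only; the real service would pass `engine.sync_engine`.
+demo_engine = create_engine("sqlite://", poolclass=QueuePool)
+install_pool_counters(demo_engine, counters)
+
+with demo_engine.connect() as conn:
+    conn.execute(text("SELECT 1"))
+
+print(counters)
+```
+
+In a real service the counters would feed whatever metrics client is already in use; a checkout count that stays well ahead of checkins under load is one signal that connections are being held too long.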