prompts/docs/skills/fastapi-async-sqlalchemy-modernization/references/observability.md
John Lancaster 3347443ca9 formatting
2026-06-19 01:29:05 -05:00


DB Observability and Resilience

!!! info "Primary sources"
    - SQLAlchemy pooling
    - SQLAlchemy engine configuration
    - SQLAlchemy events
    - FastAPI lifespan events

??? abstract "Decision metadata"
    - Status: adopted
    - Decision level: mandatory
    - Applies to: api-runtime, workers, tests
    - Last reviewed: 2026-06-17


Purpose

Define baseline observability and resilience practices for DB connectivity in async FastAPI + SQLAlchemy apps.

Goals:

  • detect and recover from stale/disconnected connections,
  • expose useful diagnostics for pool/engine behavior,
  • make readiness/liveness signals meaningful.

Scope and Non-Goals

  • In scope: pool health, connection liveness, SQL/pool logging hygiene, readiness checks, failure handling.
  • Out of scope: full APM stack design and vendor-specific monitoring platform setup.

Rules

  • Enable a connection liveness strategy (pool_pre_ping=True) for long-running services.
  • Keep DB health checks out of liveness; include dependency checks in readiness.
  • Centralize engine options and logging configuration.
  • Avoid noisy SQL debug logging in production defaults.
  • Treat disconnect handling as a first-class test scenario.

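from sqlalchemy.ext.asyncio import create_async_engine

# `settings` is the application's configuration object, defined elsewhere.
engine = create_async_engine(
    settings.database_url,
    pool_pre_ping=True,  # test each connection on checkout; replace stale ones
    # Tune only from measured behavior:
    # pool_size=10,
    # max_overflow=20,
    # pool_timeout=30,
    # pool_recycle=1800,
)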

Operational guidance:

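  • Set pool_pre_ping=True so the pool tests each connection on checkout and replaces stale ones before the application uses them.
  • Introduce pool_recycle where the backend or network closes idle connections after a timeout; set it below that timeout.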
  • Use structured app logs with request correlation and error context.
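
The logging guidance above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed setup: `request_id_var`, `RequestIdFilter`, `JsonFormatter`, and `configure_db_logging` are hypothetical names, and the JSON field layout is an assumption. The logger names `sqlalchemy.engine` and `sqlalchemy.pool` are the ones SQLAlchemy itself logs to.

```python
import contextvars
import json
import logging

# Hypothetical context variable; request middleware would set it per request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line (assumed field layout)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", "-"),
        }
        if record.exc_info:
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_db_logging(debug_sql: bool = False) -> logging.Handler:
    """Install one structured handler and keep SQL/pool logs quiet by default."""
    handler = logging.StreamHandler()
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(JsonFormatter())
    logging.getLogger().addHandler(handler)

    # WARNING is the quiet production default; raise verbosity only on demand.
    logging.getLogger("sqlalchemy.engine").setLevel(
        logging.INFO if debug_sql else logging.WARNING
    )
    logging.getLogger("sqlalchemy.pool").setLevel(
        logging.DEBUG if debug_sql else logging.WARNING
    )
    return handler
```

Calling `configure_db_logging()` once at startup, next to engine creation, keeps logging configuration centralized in the same way as the engine options.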

Health Endpoint Policy

  • /healthz: process is alive; no DB call required.
  • /readyz: application can currently serve traffic; include DB connectivity verification.

Readiness checks should be lightweight and bounded (timeouts), not heavy diagnostic queries.
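
One way to keep a readiness check lightweight and bounded is to wrap a trivial query in a timeout. The sketch below is an assumption-laden illustration: `is_ready` and `make_db_ping` are hypothetical helpers, the one-second default is arbitrary, and `SELECT 1` is assumed to be valid on the target backend.

```python
import asyncio
from typing import Awaitable, Callable


async def is_ready(
    ping: Callable[[], Awaitable[None]], timeout_s: float = 1.0
) -> bool:
    """Return True only if `ping` completes within `timeout_s` seconds."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout_s)
    except Exception:
        # Covers timeouts as well as connection errors raised by the ping.
        return False
    return True


def make_db_ping(engine) -> Callable[[], Awaitable[None]]:
    """Build a ping for an AsyncEngine (hypothetical helper)."""
    # Imported here so the timeout logic can be tested without a database.
    from sqlalchemy import text

    async def ping() -> None:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))

    return ping
```

A `/readyz` handler could await `is_ready(make_db_ping(engine))` and respond with HTTP 503 when it returns False, while `/healthz` returns 200 without touching the database.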


Failure Handling Guidance

  • Handle transient disconnects through the pool's invalidation and reconnect semantics: discard the broken connection and acquire a fresh one.
  • Keep one failed request from cascading into broad app instability.
  • Capture and log contextual DB errors with enough metadata for debugging.
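
A bounded retry is one possible way to absorb a transient disconnect without letting it spread. The sketch below uses only the standard library; `TransientDBError` and `run_with_retry` are hypothetical names. With SQLAlchemy, the `is_transient` predicate could instead inspect the `connection_invalidated` attribute of `sqlalchemy.exc.DBAPIError`. Retrying is only safe when the operation is idempotent or re-runs a whole transaction from the start.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientDBError(Exception):
    """Hypothetical stand-in for a disconnect-class database error."""


def _default_is_transient(exc: BaseException) -> bool:
    return isinstance(exc, TransientDBError)


async def run_with_retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 2,
    delay_s: float = 0.05,
    is_transient: Callable[[BaseException], bool] = _default_is_transient,
) -> T:
    """Run `op`, retrying a bounded number of times on transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception as exc:
            # Re-raise on the last attempt or on any non-transient error.
            if attempt == attempts or not is_transient(exc):
                raise
            await asyncio.sleep(delay_s)
    raise AssertionError("unreachable")
```

Keeping `attempts` small limits how long one failing request can hold resources, which helps stop a single failure from cascading.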

Anti-Patterns

  • No readiness check for DB-dependent services.
  • Permanent debug SQL echo in production.
  • Per-handler ad hoc pool settings.
  • Assuming disconnect events are too rare to test.

Operational Checks

  • Engine creation is centralized and configured once.
  • Liveness/readiness behavior is documented and validated.
  • Pool settings are explicit, versioned, and reviewed.
  • DB-related errors produce actionable logs.

Testing Checks

  • Readiness endpoint test covers healthy and unhealthy DB states.
  • Integration test simulates disconnect/reconnect behavior.
  • Load/concurrency tests validate pool behavior under stress.

Migration Notes

  • Start with resilient defaults (pool_pre_ping) and simple health policy.
  • Add deeper metrics/event hooks incrementally once baseline reliability is in place.
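
As an illustration of an incremental event hook, the sketch below counts pool checkouts and checkins. `PoolStats` and `attach_pool_stats` are hypothetical names, and the counters are only an approximation of connections in use. It assumes the engine is an SQLAlchemy `AsyncEngine`, whose pool events are registered on its `sync_engine`.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Approximate pool usage counters (hypothetical helper)."""

    checkouts: int = 0
    checkins: int = 0

    @property
    def in_use(self) -> int:
        return self.checkouts - self.checkins

    # Handler signatures follow SQLAlchemy's "checkout" and "checkin" pool events.
    def on_checkout(self, dbapi_connection, connection_record, connection_proxy):
        self.checkouts += 1

    def on_checkin(self, dbapi_connection, connection_record):
        self.checkins += 1


def attach_pool_stats(async_engine, stats: PoolStats) -> None:
    """Register the counters on an AsyncEngine's underlying sync engine."""
    # Imported here so PoolStats can be unit-tested without SQLAlchemy installed.
    from sqlalchemy import event

    event.listen(async_engine.sync_engine, "checkout", stats.on_checkout)
    event.listen(async_engine.sync_engine, "checkin", stats.on_checkin)
```

The counters could then be exported through whatever metrics mechanism the service already uses; that wiring is left out here.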