jmealo/otel-metrics-rollout.md

OTel Metrics Rollout — Monday Checklist

Background

The prometheus_client Python library uses a threading.Lock per label combination on every .labels().observe() call. On aaa-api in staging this caused OOM crashes: memory climbed past 700 MiB and the pod was OOMKilled. The fix replaces the prometheus_client backend with OpenTelemetry, enabled through a PROMETHEUS_BACKEND=otel env var toggle in gisual-prometheus-clients.

Validated in staging since 2026-03-22: aaa-api running at 162 MiB (under 368 MiB limit), zero restarts, zero 500 errors, all metrics present in /metrics output.

Step 1: Publish Stable Packages

1a. gisual-prometheus-clients → 0.2.0

cd ~/src/gisual/backend/libraries/gisual-prometheus-clients
git checkout experiment/otel-backend

# Review the MR first:
# https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5

# Merge to main
git checkout main && git merge experiment/otel-backend

# Tag and release
poe release  # runs checks, bumps version, builds, uploads to PyPI

What changed: Added otel_client.py (OTel-backed PrometheusClient), async MetricsHandler, PROMETHEUS_BACKEND=otel env var toggle. No breaking changes — existing prometheus_client backend is the default.
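
To make the "default stays prometheus_client" behaviour concrete, here is a minimal sketch of how an env var toggle like this could be resolved. The function name select_backend, the constants, and the error on unknown values are assumptions for illustration; the actual logic in gisual-prometheus-clients may differ.

```python
import os

# Hypothetical names for this sketch; not taken from the real library.
DEFAULT_BACKEND = "prometheus_client"
KNOWN_BACKENDS = {"prometheus_client", "otel"}


def select_backend(environ=None):
    """Return the metrics backend name chosen by PROMETHEUS_BACKEND."""
    environ = os.environ if environ is None else environ
    value = environ.get("PROMETHEUS_BACKEND", "").strip().lower()
    if not value:
        # Unset or empty: keep the existing prometheus_client behaviour.
        return DEFAULT_BACKEND
    if value not in KNOWN_BACKENDS:
        # Failing loudly on a typo is an assumption, not confirmed behaviour.
        raise ValueError(f"unsupported PROMETHEUS_BACKEND: {value!r}")
    return value
```

With this shape, services that never set the variable keep their current backend, which is what makes the 0.2.0 upgrade safe to ship before any Helm change.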

1b. aaa-lib → 0.21.3

cd ~/src/gisual/backend/aaa/aaa-lib
git checkout fix/token-claims-none

# Review: one-line fix in tokens.py user_type()
# https://gitlab.com/gisual/backend/aaa/aaa-lib (branch: fix/token-claims-none)

git checkout main && git merge fix/token-claims-none
poe release

What changed: user_type() returns 'unknown' instead of crashing with AttributeError when token_claims is None.
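
For reviewers, the fix is roughly of the following shape. This is a sketch only: the real user_type() in tokens.py may take different arguments, and the "user_type" claim key used here is an assumption.

```python
from typing import Any, Mapping, Optional


def user_type(token_claims: Optional[Mapping[str, Any]]) -> str:
    """Return the user type from decoded token claims, or 'unknown'."""
    if token_claims is None:
        # Without this guard, calling .get() on None raises AttributeError.
        return "unknown"
    # The "user_type" claim key is an assumption made for this sketch.
    return str(token_claims.get("user_type", "unknown"))
```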

Step 2: Roll Out to Services

For each service below:

Per-service steps

# 1. Add OTel dependency (replace <category> with apis, consumers, or task-runners)
cd ~/src/gisual/backend/<category>/<service-name>
git checkout main && git pull
git checkout -b feature/otel-metrics

# 2. Edit pyproject.toml — add to dependencies:
#    "gisual-prometheus-clients[otel]>=0.2.0",

# 3. Update lockfile
uv lock

# 4. Commit
git add pyproject.toml uv.lock
git commit -m "feat: add OTel metrics backend

Adds gisual-prometheus-clients[otel] dependency to enable the OTel
metrics backend via PROMETHEUS_BACKEND=otel env var. Fixes memory
leak from prometheus_client threading locks.

Co-Authored-By: Robin Klingsberg <robin@gisual.com>"

# 5. Build and push Docker image
poe docker-build

# 6. Push branch
git push -u origin feature/otel-metrics

Per-service Helm change

Edit kubernetes/applications/backend/helm/<service>/environments/staging/values.yaml:

env:
  standard:
    PROMETHEUS_BACKEND: otel

Deploy to staging

cd ~/src/gisual/configuration-management/kubernetes
helm upgrade <service> applications/backend/helm/<service> \
  --namespace default \
  --values applications/backend/helm/<service>/values.yaml \
  --values applications/backend/helm/<service>/environments/staging/values.yaml

Verify

kubectl get pods -l app.kubernetes.io/name=<service>    # Running, 0 restarts
kubectl top pods -l app.kubernetes.io/name=<service>     # Memory under limit
kubectl logs -l app.kubernetes.io/name=<service> --since=60s --tail=-1 | grep '"status": 500' | wc -l  # Should be 0 (--tail=-1 is needed: with -l, kubectl shows only the last 10 lines per pod by default)
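
The grep above depends on the exact spacing of "status": 500 in the log line. If the services log one JSON object per line with a numeric status field (an assumption inferred from the grep pattern), a small parser is less brittle. count_status is a hypothetical helper, sketched here:

```python
import json


def count_status(log_lines, status=500):
    """Count JSON log lines whose "status" field equals `status`."""
    count = 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks
        if isinstance(record, dict) and record.get("status") == status:
            count += 1
    return count
```

It accepts any iterable of lines, so the output of kubectl logs can be piped into a script that calls count_status(sys.stdin).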

Step 3: Service Checklist

HTTP APIs (17 services)

AMQP Consumers (12 services)

These benefit the most from the fix: the push thread (every 15s) creates real cross-thread lock contention with the event loop.

Task Runners (12 services)

Same push thread contention as consumers.

Step 4: After All Services Verified in Staging

Promote to Production

For each service, copy the PROMETHEUS_BACKEND: otel env var to the production values file and deploy:

# In kubernetes/applications/backend/helm/<service>/environments/production/values.yaml:
# env:
#   standard:
#     PROMETHEUS_BACKEND: otel

Clean Up RC Versions

Remove from pypi.gisual.net:

gisual-prometheus-clients: 0.2.0rc1 through 0.2.0rc7, 0.2.0rc2.dev0+...
aaa-lib: 0.21.3rc1, 0.21.3rc2
gisual-runtime: 0.1.0rc2
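
When working from a full version listing for a package, the pre-release entries can be picked out mechanically instead of by eye. The sketch below uses a simplified pattern for PEP 440 pre-release and dev markers; prerelease_versions is a hypothetical helper, and how versions are listed on or removed from pypi.gisual.net is not covered here.

```python
import re

# Simplified match for PEP 440 pre-release (a/b/rc) and development (.dev) markers.
_PRERELEASE = re.compile(r"(a|b|rc)\d+|\.dev\d+")


def prerelease_versions(versions):
    """Return the versions that carry a pre-release or dev marker, in order."""
    return [v for v in versions if _PRERELEASE.search(v)]
```

Review the result against the list above before deleting anything: the pattern is a sketch and does not handle every PEP 440 form (local version labels, for example).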

Remove Docker images:

docker.gisual.net/backend/aaa/aaa-api:experiment-otel-metrics
docker.gisual.net/backend/aaa/aaa-api:experiment-gisual-runtime
docker.gisual.net/backend/aaa/aaa-api:stable-0.9.6

What NOT to Do Yet

Don't merge the Datadog removal MRs (6 libraries). That's Phase 2, after all services are on OTel metrics
Don't deploy gisual-runtime to any service — that's Phase 3
Don't remove dd_trace_enabled: false from Helm values — harmless, removal requires library changes
Don't change production for a service until it has been verified in staging

Troubleshooting

Pod crashloops after deploy:

Check logs: kubectl logs <pod> -n default --tail=20
Common issue: the opentelemetry-exporter-prometheus package is missing, which means uv lock wasn't run after adding the [otel] extra

Metrics not appearing in /metrics:

Verify PROMETHEUS_METRICS_ENABLE=true is set (check metrics.enabled: true in values.yaml)
Verify PROMETHEUS_BACKEND=otel is set: kubectl exec <pod> -- printenv PROMETHEUS_BACKEND
Metrics only appear after the code path fires — give it a few minutes of traffic
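
To compare /metrics output before and after the switch, it can help to reduce the text exposition output to a set of sample names and diff that against what is expected. A minimal sketch follows; metric_names and missing_metrics are hypothetical helpers, and the list of expected names has to come from the service itself.

```python
def metric_names(exposition_text):
    """Return the set of sample names in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        # The sample name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text, expected):
    """Return expected sample names with no sample in the output, sorted."""
    return sorted(set(expected) - metric_names(exposition_text))
```

Note that histograms show up under their sample names (with _bucket, _sum, and _count suffixes), so the expected list should use those names rather than the base metric name.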

Memory still high:

Verify the new image is actually running: kubectl get pod <pod> -o jsonpath='{.spec.containers[0].image}'
Check if PROMETHEUS_BACKEND=otel is set (not just PROMETHEUS_METRICS_ENABLE)
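
When checking many pods at once, the output of kubectl top pods can be compared against the memory limit in a script rather than by eye. The sketch below assumes the default output with a header row and Ki/Mi/Gi suffixes in the third column; pods_over_limit is a hypothetical helper, and the limit varies per service (368 MiB is the aaa-api figure from Background).

```python
_UNITS_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def memory_mib(quantity):
    """Convert a quantity such as '162Mi' to MiB."""
    for suffix, factor in _UNITS_MIB.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def pods_over_limit(top_output, limit_mib):
    """Return (pod, MiB) pairs from `kubectl top pods` output above limit_mib."""
    over = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # not a NAME / CPU / MEMORY row
        name, _cpu, memory = fields[:3]
        usage = memory_mib(memory)
        if usage > limit_mib:
            over.append((name, usage))
    return over
```

An empty result means every listed pod is at or under the limit. If kubectl is run with --no-headers, the header skip above would drop the first pod and should be removed.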

Key Links

Resource	Location
prometheus-clients MR	https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5
aaa-lib fix MR	Branch: `fix/token-claims-none` in aaa-lib
aaa-api OTel experiment	Branch: `experiment/otel-metrics` in aaa-api
Staging validation worklog	`worklogs/staging/2026-03-22.md`