The `prometheus_client` Python library uses a `threading.Lock` per label combination on every `.labels().observe()` call. This caused OOM crashes (700+ MiB → OOMKill) on aaa-api in staging. The fix replaces the `prometheus_client` backend with OpenTelemetry via a `PROMETHEUS_BACKEND=otel` env var toggle in gisual-prometheus-clients.
Validated in staging since 2026-03-22: aaa-api running at 162 MiB (under 368 MiB limit), zero restarts, zero 500 errors, all metrics present in /metrics output.
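The toggle amounts to selecting a metrics backend at startup. A minimal sketch of how such a toggle can work (class and function names here are hypothetical, not the library's actual API):

```python
import os


class PrometheusClientBackend:
    """Default backend (hypothetical name): wraps prometheus_client."""
    name = "prometheus_client"


class OtelBackend:
    """Opt-in backend (hypothetical name): wraps the OTel metrics SDK."""
    name = "otel"


def select_backend():
    # Default stays prometheus_client so existing services are unaffected;
    # a service opts in by setting PROMETHEUS_BACKEND=otel.
    if os.environ.get("PROMETHEUS_BACKEND", "").strip().lower() == "otel":
        return OtelBackend()
    return PrometheusClientBackend()
```

Because the default is unchanged, services that never set the variable keep their current behavior, which is why the library release below is non-breaking.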
```shell
cd ~/src/gisual/backend/libraries/gisual-prometheus-clients
git checkout experiment/otel-backend
# Review the MR first:
# https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5
# Merge to main
git checkout main && git merge experiment/otel-backend
# Tag and release
poe release  # runs checks, bumps version, builds, uploads to PyPI
```

What changed: Added `otel_client.py` (OTel-backed `PrometheusClient`), async `MetricsHandler`, `PROMETHEUS_BACKEND=otel` env var toggle. No breaking changes — existing `prometheus_client` backend is the default.
```shell
cd ~/src/gisual/backend/aaa/aaa-lib
git checkout fix/token-claims-none
# Review: one-line fix in tokens.py user_type()
# https://gitlab.com/gisual/backend/aaa/aaa-lib (branch: fix/token-claims-none)
git checkout main && git merge fix/token-claims-none
poe release
```

What changed: `user_type()` returns `'unknown'` instead of crashing with `AttributeError` when `token_claims` is `None`.
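The guard itself is tiny. A sketch of the fixed behavior (the signature and claim key are assumptions; see the branch for the real one-liner):

```python
def user_type(token_claims):
    # token_claims can be None (e.g. missing or unparseable token); the old
    # code raised AttributeError here. Degrade to the 'unknown' sentinel.
    if token_claims is None:
        return "unknown"
    # Hypothetical claim key for illustration; the real code may differ.
    return token_claims.get("user_type", "unknown")
```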
For each service below:
```shell
# 1. Add OTel dependency
cd ~/src/gisual/backend/{apis,consumers,task-runners}/<service-name>
git checkout main && git pull
git checkout -b feature/otel-metrics
# 2. Edit pyproject.toml — add to dependencies:
#    "gisual-prometheus-clients[otel]>=0.2.0",
# 3. Update lockfile
uv lock
# 4. Commit
git add pyproject.toml uv.lock
git commit -m "feat: add OTel metrics backend

Adds gisual-prometheus-clients[otel] dependency to enable the OTel
metrics backend via PROMETHEUS_BACKEND=otel env var. Fixes memory
leak from prometheus_client threading locks.

Co-Authored-By: Robin Klingsberg <robin@gisual.com>"
# 5. Build and push Docker image
poe docker-build
# 6. Push branch
git push -u origin feature/otel-metrics
```

Edit `kubernetes/applications/backend/helm/<service>/environments/staging/values.yaml`:
```yaml
env:
  standard:
    PROMETHEUS_BACKEND: otel
```

Then deploy from the config repo:

```shell
cd ~/src/gisual/configuration-management/kubernetes
helm upgrade <service> applications/backend/helm/<service> \
  --namespace default \
  --values applications/backend/helm/<service>/values.yaml \
  --values applications/backend/helm/<service>/environments/staging/values.yaml
```

Verify:

```shell
kubectl get pods -l app.kubernetes.io/name=<service>   # Running, 0 restarts
kubectl top pods -l app.kubernetes.io/name=<service>   # Memory under limit
kubectl logs -l app.kubernetes.io/name=<service> --since=60s | grep '"status": 500' | wc -l  # Should be 0
```

APIs:

- aaa-api (already done — deployed and validated)
- aaa-gateway
- assets-api
- data-collection-api
- decisions-api
- dependencies-api
- incidents-api
- intel-api
- intel-requests-api
- locations-api
- outage-mock-server
- outage-scans-api
- outage-validation-api
- predictions-api
- public-api
- satellite-api
- usage-api
- web-tool-api
Consumers — these benefit the MOST from the fix: the push thread (every 15s) creates real cross-thread lock contention with the event loop.
- ai-prediction-tester
- alarm-lifecycle-manager
- asset-monitor
- asset-status-updater
- cache-updater
- event-recorder
- feed-monitor
- intel-searcher
- message-amplifier
- notification-sender
- outage-updater
- regional-outage-collector
Task runners — same push thread contention as consumers.
- archive-tagger
- asset-outage-notifier
- broken-utility-detector
- current-incidents-cache-pruner
- customer-utility-updater
- feed-config-server
- load-test-notification-sender
- outage-cache-warmer
- outage-validation-task-manager
- regional-outage-feeder
- search-retry-feeder
- transformers-api
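With this many services, the staging `helm upgrade` invocation is easy to typo. A small helper that renders the command per service (paths copied from the helm invocation above; the helper itself is illustrative, not part of any repo):

```python
def staging_helm_cmd(service: str) -> str:
    """Render the staging `helm upgrade` command for one service."""
    chart = f"applications/backend/helm/{service}"
    return (
        f"helm upgrade {service} {chart}"
        f" --namespace default"
        f" --values {chart}/values.yaml"
        f" --values {chart}/environments/staging/values.yaml"
    )


# Example: render the command for one consumer.
print(staging_helm_cmd("alarm-lifecycle-manager"))
```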
For each service, copy the `PROMETHEUS_BACKEND: otel` env var to the production values file and deploy.

In `kubernetes/applications/backend/helm/<service>/environments/production/values.yaml`:

```yaml
env:
  standard:
    PROMETHEUS_BACKEND: otel
```

Remove from pypi.gisual.net:
- gisual-prometheus-clients: 0.2.0rc1 through 0.2.0rc7, 0.2.0rc2.dev0+...
- aaa-lib: 0.21.3rc1, 0.21.3rc2
- gisual-runtime: 0.1.0rc2
Remove Docker images:
- docker.gisual.net/backend/aaa/aaa-api:experiment-otel-metrics
- docker.gisual.net/backend/aaa/aaa-api:experiment-gisual-runtime
- docker.gisual.net/backend/aaa/aaa-api:stable-0.9.6
- Don't merge the datadog removal MRs (6 libraries) — that's Phase 2 after all services are on OTel metrics
- Don't deploy gisual-runtime to any service — that's Phase 3
- Don't remove `dd_trace_enabled: false` from Helm values — harmless, removal requires library changes
- Don't change production until the service is verified in staging first
Pod crashloops after deploy:
- Check logs: `kubectl logs <pod> -n default --tail=20`
- Common issue: missing `opentelemetry-exporter-prometheus` package → means `uv lock` wasn't run after adding the `[otel]` extra
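A cheap pre-deploy guard against that missing-package case: confirm the otel extra actually landed in `uv.lock` before building the image. A hypothetical helper (it relies only on `uv.lock` containing package names as plain text, which TOML lockfiles do):

```python
from pathlib import Path


def lock_has_otel_exporter(lock_path: str = "uv.lock") -> bool:
    # uv.lock is TOML; a plain substring check on the package name is
    # enough for a go/no-go guard before `poe docker-build`.
    text = Path(lock_path).read_text()
    return "opentelemetry-exporter-prometheus" in text
```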
Metrics not appearing in /metrics:
- Verify `PROMETHEUS_METRICS_ENABLE=true` is set (check `metrics.enabled: true` in values.yaml)
- Verify `PROMETHEUS_BACKEND=otel` is set: `kubectl exec <pod> -- printenv PROMETHEUS_BACKEND`
- Metrics only appear after the code path fires — give it a few minutes of traffic
Memory still high:
- Verify the new image is actually running: `kubectl get pod <pod> -o jsonpath='{.spec.containers[0].image}'`
- Check if `PROMETHEUS_BACKEND=otel` is set (not just `PROMETHEUS_METRICS_ENABLE`)
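The env checks in the three sections above follow one triage order. A sketch that classifies a pod's env dump (variable names come from the checks above; the decision order and messages are assumptions):

```python
def triage(env: dict) -> str:
    # Order matters: metrics must be enabled before the backend choice is
    # relevant, and both must be right before "wait for traffic" applies.
    if env.get("PROMETHEUS_METRICS_ENABLE", "").lower() != "true":
        return "metrics disabled: check metrics.enabled in values.yaml"
    if env.get("PROMETHEUS_BACKEND", "").lower() != "otel":
        return "still on prometheus_client: set PROMETHEUS_BACKEND=otel"
    return "config ok: give it traffic, then re-check /metrics and memory"
```

Feed it the output of `kubectl exec <pod> -- printenv` parsed into a dict.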
| Resource | URL |
|---|---|
| prometheus-clients MR | https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5 |
| aaa-lib fix MR | Branch: fix/token-claims-none in aaa-lib |
| aaa-api OTel experiment | Branch: experiment/otel-metrics in aaa-api |
| Staging validation worklog | worklogs/staging/2026-03-22.md |