@jmealo
Created March 22, 2026 19:11
OTel Metrics Rollout — Monday Checklist for Robin

Background

The prometheus_client Python library takes a threading.Lock per label combination on every .labels().observe() call. On aaa-api in staging this led to memory climbing past 700 MiB, followed by OOMKills. The fix replaces the prometheus_client backend with OpenTelemetry, selected via the PROMETHEUS_BACKEND=otel env var toggle in gisual-prometheus-clients.

Validated in staging since 2026-03-22: aaa-api running at 162 MiB (under 368 MiB limit), zero restarts, zero 500 errors, all metrics present in /metrics output.


Step 1: Publish Stable Packages

1a. gisual-prometheus-clients → 0.2.0

cd ~/src/gisual/backend/libraries/gisual-prometheus-clients
git checkout experiment/otel-backend

# Review the MR first:
# https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5

# Merge to main
git checkout main && git merge experiment/otel-backend

# Tag and release
poe release  # runs checks, bumps version, builds, uploads to PyPI

What changed: Added otel_client.py (OTel-backed PrometheusClient), async MetricsHandler, PROMETHEUS_BACKEND=otel env var toggle. No breaking changes — existing prometheus_client backend is the default.
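
To illustrate the toggle, here is a minimal sketch of how a backend could be selected from PROMETHEUS_BACKEND, with prometheus_client remaining the default. The helper name select_backend, the accepted value set, and the choice to raise on unknown values are all assumptions made for this example; the real selection logic lives in gisual-prometheus-clients and may differ.

```python
import os

# Per the note above, the existing prometheus_client backend stays the default.
DEFAULT_BACKEND = "prometheus_client"
# Assumed set of accepted values; the library may accept others.
KNOWN_BACKENDS = {"prometheus_client", "otel"}


def select_backend(environ=None):
    """Return the backend name requested via PROMETHEUS_BACKEND (hypothetical helper)."""
    env = os.environ if environ is None else environ
    value = env.get("PROMETHEUS_BACKEND", "").strip().lower()
    if not value:
        # Unset or empty: fall back to the default backend.
        return DEFAULT_BACKEND
    if value not in KNOWN_BACKENDS:
        # Assumption: fail loudly on typos instead of silently using the default.
        raise ValueError(f"unsupported PROMETHEUS_BACKEND: {value!r}")
    return value
```

With this shape, services that never set the variable keep their current behavior, which is what makes the release non-breaking.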

1b. aaa-lib → 0.21.3

cd ~/src/gisual/backend/aaa/aaa-lib
git checkout fix/token-claims-none

# Review: one-line fix in tokens.py user_type()
# https://gitlab.com/gisual/backend/aaa/aaa-lib (branch: fix/token-claims-none)

git checkout main && git merge fix/token-claims-none
poe release

What changed: user_type() returns 'unknown' instead of crashing with AttributeError when token_claims is None.
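
The guard can be sketched roughly as follows. This is not the actual code in tokens.py: the standalone function signature and the "user_type" claim key are assumptions for illustration only.

```python
from typing import Any, Mapping, Optional


def user_type(token_claims: Optional[Mapping[str, Any]]) -> str:
    """Return the user type from the claims, or 'unknown' when there are no claims."""
    if token_claims is None:
        # Calling .get() on None would raise AttributeError, so return early.
        return "unknown"
    # The "user_type" claim key is an assumption made for this sketch.
    return str(token_claims.get("user_type", "unknown"))
```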


Step 2: Roll Out to Services

For each service below:

Per-service steps

# 1. Add OTel dependency
cd ~/src/gisual/backend/{apis,consumers,task-runners}/<service-name>
git checkout main && git pull
git checkout -b feature/otel-metrics

# 2. Edit pyproject.toml — add to dependencies:
#    "gisual-prometheus-clients[otel]>=0.2.0",

# 3. Update lockfile
uv lock

# 4. Commit
git add pyproject.toml uv.lock
git commit -m "feat: add OTel metrics backend

Adds gisual-prometheus-clients[otel] dependency to enable the OTel
metrics backend via PROMETHEUS_BACKEND=otel env var. Fixes memory
leak from prometheus_client threading locks.

Co-Authored-By: Robin Klingsberg <robin@gisual.com>"

# 5. Build and push Docker image
poe docker-build

# 6. Push branch
git push -u origin feature/otel-metrics

Per-service Helm change

Edit kubernetes/applications/backend/helm/<service>/environments/staging/values.yaml:

env:
  standard:
    PROMETHEUS_BACKEND: otel

Deploy to staging

cd ~/src/gisual/configuration-management/kubernetes
helm upgrade <service> applications/backend/helm/<service> \
  --namespace default \
  --values applications/backend/helm/<service>/values.yaml \
  --values applications/backend/helm/<service>/environments/staging/values.yaml
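
Because the staging file is passed last, its keys take precedence over the base values.yaml while unrelated base keys are kept. The sketch below models that layering with a simplified recursive merge; it is not Helm's implementation (for example, it does not handle null overrides), and the LOG_LEVEL key in the base values is hypothetical.

```python
from copy import deepcopy


def merge_values(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are maps: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists, and new keys are taken from the override as-is.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of the two files passed via --values, in order.
base_values = {"env": {"standard": {"LOG_LEVEL": "info"}}, "metrics": {"enabled": True}}
staging_values = {"env": {"standard": {"PROMETHEUS_BACKEND": "otel"}}}

effective = merge_values(base_values, staging_values)
```

In this model, adding PROMETHEUS_BACKEND under env.standard in the staging file leaves the other env.standard entries from the base file untouched.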

Verify

kubectl get pods -l app.kubernetes.io/name=<service>    # Running, 0 restarts
kubectl top pods -l app.kubernetes.io/name=<service>     # Memory under limit
kubectl logs -l app.kubernetes.io/name=<service> --since=60s --tail=-1 | grep -c '"status": 500'  # Should be 0 (--tail=-1: with -l, kubectl logs shows only the last 10 lines per pod by default)
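
When checking many services, the same two checks can be scripted. The following is a minimal sketch that assumes memory values in the form printed by kubectl top (for example 162Mi) and one JSON object per log line with a numeric "status" field, as the grep pattern above suggests. Both helper names are hypothetical.

```python
import json


def memory_mib(quantity):
    """Convert a kubectl-top style memory quantity such as '162Mi' to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized memory quantity: {quantity!r}")


def count_500s(log_lines):
    """Count JSON log lines whose 'status' field equals 500; skip non-JSON lines."""
    count = 0
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines (tracebacks, startup banners) are ignored
        if isinstance(record, dict) and record.get("status") == 500:
            count += 1
    return count
```

For example, memory_mib("162Mi") < 368 expresses the aaa-api check from the Background section, and count_500s(lines) == 0 is the pass condition for the log check.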

Step 3: Service Checklist

HTTP APIs (18 services; aaa-api already done, 17 remaining)

  • aaa-api (already done — deployed and validated)
  • aaa-gateway
  • assets-api
  • data-collection-api
  • decisions-api
  • dependencies-api
  • incidents-api
  • intel-api
  • intel-requests-api
  • locations-api
  • outage-mock-server
  • outage-scans-api
  • outage-validation-api
  • predictions-api
  • public-api
  • satellite-api
  • usage-api
  • web-tool-api

AMQP Consumers (12 services)

These benefit the MOST from the fix — the push thread (every 15s) creates real cross-thread lock contention with the event loop.

  • ai-prediction-tester
  • alarm-lifecycle-manager
  • asset-monitor
  • asset-status-updater
  • cache-updater
  • event-recorder
  • feed-monitor
  • intel-searcher
  • message-amplifier
  • notification-sender
  • outage-updater
  • regional-outage-collector
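
The contention noted for consumers can be illustrated with a generic sketch: when a background thread holds a threading.Lock, a coroutine that takes the same lock with a blocking acquire stalls the whole event loop until the lock is released. This is an illustration only, not a reproduction of prometheus_client internals or of the push thread; all names and the 0.3 s hold time are assumptions.

```python
import asyncio
import threading
import time

lock = threading.Lock()  # stands in for a lock shared by the push thread and the event loop


def push_thread(holding, hold_seconds):
    """Hold the lock for a while, as a background push might while reading metrics."""
    with lock:
        holding.set()  # signal that the lock is now held
        time.sleep(hold_seconds)


async def observe():
    """Take the lock from inside the event loop and return how long that blocked."""
    start = time.monotonic()
    with lock:  # blocking acquire: no other coroutine can run while this waits
        pass
    return time.monotonic() - start


def measure_stall(hold_seconds=0.3):
    """Return the time the event loop spent blocked behind the background thread."""
    holding = threading.Event()
    worker = threading.Thread(target=push_thread, args=(holding, hold_seconds))
    worker.start()
    holding.wait()  # make sure the thread owns the lock before the loop starts
    stalled = asyncio.run(observe())
    worker.join()
    return stalled
```

Calling measure_stall() returns a value close to the hold time, which is time during which the loop could not process any messages.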

Task Runners (12 services)

Same push thread contention as consumers.

  • archive-tagger
  • asset-outage-notifier
  • broken-utility-detector
  • current-incidents-cache-pruner
  • customer-utility-updater
  • feed-config-server
  • load-test-notification-sender
  • outage-cache-warmer
  • outage-validation-task-manager
  • regional-outage-feeder
  • search-retry-feeder
  • transformers-api

Step 4: After All Services Verified in Staging

Promote to Production

For each service, copy the PROMETHEUS_BACKEND: otel env var to the production values file and deploy:

# In kubernetes/applications/backend/helm/<service>/environments/production/values.yaml:
# env:
#   standard:
#     PROMETHEUS_BACKEND: otel

Clean Up RC Versions

Remove from pypi.gisual.net:

  • gisual-prometheus-clients: 0.2.0rc1 through 0.2.0rc7, 0.2.0rc2.dev0+...
  • aaa-lib: 0.21.3rc1, 0.21.3rc2
  • gisual-runtime: 0.1.0rc2

Remove Docker images:

  • docker.gisual.net/backend/aaa/aaa-api:experiment-otel-metrics
  • docker.gisual.net/backend/aaa/aaa-api:experiment-gisual-runtime
  • docker.gisual.net/backend/aaa/aaa-api:stable-0.9.6

What NOT to Do Yet

  • Don't merge the datadog removal MRs (6 libraries) — that's Phase 2 after all services are on OTel metrics
  • Don't deploy gisual-runtime to any service — that's Phase 3
  • Don't remove dd_trace_enabled: false from Helm values — harmless, removal requires library changes
  • Don't change production for a service until that service has been verified in staging

Troubleshooting

Pod crashloops after deploy:

  • Check logs: kubectl logs <pod> -n default --tail=20
  • Common issue: the opentelemetry-exporter-prometheus package is missing, which means uv lock wasn't run after adding the [otel] extra

Metrics not appearing in /metrics:

  • Verify PROMETHEUS_METRICS_ENABLE=true is set (check metrics.enabled: true in values.yaml)
  • Verify PROMETHEUS_BACKEND=otel is set: kubectl exec <pod> -- printenv PROMETHEUS_BACKEND
  • Metrics only appear after the code path fires — give it a few minutes of traffic
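
To compare /metrics output before and after the switch, the sample names can be extracted from the Prometheus text format and diffed against an expected list. This is a minimal sketch; the helper names and the metric names in the usage note are hypothetical, and the expected list would come from each service's own instrumentation.

```python
import re

# A sample line starts with the metric name, optionally followed by {labels}, then the value.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s")


def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comment lines
        match = _SAMPLE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected names that do not appear in the output, sorted."""
    return sorted(set(expected) - metric_names(exposition_text))
```

For example, missing_metrics(body, ["http_requests_total"]) returns an empty list once that counter has been recorded at least once. Note that histograms appear under their _bucket, _sum, and _count sample names, so expected lists should use those suffixes.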

Memory still high:

  • Verify the new image is actually running: kubectl get pod <pod> -o jsonpath='{.spec.containers[0].image}'
  • Check if PROMETHEUS_BACKEND=otel is set (not just PROMETHEUS_METRICS_ENABLE)

Key Links

| Resource | URL |
|---|---|
| prometheus-clients MR | https://gitlab.com/gisual/backend/libraries/gisual-prometheus-clients/-/merge_requests/5 |
| aaa-lib fix MR | Branch: fix/token-claims-none in aaa-lib |
| aaa-api OTel experiment | Branch: experiment/otel-metrics in aaa-api |
| Staging validation worklog | worklogs/staging/2026-03-22.md |