@wcatz
Last active March 28, 2026 12:15
Demeter Dashboard Standardization Proposal — helmfile label → Prometheus → Grafana variable chain

Demeter Dashboard Standardization Proposal

Problem

The current Demeter dashboards have scaling and usability issues:

  1. Hardcoded namespace filtering — every dashboard query uses {namespace="ext-nodes-m1"} or {namespace="ftr-ogmios-v0"}. Adding a new cluster or namespace means updating every dashboard and alert manually.

  2. Inconsistent query patterns — the node dashboard filters by namespace + pod pattern, ogmios by pod pattern only, kupo by generic pod label. There's no unified way to drill down.

  3. Manual variable lists — the Network and Namespace dropdowns are hardcoded custom variables, not auto-discovered from Prometheus. When a new network or namespace is added, someone has to update dashboards.

  4. No cross-cluster comparison — with two datasources (GKE Prometheus + Grafana Cloud), there's no way to compare the same service across clusters without building separate panels.

  5. O(n × m) alert scaling — each new namespace × service combination requires duplicated alert rules.

Proposal: Helmfile Label → Prometheus Label → Dashboard Variable

Core Idea

Use a consistent set of pod labels defined in helmfile values, propagated through PodMonitor podTargetLabels into Prometheus metrics, and auto-discovered by Grafana dashboard variables. One source of truth — the helmfile release values — drives everything downstream.

Standard Label Set

These labels are added alongside existing labels — nothing is removed or replaced.

| Label | Purpose | Example Values | Source |
|---|---|---|---|
| app | Application type | cardano-node, ogmios, tx-submit-api, kupo, bursa | extraPodLabels in helmfile values |
| network | Cardano network | mainnet, preprod, preview, prime-mainnet, prime-testnet | extraPodLabels (already exists) |
| alias | Human-friendly node ID | cn.m.bp.az1, og.pv.az1, tx.m.az1 | extraPodLabels (new) |
| az | Availability zone | az1, az2, us-central1-a | extraPodLabels (new) |
| group | Node role/group | core, relay, bp | extraPodLabels (new) |

The alias naming convention is {app_short}.{network_short}.{role}.{az} — e.g., cn.m.bp.az1 = cardano-node, mainnet, block producer, az1. The {role} segment appears only for services that need it; see the per-service patterns below.
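As a sketch, the convention can be expressed as a small helper. The short-code maps below are assumptions inferred from the examples in this proposal (cn/og/tx/ku, m/pp/pv), not a definitive spec:

```python
# Hypothetical helper illustrating the alias convention
# {app_short}.{network_short}.{role}.{az}. The short-code maps are
# inferred from the proposal's examples, not defined by it.
APP_SHORT = {"cardano-node": "cn", "ogmios": "og",
             "tx-submit-api": "tx", "kupo": "ku", "bursa": "bursa"}
NET_SHORT = {"mainnet": "m", "preprod": "pp", "preview": "pv"}

def make_alias(app: str, network: str, az: str = "", role: str = "") -> str:
    """Build an alias like cn.m.bp.az1; role and az are optional segments."""
    parts = [APP_SHORT[app], NET_SHORT[network]]
    if role:
        parts.append(role)
    if az:
        parts.append(az)
    return ".".join(parts)

print(make_alias("cardano-node", "mainnet", "az1", role="bp"))  # cn.m.bp.az1
print(make_alias("ogmios", "preview", "az1"))                   # og.pv.az1
print(make_alias("bursa", "mainnet"))                           # bursa.m
```

Keeping the mapping in one place (helmfile values or a small script like this) avoids drift between services as new networks are added.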

How It Flows

helmfile values (extraPodLabels)
    ↓
Pod labels on running containers
    ↓
PodMonitor podTargetLabels copies them to Prometheus
    ↓
Dashboard variables auto-discover via label_values()
    ↓
All panels filter with {network="$network", app="$app", alias=~"$alias"}

Dashboard Variable Chain

# 1. Network — auto-discovered from metrics
network:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int, network)

# 2. App — filtered by selected network
app:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network"}, app)

# 3. Node — filtered by network + app, multi-select
alias:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network", app="$app"}, alias)
  includeAll: true
  multi: true

Every panel query then uses:

cardano_node_metrics_blockNum_int{network="$network", app="$app", alias=~"$alias"}

No hardcoded namespaces. No pod name pattern matching. Adding a new network or az is just a helmfile values change — dashboards auto-discover it.
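For instance, a per-node block-height panel could aggregate on the alias label (a sketch — the max by (alias) aggregation is illustrative, not mandated by the proposal):

```
max by (alias) (cardano_node_metrics_blockNum_int{network="$network", app="$app", alias=~"$alias"})
```

With multi-select enabled on alias, this yields one series per selected node, each legend entry a readable short name like cn.m.bp.az1.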

What Needs to Change in Blink Charts

I've already opened PRs for the helm chart changes needed:

| Chart | PR | Change | Status |
|---|---|---|---|
| tx-submit-api | #355 | Add PodMonitor template, metrics port, extraPodLabels | Review ready |
| ogmios | #356 | Add podTargetLabels to PodMonitor | Review ready |
| kupo | #357 | Add podTargetLabels to PodMonitor | Review ready |
| balius | #358 | Add podTargetLabels + configurable podMetricsEndpoints | Review ready |
| bursa | #359 | Add PodMonitor template, extraPodLabels | Review ready |

cardano-node and dingo charts already have full support (extraPodLabels + podTargetLabels + PodMonitor).

What Needs to Change in Blink Infrastructure

Additive changes only — all existing labels (network, node-version, role, salt, cardano.demeter.run/network) remain untouched. We add four new labels: alias, app, az, and group.

Current:

cardano_node_mainnet:
  extraPodLabels:
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet

Proposed (existing labels preserved, new labels added):

cardano_node_mainnet:
  extraPodLabels:
    # Existing labels — unchanged
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet
    # New labels for dashboard targeting
    alias: cn.m.az1
    app: cardano-node
    group: core
    az: az1

The same pattern applies to ogmios, kupo, tx-submit-api, and the rest. Each service gets its own alias prefix:

| Service | Alias Pattern | Examples |
|---|---|---|
| cardano-node | cn.{net}.{role}.{az} | cn.m.bp.az1, cn.pv.relay.az2 |
| ogmios | og.{net}.{az} | og.m.az1, og.pp.az1 |
| tx-submit-api | tx.{net}.{az} | tx.m.az1, tx.pv.az1 |
| kupo | ku.{net}.{az} | ku.m.az1, ku.pp.az1 |
| bursa | bursa.{net} | bursa.m, bursa.pv |

Then in PodMonitor configs, add podTargetLabels to propagate these labels into Prometheus:

podMonitor:
  enabled: true
  podTargetLabels:
    - alias
    - app
    - az
    - group
    - network

This tells Prometheus to copy the pod labels into the scraped metric labels, making them available for label_values() queries in Grafana.
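Concretely, a scraped series would then carry the pod labels as metric labels, roughly like this (the namespace and pod values here are illustrative):

```
cardano_node_metrics_blockNum_int{
  alias="cn.m.bp.az1", app="cardano-node", az="az1", group="core",
  network="mainnet", namespace="ext-nodes-m1", pod="cardano-node-0"
}
```

Every label in the standard set is now queryable, so label_values() variable chains and alert matchers work without touching namespace or pod names.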

Two-Datasource Dashboard Architecture

For Demeter dashboards that pull from two Prometheus instances (GKE cluster + Grafana Cloud), the dashboard-generator tool supports this natively:

datasources:
  demeter:
    type: prometheus
    uid: grafanacloud-prom
    url: https://blinklabsio.grafana.net/api/datasources/proxy/uid/grafanacloud-prom
    token: $BLINKLABS_GRAFANA_SA_TOKEN
  k3s:
    type: prometheus
    uid: prometheus
    token: $GRAFANA_TOKEN

Once both clusters use the same label conventions, a single dashboard config generates panels that work against either datasource — same queries, same variable chains, same filtering. Comparison panels can overlay metrics from both clusters side by side.
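As a sketch, a block-height lag panel could subtract one cluster's series from the other's — this assumes both series land in a single datasource (e.g., via remote_write) and that each Prometheus attaches a distinguishing external label, called cluster here, which is not part of this proposal:

```
  max by (alias) (cardano_node_metrics_blockNum_int{cluster="gke", network="$network"})
- on (alias)
  max by (alias) (cardano_node_metrics_blockNum_int{cluster="k3s", network="$network"})
```

Where the series stay in separate datasources, Grafana's mixed datasource can overlay the two queries on one panel instead (binary ops can't span datasources).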

Dashboard Suite

With standardized labels, I can generate a linked dashboard suite using dashboard-generator:

| Dashboard | Content | Apps |
|---|---|---|
| Overview | Block height, epoch, sync status, peer counts, mempool | cardano-node, dingo |
| Block Production | Forging, adoption, leadership, latency histograms | cardano-node, dingo |
| Peer Health | Hot/warm/cold peers, connections, chainsync clients | cardano-node, dingo |
| Mempool | TX pool depth, evictions, CBOR cache hit ratios | cardano-node, dingo |
| Resources | CPU, memory, GC, goroutines, FDs | dingo, ogmios, bursa |
| Ogmios | Sync %, connections, messages, sessions, heap | ogmios |
| TX Submit | Submissions, failures, request latency | tx-submit-api |

All dashboards share the same variable chain (network → app → alias) and nav links. Switching from my k3s cluster to Demeter's GKE cluster is just changing the datasource — same labels, same queries.

Benefits

| Current State | Proposed State |
|---|---|
| Hardcoded namespace="ext-nodes-m1" in every query | {network="$network", app="$app"} — auto-discovered |
| Manual variable lists updated per namespace | label_values() queries — auto-populated |
| Different query patterns per dashboard | One consistent filter pattern everywhere |
| O(n×m) alert duplication | Single alert rule with label matchers |
| Adding a cluster = update dashboards + alerts | Adding a cluster = update helmfile values only |
| No cross-cluster comparison | Same labels on both clusters = unified views |

Timeline

  1. Now — Merge helm chart PRs (5 PRs, all review-ready, CodeRabbit/Cubic clean)
  2. Next — Add new labels to infrastructure defaults.yaml (additive, no removals)
  3. Then — Add podTargetLabels to PodMonitor configs per service
  4. Roll out — Per network, starting with preview/preprod, mainnet last
  5. Finally — Generate and deploy new dashboards via dashboard-generator

Rollout Risk

Low risk. All changes are:

  • Additive — new labels added alongside existing ones, nothing removed
  • Non-breaking — existing dashboards continue to work unchanged (they filter by namespace/pod pattern, not these new labels)
  • Incremental — can roll out per-network, starting with preview/preprod
  • Reversible — removing labels from helmfile values removes them from pods on next deploy

The PodMonitor podTargetLabels change is purely additive — it tells Prometheus to copy pod labels into scraped metrics. It doesn't change what gets scraped or how.

Existing labels like salt (used for operator reconciliation triggers), node-version, role, and cardano.demeter.run/network all remain in place and continue to serve their current purposes.

Charts Without Metrics Endpoints

Two charts currently don't expose Prometheus metrics at the application level:

  • dolos — Rust app, no /metrics endpoint
  • adder — no /metrics endpoint

These would need upstream application changes before PodMonitor support is useful. Not blockers for this proposal — we can add monitoring support to these charts once the apps expose metrics.
