@wcatz
Last active March 28, 2026 12:15
Demeter Dashboard Standardization Proposal — helmfile label → Prometheus → Grafana variable chain

Demeter Dashboard Standardization Proposal

Problem

The current Demeter dashboards have scaling and usability issues:

  1. Hardcoded namespace filtering — every dashboard query uses {namespace="ext-nodes-m1"} or {namespace="ftr-ogmios-v0"}. Adding a new cluster or namespace means updating every dashboard and alert manually.

  2. Inconsistent query patterns — the node dashboard filters by namespace + pod pattern, ogmios by pod pattern only, kupo by generic pod label. There's no unified way to drill down.

  3. Manual variable lists — the Network and Namespace dropdowns are hardcoded custom variables, not auto-discovered from Prometheus. When a new network or namespace is added, someone has to update dashboards.

  4. No cross-cluster comparison — with two datasources (GKE Prometheus + Grafana Cloud), there's no way to compare the same service across clusters without building separate panels.

  5. O(n × m) alert scaling — each new namespace × service combination requires duplicated alert rules.

Proposal: Helmfile Label → Prometheus Label → Dashboard Variable

Core Idea

Use a consistent set of pod labels defined in helmfile values, propagated through PodMonitor podTargetLabels into Prometheus metrics, and auto-discovered by Grafana dashboard variables. One source of truth — the helmfile release values — drives everything downstream.

Standard Label Set

These labels are added alongside existing labels — nothing is removed or replaced.

| Label | Purpose | Example Values | Source |
|---|---|---|---|
| app | Application type | cardano-node, ogmios, tx-submit-api, kupo, bursa | extraPodLabels in helmfile values |
| network | Cardano network | mainnet, preprod, preview, prime-mainnet, prime-testnet | extraPodLabels (already exists) |
| alias | Human-friendly node ID | cn.m.bp.az1, og.pv.az1, tx.m.az1 | extraPodLabels (new) |
| az | Availability zone | az1, az2, us-central1-a | extraPodLabels (new) |
| group | Node role/group | core, relay, bp | extraPodLabels (new) |

The alias naming convention is {app_short}.{network_short}.{role}.{az} — e.g., cn.m.bp.az1 = cardano-node, mainnet, block producer, az1. The {role} segment appears only for services that need it; see the per-service patterns below.
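As a sketch, the convention can be expressed as a small helper. The short-code maps below are assumptions inferred from the examples in this proposal (cn/og/tx/ku, m/pp/pv), not a definitive spec:

```python
# Hypothetical helper illustrating the alias convention
# {app_short}.{network_short}.{role}.{az}. The short-code maps are
# inferred from the proposal's examples, not defined by it.
APP_SHORT = {"cardano-node": "cn", "ogmios": "og",
             "tx-submit-api": "tx", "kupo": "ku", "bursa": "bursa"}
NET_SHORT = {"mainnet": "m", "preprod": "pp", "preview": "pv"}

def make_alias(app: str, network: str, az: str = "", role: str = "") -> str:
    """Build an alias like cn.m.bp.az1; role and az are optional segments."""
    parts = [APP_SHORT[app], NET_SHORT[network]]
    if role:
        parts.append(role)
    if az:
        parts.append(az)
    return ".".join(parts)

print(make_alias("cardano-node", "mainnet", "az1", role="bp"))  # cn.m.bp.az1
print(make_alias("ogmios", "preview", "az1"))                   # og.pv.az1
print(make_alias("bursa", "mainnet"))                           # bursa.m
```

Keeping the mapping in one place (helmfile values or a small script like this) avoids drift between services as new networks are added.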

How It Flows

helmfile values (extraPodLabels)
    ↓
Pod labels on running containers
    ↓
PodMonitor podTargetLabels copies them to Prometheus
    ↓
Dashboard variables auto-discover via label_values()
    ↓
All panels filter with {network="$network", app="$app", alias=~"$alias"}

Dashboard Variable Chain

# 1. Network — auto-discovered from metrics
network:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int, network)

# 2. App — filtered by selected network
app:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network"}, app)

# 3. Node — filtered by network + app, multi-select
alias:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network", app="$app"}, alias)
  includeAll: true
  multi: true

Every panel query then uses:

cardano_node_metrics_blockNum_int{network="$network", app="$app", alias=~"$alias"}

No hardcoded namespaces. No pod name pattern matching. Adding a new network or az is just a helmfile values change — dashboards auto-discover it.
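For instance, a per-node block-height panel could aggregate on the alias label (a sketch — the max by (alias) aggregation is illustrative, not mandated by the proposal):

```
max by (alias) (cardano_node_metrics_blockNum_int{network="$network", app="$app", alias=~"$alias"})
```

With multi-select enabled on alias, this yields one series per selected node, each legend entry a readable short name like cn.m.bp.az1.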

What Needs to Change in Blink Charts

I've already opened PRs for the helm chart changes needed:

| Chart | PR | Change | Status |
|---|---|---|---|
| tx-submit-api | #355 | Add PodMonitor template, metrics port, extraPodLabels | Review ready |
| ogmios | #356 | Add podTargetLabels to PodMonitor | Review ready |
| kupo | #357 | Add podTargetLabels to PodMonitor | Review ready |
| balius | #358 | Add podTargetLabels + configurable podMetricsEndpoints | Review ready |
| bursa | #359 | Add PodMonitor template, extraPodLabels | Review ready |

cardano-node and dingo charts already have full support (extraPodLabels + podTargetLabels + PodMonitor).

What Needs to Change in Blink Infrastructure

Additive changes only — all existing labels (network, node-version, role, salt, cardano.demeter.run/network) remain untouched. We add four new labels: alias, app, az, and group.

Current:

cardano_node_mainnet:
  extraPodLabels:
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet

Proposed (existing labels preserved, new labels added):

cardano_node_mainnet:
  extraPodLabels:
    # Existing labels — unchanged
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet
    # New labels for dashboard targeting
    alias: cn.m.az1
    app: cardano-node
    group: core
    az: az1

The same pattern applies to ogmios, kupo, tx-submit-api, and the rest. Each service gets its own alias prefix:

| Service | Alias Pattern | Examples |
|---|---|---|
| cardano-node | cn.{net}.{role}.{az} | cn.m.bp.az1, cn.pv.relay.az2 |
| ogmios | og.{net}.{az} | og.m.az1, og.pp.az1 |
| tx-submit-api | tx.{net}.{az} | tx.m.az1, tx.pv.az1 |
| kupo | ku.{net}.{az} | ku.m.az1, ku.pp.az1 |
| bursa | bursa.{net} | bursa.m, bursa.pv |

Then in PodMonitor configs, add podTargetLabels to propagate these labels into Prometheus:

podMonitor:
  enabled: true
  podTargetLabels:
    - alias
    - app
    - az
    - group
    - network

This tells Prometheus to copy the pod labels into the scraped metric labels, making them available for label_values() queries in Grafana.
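Concretely, a scraped series would then carry the pod labels as metric labels, roughly like this (the namespace and pod values here are illustrative):

```
cardano_node_metrics_blockNum_int{
  alias="cn.m.bp.az1", app="cardano-node", az="az1", group="core",
  network="mainnet", namespace="ext-nodes-m1", pod="cardano-node-0"
}
```

Every label in the standard set is now queryable, so label_values() variable chains and alert matchers work without touching namespace or pod names.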

Two-Datasource Dashboard Architecture

For Demeter dashboards that pull from two Prometheus instances (GKE cluster + Grafana Cloud), the dashboard-generator tool supports this natively:

datasources:
  demeter:
    type: prometheus
    uid: grafanacloud-prom
    url: https://blinklabsio.grafana.net/api/datasources/proxy/uid/grafanacloud-prom
    token: $BLINKLABS_GRAFANA_SA_TOKEN
  k3s:
    type: prometheus
    uid: prometheus
    token: $GRAFANA_TOKEN

Once both clusters use the same label conventions, a single dashboard config generates panels that work against either datasource — same queries, same variable chains, same filtering. Comparison panels can overlay metrics from both clusters side by side.
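As a sketch, a block-height lag panel could subtract one cluster's series from the other's — this assumes both series land in a single datasource (e.g., via remote_write) and that each Prometheus attaches a distinguishing external label, called cluster here, which is not part of this proposal:

```
  max by (alias) (cardano_node_metrics_blockNum_int{cluster="gke", network="$network"})
- on (alias)
  max by (alias) (cardano_node_metrics_blockNum_int{cluster="k3s", network="$network"})
```

Where the series stay in separate datasources, Grafana's mixed datasource can overlay the two queries on one panel instead (binary ops can't span datasources).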

Dashboard Suite

With standardized labels, I can generate a linked dashboard suite using dashboard-generator:

| Dashboard | Content | Apps |
|---|---|---|
| Overview | Block height, epoch, sync status, peer counts, mempool | cardano-node, dingo |
| Block Production | Forging, adoption, leadership, latency histograms | cardano-node, dingo |
| Peer Health | Hot/warm/cold peers, connections, chainsync clients | cardano-node, dingo |
| Mempool | TX pool depth, evictions, CBOR cache hit ratios | cardano-node, dingo |
| Resources | CPU, memory, GC, goroutines, FDs | dingo, ogmios, bursa |
| Ogmios | Sync %, connections, messages, sessions, heap | ogmios |
| TX Submit | Submissions, failures, request latency | tx-submit-api |

All dashboards share the same variable chain (network → app → alias) and nav links. Switching from my k3s cluster to Demeter's GKE cluster is just changing the datasource — same labels, same queries.

Benefits

| Current State | Proposed State |
|---|---|
| Hardcoded namespace="ext-nodes-m1" in every query | {network="$network", app="$app"} — auto-discovered |
| Manual variable lists updated per namespace | label_values() queries — auto-populated |
| Different query patterns per dashboard | One consistent filter pattern everywhere |
| O(n×m) alert duplication | Single alert rule with label matchers |
| Adding a cluster = update dashboards + alerts | Adding a cluster = update helmfile values only |
| No cross-cluster comparison | Same labels on both clusters = unified views |

Timeline

  1. Now — Merge helm chart PRs (5 PRs, all review-ready, CodeRabbit/Cubic clean)
  2. Next — Add new labels to infrastructure defaults.yaml (additive, no removals)
  3. Then — Add podTargetLabels to PodMonitor configs per service
  4. Roll out — Per network, starting with preview/preprod, mainnet last
  5. Finally — Generate and deploy new dashboards via dashboard-generator

Rollout Risk

Low risk. All changes are:

  • Additive — new labels added alongside existing ones, nothing removed
  • Non-breaking — existing dashboards continue to work unchanged (they filter by namespace/pod pattern, not these new labels)
  • Incremental — can roll out per-network, starting with preview/preprod
  • Reversible — removing labels from helmfile values removes them from pods on next deploy

The PodMonitor podTargetLabels change is purely additive — it tells Prometheus to copy pod labels into scraped metrics. It doesn't change what gets scraped or how.

Existing labels like salt (used for operator reconciliation triggers), node-version, role, and cardano.demeter.run/network all remain in place and continue to serve their current purposes.

Charts Without Metrics Endpoints

Two charts currently don't expose Prometheus metrics at the application level:

  • dolos — Rust app, no /metrics endpoint
  • adder — no /metrics endpoint

These would need upstream application changes before PodMonitor support is useful. Not blockers for this proposal — we can add monitoring support to these charts once the apps expose metrics.
