The current Demeter dashboards have scaling and usability issues:
- **Hardcoded namespace filtering** — every dashboard query uses `{namespace="ext-nodes-m1"}` or `{namespace="ftr-ogmios-v0"}`. Adding a new cluster or namespace means updating every dashboard and alert manually.
- **Inconsistent query patterns** — the node dashboard filters by namespace + pod pattern, ogmios by pod pattern only, kupo by a generic pod label. There's no unified way to drill down.
- **Manual variable lists** — the Network and Namespace dropdowns are hardcoded custom variables, not auto-discovered from Prometheus. When a new network or namespace is added, someone has to update the dashboards by hand.
- **No cross-cluster comparison** — with two datasources (GKE Prometheus + Grafana Cloud), there's no way to compare the same service across clusters without building separate panels.
- **O(n × m) alert scaling** — each new namespace × service combination requires duplicated alert rules.
Use a consistent set of pod labels defined in helmfile values, propagated through PodMonitor `podTargetLabels` into Prometheus metrics, and auto-discovered by Grafana dashboard variables. One source of truth — the helmfile release values — drives everything downstream.
These labels are added alongside existing labels — nothing is removed or replaced.
| Label | Purpose | Example Values | Source |
|---|---|---|---|
| `app` | Application type | `cardano-node`, `ogmios`, `tx-submit-api`, `kupo`, `bursa` | `extraPodLabels` in helmfile values |
| `network` | Cardano network | `mainnet`, `preprod`, `preview`, `prime-mainnet`, `prime-testnet` | `extraPodLabels` (already exists) |
| `alias` | Human-friendly node ID | `cn.m.bp.az1`, `og.pv.az1`, `tx.m.az1` | `extraPodLabels` (new) |
| `az` | Availability zone | `az1`, `az2`, `us-central1-a` | `extraPodLabels` (new) |
| `group` | Node role/group | `core`, `relay`, `bp` | `extraPodLabels` (new) |
The alias naming convention is `{app_short}.{network_short}.{role}.{az}` — e.g., `cn.m.bp.az1` = cardano-node, mainnet, block producer, az1. (Network short codes used throughout: `m` = mainnet, `pp` = preprod, `pv` = preview.)
```
helmfile values (extraPodLabels)
        ↓
Pod labels on running containers
        ↓
PodMonitor podTargetLabels copies them to Prometheus
        ↓
Dashboard variables auto-discover via label_values()
        ↓
All panels filter with {network="$network", app="$app", alias=~"$alias"}
```
```yaml
# 1. Network — auto-discovered from metrics
network:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int, network)

# 2. App — filtered by selected network
app:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network"}, app)

# 3. Node — filtered by network + app, multi-select
alias:
  type: query
  query: label_values(cardano_node_metrics_blockNum_int{network="$network", app="$app"}, alias)
  includeAll: true
  multi: true
```

Every panel query then uses:

```promql
cardano_node_metrics_blockNum_int{network="$network", app="$app", alias=~"$alias"}
```
No hardcoded namespaces. No pod name pattern matching. Adding a new network or az is just a helmfile values change — dashboards auto-discover it.
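With `multi: true`, Grafana expands `$alias` into a pipe-separated regex, which is why the panel filter uses `=~` rather than `=`. The selected values below are illustrative:

```promql
# With "cn.m.bp.az1" and "cn.m.relay.az1" selected, $alias expands so the query becomes:
cardano_node_metrics_blockNum_int{network="mainnet", app="cardano-node", alias=~"cn.m.bp.az1|cn.m.relay.az1"}
```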
I've already opened PRs for the helm chart changes needed:
| Chart | PR | Change | Status |
|---|---|---|---|
| tx-submit-api | #355 | Add PodMonitor template, metrics port, extraPodLabels | Review ready |
| ogmios | #356 | Add podTargetLabels to PodMonitor | Review ready |
| kupo | #357 | Add podTargetLabels to PodMonitor | Review ready |
| balius | #358 | Add podTargetLabels + configurable podMetricsEndpoints | Review ready |
| bursa | #359 | Add PodMonitor template, extraPodLabels | Review ready |
cardano-node and dingo charts already have full support (extraPodLabels + podTargetLabels + PodMonitor).
Additive changes only — all existing labels (`network`, `node-version`, `role`, `salt`, `cardano.demeter.run/network`) remain untouched. We add four new labels: `alias`, `app`, `az`, and `group`.
Current:

```yaml
cardano_node_mainnet:
  extraPodLabels:
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet
```

Proposed (existing labels preserved, new labels added):

```yaml
cardano_node_mainnet:
  extraPodLabels:
    # Existing labels — unchanged
    network: "mainnet"
    node-version: "10.3.1"
    role: "node"
    salt: "v6g"
    cardano.demeter.run/network: mainnet
    # New labels for dashboard targeting
    alias: cn.m.az1
    app: cardano-node
    group: core
    az: az1
```

The same pattern applies to ogmios, kupo, tx-submit-api, etc. Each service gets its own alias prefix:
| Service | Alias Pattern | Examples |
|---|---|---|
| cardano-node | `cn.{net}.{role}.{az}` | `cn.m.bp.az1`, `cn.pv.relay.az2` |
| ogmios | `og.{net}.{az}` | `og.m.az1`, `og.pp.az1` |
| tx-submit-api | `tx.{net}.{az}` | `tx.m.az1`, `tx.pv.az1` |
| kupo | `ku.{net}.{az}` | `ku.m.az1`, `ku.pp.az1` |
| bursa | `bursa.{net}` | `bursa.m`, `bursa.pv` |
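As one concrete instance of this pattern, an ogmios release on preview in az1 might look like the following sketch (the release key and values are hypothetical, following the conventions above):

```yaml
ogmios_preview:
  extraPodLabels:
    # Dashboard-targeting labels; other existing labels omitted for brevity
    alias: og.pv.az1
    app: ogmios
    az: az1
    network: preview
```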
Then in PodMonitor configs, add podTargetLabels to propagate these labels into Prometheus:
```yaml
podMonitor:
  enabled: true
  podTargetLabels:
    - alias
    - app
    - az
    - group
    - network
```

This tells Prometheus to copy the pod labels into the scraped metric labels, making them available for `label_values()` queries in Grafana.
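The effect on a scraped series looks roughly like this (namespace and pod names are illustrative):

```promql
# Before podTargetLabels: only scrape-discovered labels
cardano_node_metrics_blockNum_int{namespace="ext-nodes-m1", pod="cardano-node-0"}

# After: the pod labels are copied onto every scraped series
cardano_node_metrics_blockNum_int{namespace="ext-nodes-m1", pod="cardano-node-0", network="mainnet", app="cardano-node", alias="cn.m.az1", az="az1", group="core"}
```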
For Demeter dashboards that pull from two Prometheus instances (GKE cluster + Grafana Cloud), the dashboard-generator tool supports this natively:
```yaml
datasources:
  demeter:
    type: prometheus
    uid: grafanacloud-prom
    url: https://blinklabsio.grafana.net/api/datasources/proxy/uid/grafanacloud-prom
    token: $BLINKLABS_GRAFANA_SA_TOKEN
  k3s:
    type: prometheus
    uid: prometheus
    token: $GRAFANA_TOKEN
```

Once both clusters use the same label conventions, a single dashboard config generates panels that work against either datasource — same queries, same variable chains, same filtering. Comparison panels can overlay metrics from both clusters side by side.
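A minimal sketch of what such a comparison panel could look like; the `panels`/`targets` keys here are hypothetical and the actual dashboard-generator schema may differ:

```yaml
# Hypothetical panel config, illustrative only, not the actual
# dashboard-generator schema.
panels:
  - title: "Block height: k3s vs Demeter"
    targets:
      - datasource: k3s
        expr: cardano_node_metrics_blockNum_int{network="$network", app="$app"}
      - datasource: demeter
        expr: cardano_node_metrics_blockNum_int{network="$network", app="$app"}
```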
With standardized labels, I can generate a linked dashboard suite using dashboard-generator:
| Dashboard | Content | Apps |
|---|---|---|
| Overview | Block height, epoch, sync status, peer counts, mempool | cardano-node, dingo |
| Block Production | Forging, adoption, leadership, latency histograms | cardano-node, dingo |
| Peer Health | Hot/warm/cold peers, connections, chainsync clients | cardano-node, dingo |
| Mempool | TX pool depth, evictions, CBOR cache hit ratios | cardano-node, dingo |
| Resources | CPU, memory, GC, goroutines, FDs | dingo, ogmios, bursa |
| Ogmios | Sync %, connections, messages, sessions, heap | ogmios |
| TX Submit | Submissions, failures, request latency | tx-submit-api |
All dashboards share the same variable chain (network → app → alias) and nav links. Switching from my k3s cluster to Demeter's GKE cluster is just changing the datasource — same labels, same queries.
| Current State | Proposed State |
|---|---|
| Hardcoded `namespace="ext-nodes-m1"` in every query | `{network="$network", app="$app"}` — auto-discovered |
| Manual variable lists updated per namespace | `label_values()` queries — auto-populated |
| Different query patterns per dashboard | One consistent filter pattern everywhere |
| O(n×m) alert duplication | Single alert rule with label matchers |
| Adding a cluster = update dashboards + alerts | Adding a cluster = update helmfile values only |
| No cross-cluster comparison | Same labels on both clusters = unified views |
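To make the alerting row concrete: one rule with label matchers can cover every network × alias combination. The alert name, expression, and threshold below are illustrative assumptions, not existing rules:

```yaml
groups:
  - name: cardano-node
    rules:
      - alert: CardanoNodeBlockHeightStalled
        # One rule matches all instances; the propagated labels
        # (network, alias, az) identify which node fired.
        expr: increase(cardano_node_metrics_blockNum_int{app="cardano-node"}[10m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Block height stalled on {{ $labels.alias }} ({{ $labels.network }})"
```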
- Now — Merge helm chart PRs (5 PRs, all review-ready, CodeRabbit/Cubic clean)
- Next — Add new labels to infrastructure defaults.yaml (additive, no removals)
- Then — Add `podTargetLabels` to PodMonitor configs per service
- Roll out — Per network, starting with preview/preprod, mainnet last
- Finally — Generate and deploy new dashboards via dashboard-generator
Low risk. All changes are:
- Additive — new labels added alongside existing ones, nothing removed
- Non-breaking — existing dashboards continue to work unchanged (they filter by namespace/pod pattern, not these new labels)
- Incremental — can roll out per-network, starting with preview/preprod
- Reversible — removing labels from helmfile values removes them from pods on next deploy
The PodMonitor `podTargetLabels` change is purely additive — it tells Prometheus to copy pod labels into scraped metrics. It doesn't change what gets scraped or how.
Existing labels like `salt` (used for operator reconciliation triggers), `node-version`, `role`, and `cardano.demeter.run/network` all remain in place and continue to serve their current purposes.
Two charts currently don't expose Prometheus metrics at the application level:
- dolos — Rust app, no `/metrics` endpoint
- adder — no `/metrics` endpoint
These would need upstream application changes before PodMonitor support is useful. Not blockers for this proposal — we can add monitoring support to these charts once the apps expose metrics.