@nerdalert
Created April 29, 2026 17:54
MCP Queries

  GPU deep dives:
  - "What is the GPU temperature and power draw right now?"
  - "Show me tensor core activity and memory bandwidth utilization on the GPU"
  - "What are the SM and memory clock speeds — is the GPU throttling?"
  - "How much VRAM is used vs free on each GPU?"
  - "Show me PCIe throughput on the GPU node"

  Prometheus raw queries:
  - "Query DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE and tell me the VRAM breakdown"
  - "Run a PromQL query for rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) to show GPU energy consumption rate"
  - "List all available metrics that contain 'DCGM'"
  - "What Prometheus metrics are available for GPU monitoring?"
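The raw queries above ultimately go through the Prometheus HTTP API. A minimal sketch of building the instant-query URL for the energy-rate example (the base URL `prometheus.monitoring.svc:9090` is a placeholder; substitute your cluster's Prometheus service address):

```python
from urllib.parse import urlencode

# Placeholder Prometheus address; substitute your cluster's service URL.
PROM_URL = "http://prometheus.monitoring.svc:9090"

def instant_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': promql})}"

# GPU energy consumption rate, matching the PromQL example above
url = instant_query_url("rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m])")
print(url)
```

A GET on that URL returns JSON with `status` and `data.result` fields; the MCP server issues equivalent requests on your behalf.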

  Cluster operations:
  - "Show me all pods in the vllm-demo namespace with their restart counts"
  - "Are there any warning events in the vllm-demo namespace?"
  - "What's the CPU and memory capacity on each node?"
  - "Describe the vllm pod — show me its full resource spec and status"
  - "Are there any KServe InferenceServices running?"

  Alerts:
  - "Show me alerts grouped by severity"
  - "Are there any critical or warning alerts firing right now?"

  Investigations (composite multi-source):
  - "Investigate errors in the vllm-demo namespace over the last 30 minutes"
  - "Investigate GPU utilization in the vllm-demo namespace"
  - "Run a latency investigation for the Qwen model"

  Dashboards (if Grafana is connected):
  - "List all Grafana dashboards"
  - "Show me the panels in the GPU monitoring dashboard"
    
    
---------------------------------------------------------------------------------

  Time-windowed analysis:
  - "What was the average GPU utilization over the last 30 minutes vs the last 5 minutes?"
  - "Show me the TPOT trend over the last hour — is latency stable or drifting?"
  - "How has GPU temperature changed since the benchmark started?"
  - "Compare GPU power draw in the last 10 minutes vs the last hour"
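Window comparisons like these map onto `avg_over_time` with different ranges; for example (DCGM exporter metric name, assuming the default exporter configuration):

```promql
# Average GPU utilization over the last 30 minutes vs the last 5 minutes
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]))
avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```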

  Cross-signal correlation:
  - "When GPU utilization was above 80%, what was the corresponding TTFT and queue depth?"
  - "Show me the rate of completed requests per second alongside GPU compute utilization — do they correlate?"
  - "Is there a relationship between KV cache usage and end-to-end latency during load?"
  - "What was happening on the cluster — pods, events, alerts — during the GPU utilization spike?"

  Throughput and capacity:
  - "What is the sustained token generation rate in tokens per second over the last 15 minutes?"
  - "How many total requests has vLLM served and what's the error rate?"
  - "At what request rate does TPOT start degrading — show me p50 vs p99 divergence"
  - "What's the prompt token throughput vs generation token throughput?"
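Sustained token rate comes from a counter's delta over the window, which is what PromQL's `rate()` computes. Worked with hypothetical counter samples (the metric this mirrors, `vllm:generation_tokens_total`, is vLLM's generation-token counter; the numbers are invented):

```python
# Two samples of a monotonically increasing token counter, 15 minutes apart
t0_seconds, tokens0 = 0.0, 1_200_000
t1_seconds, tokens1 = 900.0, 1_650_000

# Equivalent of rate(vllm:generation_tokens_total[15m])
rate_tps = (tokens1 - tokens0) / (t1_seconds - t0_seconds)
print(f"{rate_tps:.0f} tokens/s")  # 500 tokens/s
```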

  Operational investigation:
  - "Something seems slow — walk me through GPU state, model latency, queue pressure, and any alerts to find the bottleneck"
  - "Give me a full health check: GPU utilization, VRAM, temperature, vLLM latency, queue depth, pod status, and active alerts"
  - "Are there any Kubernetes events or pod restarts in the last hour that might explain latency changes?"
  - "If I doubled the request rate, what metrics suggest the model would start queueing?"

  GPU deep dive:
  - "What percentage of time are the tensor cores active vs the overall GPU compute utilization?"
  - "Show me memory bandwidth utilization alongside VRAM usage — is the model memory-bound?"
  - "What's the GPU energy consumption rate and how does it track with inference load?"
  - "Are the SM and memory clocks running at max, or is the GPU throttling?"
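The deep-dive questions above correspond to DCGM metrics; the profiling fields are only exported when they are enabled in the DCGM exporter's configuration:

```promql
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE   # fraction of cycles the tensor pipes are active
DCGM_FI_PROF_SM_ACTIVE            # fraction of cycles at least one warp is resident on an SM
DCGM_FI_DEV_MEM_COPY_UTIL         # memory bandwidth utilization, percent
DCGM_FI_DEV_SM_CLOCK              # SM clock, MHz (compare to the card's max to spot throttling)
```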

  Model-specific:
  - "Break down the request lifecycle — how much time is spent in prefill vs decode for the Qwen model?"
  - "What's the distribution of output token counts across requests?"
  - "Show me the inter-token latency distribution — is it consistent or bursty?"
  - "How does the Qwen 0.5B TPOT compare to what we'd expect from an L40S GPU?"
  
  "Why am I experiencing latency?"

---------------------------------------------------------------------------------

  GPU + Infrastructure:
  - "Show me the GPU nodes and what's using them"
  - "What does GPU utilization look like across the cluster?"
  - "Investigate GPU behavior — compute utilization, VRAM, temperature, and power"

  vLLM Model Performance:
  - "What are the TTFT, TPOT, and end-to-end latency percentiles for the Qwen model?"
  - "Show me the KV cache usage and request queue depth for vLLM"
  - "How many requests has vLLM processed and are there any errors?"
  - "Compare TPOT at p50, p95, and p99 — is there latency variance?"
  - "What vLLM metrics are available? List them all."
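Latency percentiles like TPOT come from vLLM's Prometheus histograms via `histogram_quantile`; for example (metric and label names follow vLLM's exporter and may vary by version):

```promql
# p99 time-per-output-token over the last 5 minutes
histogram_quantile(0.99,
  sum by (le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))
```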

  Correlation Across Signals:
  - "Correlate GPU utilization with vLLM request latency — is the GPU the bottleneck?"
  - "Show me GPU power draw alongside TPOT — does power correlate with token generation speed?"
  - "Are there any alerts firing while vLLM is under load?"

  Operational:
  - "Show me the pods and events in the vllm-demo namespace"
  - "Are there any active alerts? What's the overall cluster health?"
  - "Describe the vLLM pod — resources, restarts, node placement"

  Degraded Mode / Trust:
  - "Are there any error logs for the vLLM deployment?" (Loki not configured — shows honest degradation)
  - "List all Grafana dashboards" (Grafana not connected — shows the same graceful degradation)
