Skip to content

Instantly share code, notes, and snippets.

@williamcaban
Last active April 28, 2026 02:07
Show Gist options
  • Select an option

  • Save williamcaban/b9a428ac7ff059a2b773bcdcaff7131d to your computer and use it in GitHub Desktop.

Select an option

Save williamcaban/b9a428ac7ff059a2b773bcdcaff7131d to your computer and use it in GitHub Desktop.

Evaluation flow for Telco Use Cases

Phase 1: Define Evaluation Scope

  1. Map use cases to evaluation dimensions — categorize by: network operations (fault diagnosis, config generation), customer-facing (intent classification, summarization), regulatory (data retention, privacy), and safety-critical (outage triage, escalation routing)
  2. Identify target model(s) — baseline (e.g., Llama-3-8B), candidate, and a reference frontier model for calibration
  3. Set acceptance thresholds per use case — e.g., config generation must be ≥95% syntactically valid; hallucination rate on network terminology must be <2%

Phase 2: Domain Knowledge & Accuracy

  1. Curate Telco-specific benchmarks — collect Q&A pairs from 3GPP specs, ITU-T standards, vendor docs (Nokia, Ericsson, Cisco IOS-XR), and internal runbooks
  2. Run terminology and factual accuracy tests — probe knowledge of: BGP, MPLS, NFV/VNF, O-RAN, 5G NR, slice management, YANG models, NetConf/RESTCONF
  3. Evaluate structured output fidelity — for config generation tasks, validate output against schema (e.g., YANG, Jinja2 templates, JSON/YAML)
  4. Test multi-step reasoning on network scenarios — give incident tickets and measure root-cause reasoning quality (use chain-of-thought scoring)

Phase 3: Reasoning Quality

  1. Use LM Evaluation Harness with reasoning tasks — run arc_challenge, hellaswag, winogrande as baselines; add custom Telco reasoning tasks
  2. Evaluate with self-consistency — run same prompts 5–10x, measure variance in answers (high variance = unreliable reasoning)
  3. Stress-test multi-hop inference — create scenarios requiring 3+ reasoning steps (e.g., "Given these SNMP traps, identify the likely failed component and recommend the rollback procedure")
  4. Score with LLM-as-judge — use a strong model (Sonnet/GPT-4o) as judge with a rubric for: correctness, completeness, safety of recommended actions

Phase 4: Operational Reliability

  1. Measure latency under load — use GuideLLM to profile throughput/latency at P50/P95/P99 for each serving configuration
  2. Test context window handling — feed full incident logs, YANG schemas, and runbooks; measure degradation at context boundaries
  3. Evaluate instruction-following on constrained outputs — Telco ops demand exact formats; test refusal-to-hallucinate on unknown OIDs/interfaces

Phase 5: Safety & Risk Assessment (Tool-Agnostic)

  1. Select a risk assessment framework appropriate to your environment

Choose one or more based on access and maturity:

┌─────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│    Tool / Framework     │                               Best For                               │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Garak                   │ Automated red-teaming, probe-based vulnerability scanning            │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ HELM (Stanford)         │ Holistic benchmarking incl. toxicity, bias, robustness               │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AISI Inspect            │ Structured eval tasks with scorer pipelines (UK AI Safety Institute) │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AIR-Bench /AIRS         │ AI risk taxonomy-aligned benchmarks (safety categories)              │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ DecodingTrust           │ GPT-family trust dimensions (generalizes to other models)            │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ TrustLLM                │ 8 trustworthiness dimensions, open benchmark suite                   │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Custom red-team prompts │ When domain specificity (Telco) outweighs generic coverage           │
└─────────────────────────┴──────────────────────────────────────────────────────────────────────┘
  1. Define risk categories relevant to Telco deployment

Map your probing to these dimensions regardless of tool:

  • Hallucination / Confabulation — fabricating interface names, OIDs, BGP attributes
  • Prompt Injection — critical for agentic NOC workflows where attacker-controlled input (e.g., log lines) reaches the model
  • Data / Credential Leakage — reproduction of credentials, configs, or PII from training data
  • Dangerous Output Generation — syntactically valid but operationally destructive configs (route leaks, ACL removals)
  • Bias & Fairness — relevant for customer-facing use cases (intent routing, support triage)
  • Robustness under adversarial input — paraphrased or noisy prompts from real operator queries
  1. Execute structured probing across each risk category

For each category:

  • Define a minimum of 20–50 test cases (automated or manual)
  • Include both benign and adversarial variants
  • Record: input, model output, expected output, pass/fail verdict, severity
  1. Score and classify findings
┌──────────┬─────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
│ Severity │                                       Definition                                        │                Action                │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Blocker  │ Model produces output that could directly cause outage, data breach, or safety incident │ Do not deploy; escalate              │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ High     │ Consistent failure on a core Telco task or meaningful leakage risk                      │ Require mitigation before deployment │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Medium   │ Occasional failure addressable by prompt guardrails or output filtering                 │ Deploy with controls                 │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Low      │ Rare or edge-case failure with negligible operational impact                            │ Document and monitor                 │
└──────────┴─────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────┘
  1. Apply mitigations and re-assess

Common mitigations before re-testing:

  • System prompt constraints (scope, output format, refusal instructions)
  • RAG grounding with authoritative Telco documentation
  • Output validation layer (schema check, regex, secondary model review)
  • Fine-tuning or RLHF on Telco-safe demonstrations
  1. Produce a Risk Residual Report

Document for each finding: risk category, severity, mitigation applied, residual risk level, and sign-off required. This feeds directly into your decision gate (Step 21).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment