Phase 1: Define Evaluation Scope
- Map use cases to evaluation dimensions — categorize by: network operations (fault diagnosis, config generation), customer-facing (intent classification, summarization), regulatory (data retention, privacy), and safety-critical (outage triage, escalation routing)
- Identify target model(s) — baseline (e.g., Llama-3-8B), candidate, and a reference frontier model for calibration
- Set acceptance thresholds per use case — e.g., config generation must be ≥95% syntactically valid; hallucination rate on network terminology must be <2%
Phase 2: Domain Knowledge & Accuracy
- Curate Telco-specific benchmarks — collect Q&A pairs from 3GPP specs, ITU-T standards, vendor docs (Nokia, Ericsson, Cisco IOS-XR), and internal runbooks
- Run terminology and factual accuracy tests — probe knowledge of: BGP, MPLS, NFV/VNF, O-RAN, 5G NR, slice management, YANG models, NetConf/RESTCONF
- Evaluate structured output fidelity — for config generation tasks, validate output against schema (e.g., YANG, Jinja2 templates, JSON/YAML)
- Test multi-step reasoning on network scenarios — give incident tickets and measure root-cause reasoning quality (use chain-of-thought scoring)
Phase 3: Reasoning Quality
- Use LM Evaluation Harness with reasoning tasks — run arc_challenge, hellaswag, winogrande as baselines; add custom Telco reasoning tasks
- Evaluate with self-consistency — run same prompts 5–10x, measure variance in answers (high variance = unreliable reasoning)
- Stress-test multi-hop inference — create scenarios requiring 3+ reasoning steps (e.g., "Given these SNMP traps, identify the likely failed component and recommend the rollback procedure")
- Score with LLM-as-judge — use a strong model (Sonnet/GPT-4o) as judge with a rubric for: correctness, completeness, safety of recommended actions
Phase 4: Operational Reliability
- Measure latency under load — use GuideLLM to profile throughput/latency at P50/P95/P99 for each serving configuration
- Test context window handling — feed full incident logs, YANG schemas, and runbooks; measure degradation at context boundaries
- Evaluate instruction-following on constrained outputs — Telco ops demand exact formats; test refusal-to-hallucinate on unknown OIDs/interfaces
Phase 5: Safety & Risk Assessment (Tool-Agnostic)
- Select a risk assessment framework appropriate to your environment
Choose one or more based on access and maturity:
┌─────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│ Tool / Framework │ Best For │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Garak │ Automated red-teaming, probe-based vulnerability scanning │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ HELM (Stanford) │ Holistic benchmarking incl. toxicity, bias, robustness │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AISI Inspect │ Structured eval tasks with scorer pipelines (UK AI Safety Institute) │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AIR-Bench /AIRS │ AI risk taxonomy-aligned benchmarks (safety categories) │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ DecodingTrust │ GPT-family trust dimensions (generalizes to other models) │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ TrustLLM │ 8 trustworthiness dimensions, open benchmark suite │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Custom red-team prompts │ When domain specificity (Telco) outweighs generic coverage │
└─────────────────────────┴──────────────────────────────────────────────────────────────────────┘
- Define risk categories relevant to Telco deployment
Map your probing to these dimensions regardless of tool:
- Hallucination / Confabulation — fabricating interface names, OIDs, BGP attributes
- Prompt Injection — critical for agentic NOC workflows where attacker-controlled input (e.g., log lines) reaches the model
- Data / Credential Leakage — reproduction of credentials, configs, or PII from training data
- Dangerous Output Generation — syntactically valid but operationally destructive configs (route leaks, ACL removals)
- Bias & Fairness — relevant for customer-facing use cases (intent routing, support triage)
- Robustness under adversarial input — paraphrased or noisy prompts from real operator queries
- Execute structured probing across each risk category
For each category:
- Define a minimum of 20–50 test cases (automated or manual)
- Include both benign and adversarial variants
- Record: input, model output, expected output, pass/fail verdict, severity
- Score and classify findings
┌──────────┬─────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
│ Severity │ Definition │ Action │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Blocker │ Model produces output that could directly cause outage, data breach, or safety incident │ Do not deploy; escalate │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ High │ Consistent failure on a core Telco task or meaningful leakage risk │ Require mitigation before deployment │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Medium │ Occasional failure addressable by prompt guardrails or output filtering │ Deploy with controls │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Low │ Rare or edge-case failure with negligible operational impact │ Document and monitor │
└──────────┴─────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────┘
- Apply mitigations and re-assess
Common mitigations before re-testing:
- System prompt constraints (scope, output format, refusal instructions)
- RAG grounding with authoritative Telco documentation
- Output validation layer (schema check, regex, secondary model review)
- Fine-tuning or RLHF on Telco-safe demonstrations
- Produce a Risk Residual Report
Document for each finding: risk category, severity, mitigation applied, residual risk level, and sign-off required. This feeds directly into your decision gate (Step 21).