williamcaban/SKILL_Telco.md

Evaluation flow for Telco Use Cases

Phase 1: Define Evaluation Scope

Map use cases to evaluation dimensions — categorize by: network operations (fault diagnosis, config generation), customer-facing (intent classification, summarization), regulatory (data retention, privacy), and safety-critical (outage triage, escalation routing)
Identify target model(s) — baseline (e.g., Llama-3-8B), candidate, and a reference frontier model for calibration
Set acceptance thresholds per use case — e.g., config generation must be ≥95% syntactically valid; hallucination rate on network terminology must be <2%

Phase 2: Domain Knowledge & Accuracy

Curate Telco-specific benchmarks — collect Q&A pairs from 3GPP specs, ITU-T standards, vendor docs (Nokia, Ericsson, Cisco IOS-XR), and internal runbooks
Run terminology and factual accuracy tests — probe knowledge of: BGP, MPLS, NFV/VNF, O-RAN, 5G NR, slice management, YANG models, NetConf/RESTCONF
Evaluate structured output fidelity — for config generation tasks, validate output against schema (e.g., YANG, Jinja2 templates, JSON/YAML)
Test multi-step reasoning on network scenarios — give incident tickets and measure root-cause reasoning quality (use chain-of-thought scoring)

Phase 3: Reasoning Quality

Use LM Evaluation Harness with reasoning tasks — run arc_challenge, hellaswag, winogrande as baselines; add custom Telco reasoning tasks
Evaluate with self-consistency — run same prompts 5–10x, measure variance in answers (high variance = unreliable reasoning)
Stress-test multi-hop inference — create scenarios requiring 3+ reasoning steps (e.g., "Given these SNMP traps, identify the likely failed component and recommend the rollback procedure")
Score with LLM-as-judge — use a strong model (Sonnet/GPT-4o) as judge with a rubric for: correctness, completeness, safety of recommended actions

Phase 4: Operational Reliability

Measure latency under load — use GuideLLM to profile throughput/latency at P50/P95/P99 for each serving configuration
Test context window handling — feed full incident logs, YANG schemas, and runbooks; measure degradation at context boundaries
Evaluate instruction-following on constrained outputs — Telco ops demand exact formats; test refusal-to-hallucinate on unknown OIDs/interfaces

Phase 5: Safety & Risk Assessment (Tool-Agnostic)

Select a risk assessment framework appropriate to your environment

Choose one or more based on access and maturity:

┌─────────────────────────┬──────────────────────────────────────────────────────────────────────┐
│    Tool / Framework     │                               Best For                               │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Garak                   │ Automated red-teaming, probe-based vulnerability scanning            │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ HELM (Stanford)         │ Holistic benchmarking incl. toxicity, bias, robustness               │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AISI Inspect            │ Structured eval tasks with scorer pipelines (UK AI Safety Institute) │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ AIR-Bench /AIRS         │ AI risk taxonomy-aligned benchmarks (safety categories)              │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ DecodingTrust           │ GPT-family trust dimensions (generalizes to other models)            │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ TrustLLM                │ 8 trustworthiness dimensions, open benchmark suite                   │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────┤
│ Custom red-team prompts │ When domain specificity (Telco) outweighs generic coverage           │
└─────────────────────────┴──────────────────────────────────────────────────────────────────────┘

Define risk categories relevant to Telco deployment

Map your probing to these dimensions regardless of tool:

Hallucination / Confabulation — fabricating interface names, OIDs, BGP attributes
Prompt Injection — critical for agentic NOC workflows where attacker-controlled input (e.g., log lines) reaches the model
Data / Credential Leakage — reproduction of credentials, configs, or PII from training data
Dangerous Output Generation — syntactically valid but operationally destructive configs (route leaks, ACL removals)
Bias & Fairness — relevant for customer-facing use cases (intent routing, support triage)
Robustness under adversarial input — paraphrased or noisy prompts from real operator queries

Execute structured probing across each risk category

For each category:

Define a minimum of 20–50 test cases (automated or manual)
Include both benign and adversarial variants
Record: input, model output, expected output, pass/fail verdict, severity

Score and classify findings

┌──────────┬─────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
│ Severity │                                       Definition                                        │                Action                │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Blocker  │ Model produces output that could directly cause outage, data breach, or safety incident │ Do not deploy; escalate              │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ High     │ Consistent failure on a core Telco task or meaningful leakage risk                      │ Require mitigation before deployment │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Medium   │ Occasional failure addressable by prompt guardrails or output filtering                 │ Deploy with controls                 │
├──────────┼─────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────┤
│ Low      │ Rare or edge-case failure with negligible operational impact                            │ Document and monitor                 │
└──────────┴─────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────┘

Apply mitigations and re-assess

Common mitigations before re-testing:

System prompt constraints (scope, output format, refusal instructions)
RAG grounding with authoritative Telco documentation
Output validation layer (schema check, regex, secondary model review)
Fine-tuning or RLHF on Telco-safe demonstrations

Produce a Risk Residual Report

Document for each finding: risk category, severity, mitigation applied, residual risk level, and sign-off required. This feeds directly into your decision gate (Step 21).

williamcaban/SKILL_Telco.md

Select an option

No results found

Select an option

No results found

Evaluation flow for Telco Use Cases