Phase 1: Define Evaluation Scope
- Map use cases to evaluation dimensions: group each use case into one of four categories: network operations (fault diagnosis, config generation), customer-facing (intent classification, summarization), regulatory (data retention, privacy), and safety-critical (outage triage, escalation routing)
- Identify target model(s): a baseline (e.g., Llama-3-8B), a candidate, and a reference frontier model for calibration
- Set acceptance thresholds per use case: e.g., at least 95% of generated configs must be syntactically valid; the hallucination rate on network terminology must be below 2%
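
One way to make the scope above machine-checkable is to record it as data. The following is a minimal sketch, not a prescribed implementation. All identifiers (`Category`, `Threshold`, `USE_CASES`, `THRESHOLDS`, `evaluate`) and the metric and scope names are hypothetical. The candidate and reference model names are left as placeholders because the plan does not name them, and attaching the hallucination threshold to a `network_terminology` scope is an assumption about how that check would be grouped.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    NETWORK_OPERATIONS = "network operations"
    CUSTOMER_FACING = "customer-facing"
    REGULATORY = "regulatory"
    SAFETY_CRITICAL = "safety-critical"


# Use case -> category, mirroring the first bullet of Phase 1.
USE_CASES = {
    "fault_diagnosis": Category.NETWORK_OPERATIONS,
    "config_generation": Category.NETWORK_OPERATIONS,
    "intent_classification": Category.CUSTOMER_FACING,
    "summarization": Category.CUSTOMER_FACING,
    "data_retention": Category.REGULATORY,
    "privacy": Category.REGULATORY,
    "outage_triage": Category.SAFETY_CRITICAL,
    "escalation_routing": Category.SAFETY_CRITICAL,
}

# Model roles from the second bullet. Only the baseline example is named
# in the plan; the other two are placeholders to fill in.
MODELS = {
    "baseline": "Llama-3-8B",
    "candidate": "<candidate-model>",
    "reference": "<frontier-reference-model>",
}


@dataclass(frozen=True)
class Threshold:
    scope: str              # hypothetical scope label (use case or topic)
    metric: str             # hypothetical metric name
    limit: float            # fraction in [0, 1]
    higher_is_better: bool  # True: value >= limit passes; False: value < limit passes

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.limit
        return value < self.limit


# The two example thresholds from the third bullet.
THRESHOLDS = [
    Threshold("config_generation", "syntactic_validity", 0.95, True),
    Threshold("network_terminology", "hallucination_rate", 0.02, False),
]


def evaluate(scores: dict[tuple[str, str], float]) -> dict[tuple[str, str], bool]:
    """Return pass/fail per threshold; a missing score counts as a failure."""
    results = {}
    for t in THRESHOLDS:
        key = (t.scope, t.metric)
        results[key] = key in scores and t.passes(scores[key])
    return results
```

Calling `evaluate` with measured scores keyed by `(scope, metric)` yields one boolean per threshold. Treating a missing score as a failure is a deliberate choice in this sketch, so that an unmeasured use case cannot pass by omission.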