Note: Much of this was extracted and synthesized by Copilot. It's intended as research to guide ideas and discussion; it may not be fully correct, and it is not a final implementation plan.
We need an agent that can analyze any CI run and give a 100% accurate account of what happened: which tests failed, which are flaky, which are real regressions, and which should be retried. Today we can't do this.