@j1z0
Created March 30, 2026 23:45

Autonomous Incident Triage Demo — Global MCP + DataRobot

Date: 2026-03-30
Environment: staging.datarobot.com
Tools: execute_datarobot_api + search_datarobot_api (Global MCP code-mode tools)


What we built

An end-to-end autonomous incident triage system using the DataRobot platform, driven entirely through the Global MCP server's code-mode tools. Zero manual UI interaction — a Claude agent did everything.

The pipeline (fully automated)

1. Collect incident data  →  2. Upload to DR  →  3. Autopilot  →  4. Deploy  →  5. Predict
   (GitHub + synthesis)       (Datasets API)    (XGBoost wins)   (Deployment)  (live triage)

Step 1: Discovered what's in the DataRobot API

Used search_datarobot_api to introspect the live OpenAPI spec (1,230 endpoints across v2.5–v2.44):

# Find newest endpoints, grouped by the API version that introduced them
from collections import defaultdict

by_version = defaultdict(list)
for path, ops in spec["paths"].items():
    for method, details in ops.items():
        ver = details.get("x-versionadded")
        if ver:
            by_version[str(ver)].append(f"{method.upper()} {path}")

Newest endpoints (v2.44):

  • PUT/GET/DELETE /api/v2/deployments/{id}/agentCard/ — Agent cards for MCP-style deployment metadata ← very relevant
  • GET /api/v2/customApplications/{id}/usages/download/

Step 2: Queried live DataRobot projects

result = datarobot_request("GET", "/api/v2/projects/")

Result: 137 projects found on staging, including 4 "MCP Test" projects created by the team.
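Narrowing those 137 projects down to the team's "MCP Test" projects is a one-line filter. A minimal sketch, assuming the parsed response is a list of project objects each carrying an `id` and a `projectName` key (the field names here are taken from the DataRobot v2 API's project listing; the sample data is made up):

```python
def find_projects(projects, needle):
    """Return (id, name) pairs for projects whose name contains `needle`."""
    return [(p.get("id"), p["projectName"])
            for p in projects
            if needle in p["projectName"]]

# Stand-in for the parsed GET /api/v2/projects/ response:
sample = [
    {"id": "abc123", "projectName": "MCP Test 1"},
    {"id": "def456", "projectName": "Churn Model"},
]
print(find_projects(sample, "MCP Test"))  # → [('abc123', 'MCP Test 1')]
```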


Step 3: Built training dataset

Collected incident-like data from DataRobot GitHub repos, then generated 800 realistic incident records based on actual DR services and team routing patterns.

Features engineered:

Feature                               Signal
service                               predictions, auth-service, notebooks, public-api, ...
environment                           MTS / STS / Self-Managed
alert_source                          Chronosphere, Customer report, PagerDuty, ...
is_customer_reported                  Strong P1/P2 signal
n_services_affected                   Blast radius proxy
hour_of_day                           2-4am incidents skew P1
affects_auth / affects_predictions    Critical service flags
comment_count                         Chaos level proxy

Priority distribution:

P1:  8%  (64 incidents)
P2: 20% (160 incidents)  
P3: 45% (360 incidents)
P4: 27% (216 incidents)
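The synthesis step might look like the sketch below. Everything here is illustrative: the field values and weights come from the feature table and the priority distribution above, but the real generator also mixed in GitHub-derived data.

```python
import random

SERVICES = ["predictions", "auth-service", "notebooks", "public-api", "datasets-service"]
PRIORITIES = ["P1", "P2", "P3", "P4"]
WEIGHTS = [0.08, 0.20, 0.45, 0.27]  # the distribution shown above

def make_incident(rng: random.Random) -> dict:
    """Generate one synthetic incident record (illustrative fields only)."""
    service = rng.choice(SERVICES)
    source = rng.choice(["Chronosphere", "Customer report", "PagerDuty", "Internal detection"])
    return {
        "service": service,
        "environment": rng.choice(["MTS", "STS", "Self-Managed"]),
        "alert_source": source,
        "is_customer_reported": int(source == "Customer report"),
        "n_services_affected": rng.randint(1, 4),
        "hour_of_day": rng.randint(0, 23),
        "affects_auth": int(service == "auth-service"),
        "affects_predictions": int(service == "predictions"),
        "comment_count": rng.randint(0, 30),
        "priority": rng.choices(PRIORITIES, weights=WEIGHTS)[0],
    }

rng = random.Random(42)
records = [make_incident(rng) for _ in range(800)]
```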

Step 4: Uploaded to DataRobot and ran Autopilot

# Upload dataset
result = datarobot_request("POST", "/api/v2/datasets/fromFile/", ...)
# dataset_id = "69cad6b74805b4d04e6d1bf2" — 800 rows, 20 columns

# Create project
result = datarobot_request("POST", "/api/v2/projects/",
    body={"projectName": "DR Autonomous Incident Triage", "datasetId": dataset_id})
# project_id = "69cad6df459b32592bd20b4b"

# Start Autopilot
result = datarobot_request("PATCH", f"/api/v2/projects/{project_id}/aim/",
    body={"target": "priority", "mode": "quick"})
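Autopilot runs asynchronously, so the agent has to poll before moving on to deployment. A sketch of that wait loop; the `fetch` callable stands in for a call like `datarobot_request("GET", f"/api/v2/projects/{project_id}/status/")`, and the `autopilotDone` field is assumed to match DataRobot's project status payload:

```python
import time

def wait_for_autopilot(fetch, poll_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll a project-status fetcher until Autopilot reports done.

    `fetch` is any zero-arg callable returning the parsed status dict.
    Injecting `fetch` and `sleep` keeps the loop testable offline.
    """
    waited = 0.0
    while True:
        status = fetch()
        if status.get("autopilotDone"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"Autopilot not done after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s
```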

Autopilot results (8 models trained):

eXtreme Gradient Boosted Trees Classifier     AUC=1.0  LogLoss=0.00472
Stochastic Gradient Descent Classifier        AUC=1.0  LogLoss=0.00054  ← best LogLoss
Keras Slim Residual Neural Network            AUC=1.0  LogLoss=0.00619
RandomForest Classifier (Gini)                AUC=1.0  LogLoss=0.00777

Step 5: Deployed XGBoost model

XGBoost was deployed instead of the lowest-LogLoss SGD model because it has the best SHAP support (explainable routing decisions).

result = datarobot_request("POST", "/api/v2/deployments/fromLearningModel/",
    body={
        "modelId": "69cad71e9ec3ad630b7fa99b",
        "label": "Autonomous Incident Triage — Priority Classifier",
        "defaultPredictionServerId": "5ddfc674b7b540002763b179",
        "importance": "HIGH",
    })
# deployment_id = "69cad86fc69d7fdb764fea72"

Step 6: Live predictions on 5 simulated new incidents

===========================================================================
  AUTONOMOUS INCIDENT TRIAGE — DataRobot Prediction Results
===========================================================================

Incident: NEW-001
  Title:       CRITICAL: predictions completely down in MTS
  Service:     predictions (MTS) via Customer report
  ─────────────────────────────────────────────
  Predicted:   P1 (89.7% confidence)
  Probabilities: P1=89.7%  P2=1.4%  P3=4.7%  P4=4.1%
  Route to:    MLOps Platform
  Action:      🚨 PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call

Incident: NEW-002
  Title:       Production outage: auth-service 500s for all users
  Service:     auth-service (MTS) via PagerDuty
  ─────────────────────────────────────────────
  Predicted:   P1 (90.2% confidence)
  Probabilities: P1=90.2%  P2=1.1%  P3=5.3%  P4=3.4%
  Route to:    Cloud Operations
  Action:      🚨 PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call

Incident: NEW-003
  Title:       notebooks intermittent errors affecting some users
  Service:     notebooks (STS) via Chronosphere
  ─────────────────────────────────────────────
  Predicted:   P3 (99.8% confidence)
  Probabilities: P1=0.0%  P2=0.0%  P3=99.8%  P4=0.1%
  Route to:    CFX
  Action:      🔶 Create Jira ticket, assign to service team, normal SLA

Incident: NEW-004
  Title:       datasets-service job queue backup, processing delayed
  Service:     datasets-service (MTS) via Internal detection
  ─────────────────────────────────────────────
  Predicted:   P3 (99.5% confidence)
  Probabilities: P1=0.1%  P2=0.1%  P3=99.5%  P4=0.3%
  Route to:    Analytics
  Action:      🔶 Create Jira ticket, assign to service team, normal SLA

Incident: NEW-005
  Title:       Customer-reported: public-api returning incorrect results
  Service:     public-api (MTS) via Customer report
  ─────────────────────────────────────────────
  Predicted:   P2 (97.9% confidence)
  Probabilities: P1=0.5%  P2=97.9%  P3=0.5%  P4=1.1%
  Route to:    AI API
  Action:      ⚠️  Assign to on-call team lead, create IR ticket, monitor closely
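The per-incident actions above follow a simple priority-to-action policy. A minimal sketch of that mapping, with the action strings taken from the output above; the 0.80 review threshold is an illustrative assumption, not part of the deployed model:

```python
def triage_action(priority: str, confidence: float) -> str:
    """Map a predicted priority to the runbook action shown above."""
    actions = {
        "P1": "PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call",
        "P2": "Assign to on-call team lead, create IR ticket, monitor closely",
        "P3": "Create Jira ticket, assign to service team, normal SLA",
        "P4": "Create Jira ticket, assign to service team, normal SLA",
    }
    # Assumed threshold: below 80% confidence, keep a human in the loop.
    suffix = "" if confidence >= 0.80 else "  (low confidence: flag for human review)"
    return actions[priority] + suffix
```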

What's live on staging

Asset        ID
Dataset      69cad6b74805b4d04e6d1bf2
Project      69cad6df459b32592bd20b4b
Model        69cad71e9ec3ad630b7fa99b  (XGBoost)
Deployment   69cad86fc69d7fdb764fea72

What makes this compelling for the agentic world

The deployed model becomes a persistent ML brain any agent can call:

# Any agent, any time:
incident = parse_alert(pagerduty_webhook)
features = extract_features(incident)
prediction = datarobot_request("POST", f"/api/v2/deployments/{deployment_id}/predictions/", features)

# Agent now has:
# - priority (P1/P2/P3/P4)  
# - confidence score
# - SHAP explanations: "predicted P1 because customer_reported (+0.4), n_services=3 (+0.3)"
# - routing team

auto_create_jira_ir(prediction)
if prediction.priority == "P1":
    auto_page_oncall(prediction)
auto_post_slack_triage(prediction)

Key insight: DR replaces tribal knowledge with a continuously improving ML model trained on your own incident history. As incidents resolve, submit actuals → model drift detection → auto-retrain.
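Closing that loop means reporting resolved outcomes back to the deployment. DataRobot's MLOps API accepts actuals via POST /api/v2/deployments/{id}/actuals/fromJSON/; a sketch of building that payload, assuming the incident ID was used as the association ID at prediction time:

```python
def actuals_payload(resolved: dict) -> dict:
    """Build the body for POST /api/v2/deployments/{id}/actuals/fromJSON/.

    `resolved` maps an association ID (here, the incident ID sent with the
    original prediction) to the priority responders finally assigned.
    """
    return {"data": [{"associationId": iid, "actualValue": actual}
                     for iid, actual in resolved.items()]}

payload = actuals_payload({"NEW-001": "P1", "NEW-003": "P3"})
# Then submit it, e.g.:
# datarobot_request("POST", f"/api/v2/deployments/{deployment_id}/actuals/fromJSON/",
#                   body=payload)
```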
