@j1z0
Created March 30, 2026 23:45

Autonomous Incident Triage Demo — Global MCP + DataRobot

Date: 2026-03-30
Environment: staging.datarobot.com
Tools: execute_datarobot_api + search_datarobot_api (Global MCP code-mode tools)


What we built

An end-to-end autonomous incident triage system using the DataRobot platform, driven entirely through the Global MCP server's code-mode tools. Zero manual UI interaction — a Claude agent did everything.

The pipeline (fully automated)

1. Collect incident data  →  2. Upload to DR  →  3. Autopilot  →  4. Deploy  →  5. Predict
   (GitHub + synthesis)       (Datasets API)    (XGBoost wins)   (Deployment)  (live triage)

Step 1: Discovered what's in the DataRobot API

Used search_datarobot_api to introspect the live OpenAPI spec (1,230 endpoints across v2.5–v2.44):

# Find newest endpoints, grouped by the API version that introduced them
from collections import defaultdict

by_version = defaultdict(list)
for path, ops in spec["paths"].items():
    for method, details in ops.items():
        ver = details.get("x-versionadded")
        if ver:
            by_version[str(ver)].append(f"{method.upper()} {path}")

Newest endpoints (v2.44):

  • PUT/GET/DELETE /api/v2/deployments/{id}/agentCard/ — Agent cards for MCP-style deployment metadata ← very relevant
  • GET /api/v2/customApplications/{id}/usages/download/

Step 2: Queried live DataRobot projects

result = datarobot_request("GET", "/api/v2/projects/")

Result: 137 projects found on staging, including 4 "MCP Test" projects created by the team.
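Narrowing those 137 projects down to the team's "MCP Test" projects is a one-line filter. A minimal sketch, assuming the parsed response is a list of project objects each carrying an `id` and a `projectName` key (the field names here are taken from the DataRobot v2 API's project listing; the sample data is made up):

```python
def find_projects(projects, needle):
    """Return (id, name) pairs for projects whose name contains `needle`."""
    return [(p.get("id"), p["projectName"])
            for p in projects
            if needle in p["projectName"]]

# Stand-in for the parsed GET /api/v2/projects/ response:
sample = [
    {"id": "abc123", "projectName": "MCP Test 1"},
    {"id": "def456", "projectName": "Churn Model"},
]
print(find_projects(sample, "MCP Test"))  # → [('abc123', 'MCP Test 1')]
```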


Step 3: Built training dataset

Collected incident-like data from DataRobot GitHub repos, then generated 800 realistic incident records based on actual DR services and team routing patterns.

Features engineered:

Feature                               Signal
service                               predictions, auth-service, notebooks, public-api, ...
environment                           MTS / STS / Self-Managed
alert_source                          Chronosphere, Customer report, PagerDuty, ...
is_customer_reported                  Strong P1/P2 signal
n_services_affected                   Blast radius proxy
hour_of_day                           2-4am incidents skew P1
affects_auth / affects_predictions    Critical service flags
comment_count                         Chaos level proxy

Priority distribution:

P1:  8%  (64 incidents)
P2: 20% (160 incidents)  
P3: 45% (360 incidents)
P4: 27% (216 incidents)
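The synthesis step might look like the sketch below. Everything here is illustrative: the field values and weights come from the feature table and the priority distribution above, but the real generator also mixed in GitHub-derived data.

```python
import random

SERVICES = ["predictions", "auth-service", "notebooks", "public-api", "datasets-service"]
PRIORITIES = ["P1", "P2", "P3", "P4"]
WEIGHTS = [0.08, 0.20, 0.45, 0.27]  # the distribution shown above

def make_incident(rng: random.Random) -> dict:
    """Generate one synthetic incident record (illustrative fields only)."""
    service = rng.choice(SERVICES)
    source = rng.choice(["Chronosphere", "Customer report", "PagerDuty", "Internal detection"])
    return {
        "service": service,
        "environment": rng.choice(["MTS", "STS", "Self-Managed"]),
        "alert_source": source,
        "is_customer_reported": int(source == "Customer report"),
        "n_services_affected": rng.randint(1, 4),
        "hour_of_day": rng.randint(0, 23),
        "affects_auth": int(service == "auth-service"),
        "affects_predictions": int(service == "predictions"),
        "comment_count": rng.randint(0, 30),
        "priority": rng.choices(PRIORITIES, weights=WEIGHTS)[0],
    }

rng = random.Random(42)
records = [make_incident(rng) for _ in range(800)]
```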

Step 4: Uploaded to DataRobot and ran Autopilot

# Upload dataset
result = datarobot_request("POST", "/api/v2/datasets/fromFile/", ...)
# dataset_id = "69cad6b74805b4d04e6d1bf2" — 800 rows, 20 columns

# Create project
result = datarobot_request("POST", "/api/v2/projects/",
    body={"projectName": "DR Autonomous Incident Triage", "datasetId": dataset_id})
# project_id = "69cad6df459b32592bd20b4b"

# Start Autopilot
result = datarobot_request("PATCH", f"/api/v2/projects/{project_id}/aim/",
    body={"target": "priority", "mode": "quick"})
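Autopilot runs asynchronously, so the agent has to poll before moving on to deployment. A sketch of that wait loop; the `fetch` callable stands in for a call like `datarobot_request("GET", f"/api/v2/projects/{project_id}/status/")`, and the `autopilotDone` field is assumed to match DataRobot's project status payload:

```python
import time

def wait_for_autopilot(fetch, poll_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll a project-status fetcher until Autopilot reports done.

    `fetch` is any zero-arg callable returning the parsed status dict.
    Injecting `fetch` and `sleep` keeps the loop testable offline.
    """
    waited = 0.0
    while True:
        status = fetch()
        if status.get("autopilotDone"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"Autopilot not done after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s
```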

Autopilot results (8 models trained):

eXtreme Gradient Boosted Trees Classifier     AUC=1.0  LogLoss=0.00472
Stochastic Gradient Descent Classifier        AUC=1.0  LogLoss=0.00054  ← best LogLoss
Keras Slim Residual Neural Network            AUC=1.0  LogLoss=0.00619
RandomForest Classifier (Gini)                AUC=1.0  LogLoss=0.00777

Step 5: Deployed XGBoost model

XGBoost was deployed instead of the lowest-LogLoss SGD model because it has the best SHAP support (explainable routing decisions).

result = datarobot_request("POST", "/api/v2/deployments/fromLearningModel/",
    body={
        "modelId": "69cad71e9ec3ad630b7fa99b",
        "label": "Autonomous Incident Triage — Priority Classifier",
        "defaultPredictionServerId": "5ddfc674b7b540002763b179",
        "importance": "HIGH",
    })
# deployment_id = "69cad86fc69d7fdb764fea72"

Step 6: Live predictions on 5 simulated new incidents

===========================================================================
  AUTONOMOUS INCIDENT TRIAGE — DataRobot Prediction Results
===========================================================================

Incident: NEW-001
  Title:       CRITICAL: predictions completely down in MTS
  Service:     predictions (MTS) via Customer report
  ─────────────────────────────────────────────
  Predicted:   P1 (89.7% confidence)
  Probabilities: P1=89.7%  P2=1.4%  P3=4.7%  P4=4.1%
  Route to:    MLOps Platform
  Action:      🚨 PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call

Incident: NEW-002
  Title:       Production outage: auth-service 500s for all users
  Service:     auth-service (MTS) via PagerDuty
  ─────────────────────────────────────────────
  Predicted:   P1 (90.2% confidence)
  Probabilities: P1=90.2%  P2=1.1%  P3=5.3%  P4=3.4%
  Route to:    Cloud Operations
  Action:      🚨 PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call

Incident: NEW-003
  Title:       notebooks intermittent errors affecting some users
  Service:     notebooks (STS) via Chronosphere
  ─────────────────────────────────────────────
  Predicted:   P3 (99.8% confidence)
  Probabilities: P1=0.0%  P2=0.0%  P3=99.8%  P4=0.1%
  Route to:    CFX
  Action:      🔶 Create Jira ticket, assign to service team, normal SLA

Incident: NEW-004
  Title:       datasets-service job queue backup, processing delayed
  Service:     datasets-service (MTS) via Internal detection
  ─────────────────────────────────────────────
  Predicted:   P3 (99.5% confidence)
  Probabilities: P1=0.1%  P2=0.1%  P3=99.5%  P4=0.3%
  Route to:    Analytics
  Action:      🔶 Create Jira ticket, assign to service team, normal SLA

Incident: NEW-005
  Title:       Customer-reported: public-api returning incorrect results
  Service:     public-api (MTS) via Customer report
  ─────────────────────────────────────────────
  Predicted:   P2 (97.9% confidence)
  Probabilities: P1=0.5%  P2=97.9%  P3=0.5%  P4=1.1%
  Route to:    AI API
  Action:      ⚠️  Assign to on-call team lead, create IR ticket, monitor closely
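The per-incident actions above follow a simple priority-to-action policy. A minimal sketch of that mapping, with the action strings taken from the output above; the 0.80 review threshold is an illustrative assumption, not part of the deployed model:

```python
def triage_action(priority: str, confidence: float) -> str:
    """Map a predicted priority to the runbook action shown above."""
    actions = {
        "P1": "PAGE IC IMMEDIATELY — create #ir-NNN channel, page on-call",
        "P2": "Assign to on-call team lead, create IR ticket, monitor closely",
        "P3": "Create Jira ticket, assign to service team, normal SLA",
        "P4": "Create Jira ticket, assign to service team, normal SLA",
    }
    # Assumed threshold: below 80% confidence, keep a human in the loop.
    suffix = "" if confidence >= 0.80 else "  (low confidence: flag for human review)"
    return actions[priority] + suffix
```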

What's live on staging

Asset        ID
Dataset      69cad6b74805b4d04e6d1bf2
Project      69cad6df459b32592bd20b4b
Model        69cad71e9ec3ad630b7fa99b  (XGBoost)
Deployment   69cad86fc69d7fdb764fea72

What makes this compelling for the agentic world

The deployed model becomes a persistent ML brain any agent can call:

# Any agent, any time:
incident = parse_alert(pagerduty_webhook)
features = extract_features(incident)
prediction = datarobot_request("POST", f"/api/v2/deployments/{deployment_id}/predictions/", features)

# Agent now has:
# - priority (P1/P2/P3/P4)  
# - confidence score
# - SHAP explanations: "predicted P1 because customer_reported (+0.4), n_services=3 (+0.3)"
# - routing team

auto_create_jira_ir(prediction)
if prediction.priority == "P1":
    auto_page_oncall(prediction)
auto_post_slack_triage(prediction)

Key insight: DR replaces tribal knowledge with a continuously improving ML model trained on your own incident history. As incidents resolve, submit actuals → model drift detection → auto-retrain.
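Closing that loop means reporting resolved outcomes back to the deployment. DataRobot's MLOps API accepts actuals via POST /api/v2/deployments/{id}/actuals/fromJSON/; a sketch of building that payload, assuming the incident ID was used as the association ID at prediction time:

```python
def actuals_payload(resolved: dict) -> dict:
    """Build the body for POST /api/v2/deployments/{id}/actuals/fromJSON/.

    `resolved` maps an association ID (here, the incident ID sent with the
    original prediction) to the priority responders finally assigned.
    """
    return {"data": [{"associationId": iid, "actualValue": actual}
                     for iid, actual in resolved.items()]}

payload = actuals_payload({"NEW-001": "P1", "NEW-003": "P3"})
# Then submit it, e.g.:
# datarobot_request("POST", f"/api/v2/deployments/{deployment_id}/actuals/fromJSON/",
#                   body=payload)
```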
