The PR dashboard uses a weighted composite of ~12 features to score open PRs for merge readiness. The weights were originally hypothesized. This experiment uses data from 980 recently merged PRs across 11 dotnet repos to empirically calibrate them.
TL;DR: Discussion complexity is massively underweighted (1.5 should be ~2.5-4.5). CI and maintainer-review are overweighted. Size and community matter more than expected. But the strongest statistical predictor (number of distinct commenters) creates a death-spiral problem if used naively in the score. A dual-score system (merge readiness + deserves attention) may be more useful than a single composite score.
- Fetched 980 merged PRs via GitHub GraphQL across 11 repos (runtime, aspnetcore, roslyn, sdk, maui, msbuild, extensions, machinelearning, aspire, winforms, wpf)
- Extracted: reviews, check runs (Build Analysis specifically), review threads, labels, size, author classification
- Inferred per-repo maintainers from `mergedBy` data (much more accurate than a static list)
- Fetched linked issue metadata (reactions, labels, comments, milestones, cross-references)
- Ran OLS regression, logistic regression, Random Forest, Gradient Boosting, Lasso, and Ridge
- Bootstrap stability analysis (500 resamples)
- Event-gap analysis (time from each event to merge)
- Per-repo breakdowns
- Dual-score analysis (merge readiness vs. deserves attention)
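The regression and bootstrap steps can be sketched in miniature. This is an illustrative pure-Python version on synthetic data; the real analysis used the full feature set and library models, and `ols_slope`, the synthetic feature, and the true slope of 0.5 are all stand-ins:

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x (single-predictor OLS)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_cv(xs, ys, resamples=500, seed=0):
    """Coefficient of variation of the slope across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(resamples):
        idx = rng.choices(range(n), k=n)  # resample n rows with replacement
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.stdev(slopes) / abs(statistics.fmean(slopes))

# Synthetic stand-in: commenter count vs. log merge-days, true slope 0.5.
rng = random.Random(1)
commenters = [rng.randint(1, 10) for _ in range(200)]
log_days = [0.5 * c + rng.gauss(0.0, 1.0) for c in commenters]
cv = bootstrap_cv(commenters, log_days)  # small CV => stable coefficient
```

A low CV across resamples is what "stable estimate" means in the bootstrap results below.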
| Stat | Value |
|---|---|
| Total PRs | 980 |
| Repos | 11 |
| Median merge time | 1.0 days |
| Merged within 1 day | 51% |
| Merged within 7 days | 82% |
| Community PRs (inferred) | 34% |
| Has owner approval | 50% |
| PRs with linked issues | 15% |
| PRs with milestones | 29% |
Fastest repos: aspire (0.2d), winforms (0.2d). Slowest: maui (6.2d median, 76.5d mean).
The number of distinct commenters and review threads is by far the strongest predictor of merge speed, significant in 7 of 11 repos. The current weight of 1.5 dramatically understates its importance.
Decomposition of the discussion signal:
| Component | R-squared alone | Interpretation |
|---|---|---|
| distinct_commenters | 0.228 | More stakeholders = slower |
| total_comments | 0.234 | Raw engagement volume |
| total_threads | 0.118 | Review thread count |
| changes_requested | 0.117 | Explicit review feedback |
| unresolved_threads | 0.036 | Active blockers |
| resolution_rate | 0.018 | Weak signal |
Discussion adds 16.7% R-squared beyond all other features combined. It is not just a proxy for PR size (correlation with size = 0.36).
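The "R-squared alone" column is a one-predictor fit per feature, which reduces to the squared Pearson correlation with the (log) merge-time target. A minimal sketch:

```python
import statistics

def r_squared_alone(feature, target):
    """R² of a single-feature OLS fit = squared Pearson correlation."""
    mx, my = statistics.fmean(feature), statistics.fmean(target)
    sxy = sum((x - mx) * (y - my) for x, y in zip(feature, target))
    sxx = sum((x - mx) ** 2 for x in feature)
    syy = sum((y - my) ** 2 for y in target)
    return sxy * sxy / (sxx * syy)

# A feature that tracks the target perfectly scores 1.0.
assert abs(r_squared_alone([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
```

Running this once per feature against log merge days reproduces the "alone" column; the "beyond all other features" figure instead compares the full model with and without the discussion block.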
CI (Build Analysis specifically) doesn't appear predictive in naive regression (p=0.77 with overall CI status). But:
- Using Build Analysis specifically (matching dashboard behavior) makes it significant (p=0.008 in BA-present repos)
- Event-gap analysis shows BA pass is the last gate before merge 70% of the time
- Median time from BA pass to merge: 0.6 hours (53% merge within 1h of CI passing)
- The regression underestimates CI because all merged PRs eventually pass it (survivor bias)
Important: Build Analysis is absent in ~40% of repos (extensions, msbuild, winforms, wpf don't have it). And in maui, BA is red 78% of the time.
The static maintainers.json classified 78% of PR authors as "community." Inferring maintainers from mergedBy data (anyone who merged >=2 PRs) reduced this to 34% and made the community signal significant (p=0.002).
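A minimal sketch of the inference rule, assuming simplified field names (`author`, `merged_by`) rather than the real GraphQL schema:

```python
from collections import Counter

def infer_maintainers(merged_prs, min_merges=2):
    """Anyone who merged >= min_merges PRs in a repo counts as a maintainer."""
    merges = Counter(pr["merged_by"] for pr in merged_prs if pr.get("merged_by"))
    return {login for login, n in merges.items() if n >= min_merges}

def classify_author(pr, maintainers):
    return "maintainer" if pr["author"] in maintainers else "community"

# Illustrative data (field names are assumptions, not the real schema).
prs = [
    {"author": "alice", "merged_by": "bob"},
    {"author": "carol", "merged_by": "bob"},
    {"author": "bob",   "merged_by": "dave"},
]
maintainers = infer_maintainers(prs)                       # bob merged 2 PRs
labels = [classify_author(pr, maintainers) for pr in prs]
```

Running the same rule per repo is what shrank the "community" share from 78% to 34%.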
Per-repo community-vs-maintainer merge speed gaps:
- runtime: 3.9d vs 1.2d (3.3x slower for community)
- machinelearning: 3.9d vs 0.8d (4.9x slower)
- roslyn: 0.1d vs 1.1d (community faster -- they tend to submit small PRs)
Using raw thread count in the score creates a death spiral: more discussion on a PR lowers its score, reducing attention, making it staler. The data confirms high-discussion PRs are genuinely the most significant work:
| Thread count | Median lines | Median files | Median commenters | Median age |
|---|---|---|---|---|
| 0-5 | 47 | 3 | 1 | 0.9d |
| 6-15 | 364 | 7 | 4 | 4.6d |
| >15 | 814 | 18 | 4 | 13.2d |
Alternative metrics tested:
| Metric | R-squared (full model) | Death spiral? |
|---|---|---|
| A. Current (raw count) | 0.292 | YES |
| D. Commenters only | 0.315 | Partial |
| E. Hybrid (unresolved + commenters) | 0.219 | No |
| C. Unresolved only | 0.132 | No |
`distinct_commenters` alone actually predicts better than the current raw-count metric.
| Repo | R-squared | Top Predictors | Median Age |
|---|---|---|---|
| sdk | 0.61 | discussion | 1.0d |
| maui | 0.58 | discussion, size, community, align | 5.5d |
| winforms | 0.44 | discussion, approval, size | 0.2d |
| extensions | 0.41 | discussion | 1.5d |
| aspnetcore | 0.41 | discussion | 0.7d |
| aspire | 0.36 | discussion, size | 0.2d |
| runtime | 0.33 | discussion, community | 2.2d |
| roslyn | 0.33 | approval, size | 0.8d |
| msbuild | 0.26 | approval | 1.9d |
| machinelearning | 0.20 | size | 1.7d |
| wpf | 0.05 | (none) | 1.0d |
- msbuild & roslyn: Approval is the key gate (compiler teams need specific reviewers)
- maui: Most complex dynamics; many factors matter
- wpf: Essentially unpredictable from these features
We fetched closingIssuesReferences for all 980 PRs to test whether linked issue engagement (reactions, comments, cross-references, labels) adds predictive value.
Issue features predict merge speed -- but in the WRONG direction:
| Feature | R-squared | Coefficient | p-value |
|---|---|---|---|
| has_linked_issue | 0.141 | +1.321 | <0.001 |
| log_cross_refs | 0.143 | +1.072 | <0.001 |
| log_issue_comments | 0.119 | +0.819 | <0.001 |
| log_issue_reactions | 0.070 | +1.056 | <0.001 |
| is_bug | 0.014 | +1.230 | <0.001 |
| has_milestone | 0.000 | -0.009 | 0.915 |
Positive coefficients mean PRs with more issue engagement take LONGER to merge. This confirms issue engagement signals importance/complexity, not readiness. Adding issue features beyond the dashboard sub-scores raises R-squared from 0.292 to 0.340 (+4.8 points).
This motivates a dual-score system:
Score 1: Merge Readiness -- how mechanically close is this PR to merging?
- Inputs: CI, approvals, conflicts, size, resolved threads, alignment
Score 2: Deserves Attention -- how much should a maintainer prioritize this PR?
- Inputs: urgency labels, issue engagement, community demand, effort-at-risk, blockers
The correlation between the two scores is -0.63 -- they genuinely surface different PRs. The negative correlation means PRs that "deserve attention" are typically NOT close to merging.
| HIGH Merge Readiness | LOW Merge Readiness | |
|---|---|---|
| HIGH Attention | Q1: "Help across finish line" (n=176): median 0.9d, 72% community, 14 lines, 20% linked issues | Q2: "Invest review time" (n=355): median 2.5d, 59% community, 173 lines, 26% linked issues |
| LOW Attention | Q3: "Will merge on its own" (n=337): median 0.5d, 0% community, 44 lines, 3% linked issues | Q4: "Deprioritize" (n=112): median 1.8d, 0% community, 358 lines, 4% linked issues |
- Q1 -- Community PRs that are small and nearly ready. Quickest wins for maintainer time.
- Q2 -- Community PRs with complexity, far from merge. Need investment to unblock.
- Q3 -- Internal small PRs, merge quickly without help. Autopilot.
- Q4 -- Internal large PRs, taking time. Lower priority for active triage.
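The quadrant assignment is a simple threshold rule on the two scores; a sketch, with 0.5 cutoffs as illustrative placeholders (the real split depends on how each score is normalized):

```python
def quadrant(readiness, attention, readiness_cut=0.5, attention_cut=0.5):
    """Map (merge-readiness, deserves-attention) scores to a triage quadrant."""
    high_r = readiness >= readiness_cut
    high_a = attention >= attention_cut
    if high_r and high_a:
        return "Q1: help across finish line"
    if not high_r and high_a:
        return "Q2: invest review time"
    if high_r:
        return "Q3: will merge on its own"
    return "Q4: deprioritize"
```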
Several features point in OPPOSITE directions for the two scores:
| Feature | Merge Readiness | Deserves Attention | Tension |
|---|---|---|---|
| CI passing | HIGH = ready | Failing = needs help | Opposite |
| Has approval | HIGH = ready | Missing = needs review | Opposite |
| Small size | HIGH = ready | Large = significant work | CONFLICT |
| Internal author | HIGH = ready | Community = waiting on us | CONFLICT |
| Few commenters | HIGH = ready | Many = important to community | CONFLICT |
| Issue reactions | (not used) | HIGH = community demand | Attention only |
| Bug/regression label | (not used) | HIGH = urgency | Attention only |
| Milestone | (not used) | Has deadline | Attention only |
| Cross-references | (not used) | Broad impact | Attention only |
- URGENCY (0-4 pts): regression +4, security +4, bug +1, milestone +1
- COMMUNITY DEMAND (0-3): issue thumbs-up (>=10: +2, >=3: +1), comments (>=20: +1.5), cross-references (>=3: +1)
- EFFORT-AT-RISK (0-3): community author +2, has reviews but no approval +1, large change (>200 lines) +0.5
- BLOCKED (0-2): CI failing +1, unresolved feedback +1, no approval +1.5
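A sketch of this rubric as a function. Field names are assumptions, and each bucket is clamped to its stated range; that clamping is itself an assumption, since the listed points can sum past the range (e.g. the BLOCKED items total 3.5 against a 0-2 cap):

```python
def attention_score(pr):
    """Illustrative 'deserves attention' score; buckets clamp to stated ranges."""
    labels = set(pr.get("labels", []))
    issue = pr.get("issue", {})
    thumbs = issue.get("reactions", 0)

    urgency = min(4.0, 4.0 * ("regression" in labels) + 4.0 * ("security" in labels)
                  + 1.0 * ("bug" in labels) + 1.0 * bool(pr.get("milestone")))
    demand = min(3.0, (2.0 if thumbs >= 10 else 1.0 if thumbs >= 3 else 0.0)
                 + 1.5 * (issue.get("comments", 0) >= 20)
                 + 1.0 * (issue.get("cross_refs", 0) >= 3))
    effort = min(3.0, 2.0 * (pr.get("author_type") == "community")
                 + 1.0 * (pr.get("reviews", 0) > 0 and not pr.get("approved", False))
                 + 0.5 * (pr.get("lines", 0) > 200))
    blocked = min(2.0, 1.0 * (pr.get("ci", "") == "failing")
                  + 1.0 * (pr.get("unresolved_threads", 0) > 0)
                  + 1.5 * (not pr.get("approved", False)))
    return urgency + demand + effort + blocked
```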
Key differences from merge readiness:
- Community PRs score HIGHER (they're waiting on maintainer action)
- Large/complex PRs score HIGHER (significant work at stake)
- Issue engagement is a NEW signal (not in merge score)
- CI failing scores HIGHER (needs help, not just "not ready")
- No penalty for many commenters (avoids death spiral)
| Feature | Current | Recommended | Change | Confidence | Rationale |
|---|---|---|---|---|---|
| ciScore | 3.0 | 2.5 | -0.5 | Moderate | Gate (last gate 70%); but BA absent in many repos |
| conflictScore | 3.0 | 3.0 | 0.0 | N/A | Hard gate; can't measure historically |
| approvalScore | 2.0 | 2.5 | +0.5 | Moderate | Gate (40% merge within 1h of approval) |
| maintScore | 3.0 | 1.5 | -1.5 | Lower | Overlaps approval; Lasso drops it; unstable |
| feedbackScore | 2.0 | 2.5 | +0.5 | High | Redesign: unresolved threads + changes_requested |
| discussionScore | 1.5 | 2.5 | +1.0 | Very High | Redesign: based on distinct_commenters; cap at 0.5 min |
| sizeScore | 1.0 | 2.0 | +1.0 | High | Significant in 6/11 repos |
| communityScore | 0.5 | 1.0 | +0.5 | High | Significant with inferred maintainers |
| stalenessScore | 1.5 | 1.0 | -0.5 | Low | Can't validate from post-merge data |
| freshScore | 1.0 | 0.7 | -0.3 | Low | Overlaps staleness |
| alignScore | 1.0 | 0.5 | -0.5 | Lower | Weak predictor; only 2/11 repos |
| velocityScore | 0.5 | 0.3 | -0.2 | Low | Can't validate |
| TOTAL | 20.0 | 20.0 | 0.0 | | |
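As a sanity check, the recommended column re-expressed as a config snippet, verifying the weight budget still sums to 20.0:

```python
# Recommended weights from the table above.
RECOMMENDED_WEIGHTS = {
    "ciScore": 2.5, "conflictScore": 3.0, "approvalScore": 2.5,
    "maintScore": 1.5, "feedbackScore": 2.5, "discussionScore": 2.5,
    "sizeScore": 2.0, "communityScore": 1.0, "stalenessScore": 1.0,
    "freshScore": 0.7, "alignScore": 0.5, "velocityScore": 0.3,
}
# The total budget is unchanged: weights are redistributed, not added.
assert abs(sum(RECOMMENDED_WEIGHTS.values()) - 20.0) < 1e-9
```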
- Consider a dual-score system: A single score conflates "close to merge" with "needs attention." The dual-score analysis shows these are anti-correlated (r = -0.63). Display both, or let users toggle sort mode between "ready to merge" and "needs review."
- Split discussion into feedback + engagement: Separate "unresolved blocking feedback" (actionable, in feedbackScore) from "stakeholder complexity" (informational, in engagementScore). This avoids the death spiral.
- Cap the engagement penalty: `distinct_commenters` should reduce the score to 0.5 at worst, never 0.0. Complex PRs need attention, not burial.
- Use Build Analysis specifically for CI: Overall CI status is noise in repos like runtime where some leg is always red.
- Infer maintainers from merge history: `mergedBy` data is far more accurate than a static list.
- Consider showing complexity separately: Display thread/commenter count as a separate column to set expectations about timeline, rather than penalizing it in the sort score.
- Incorporate linked issue engagement for attention: Issue reactions, cross-references, and bug/regression labels are strong signals for "deserves attention" even though they predict SLOWER merges.
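The capped engagement penalty can be sketched as a score with a hard floor. Only the 0.5 floor and 2.5 maximum come from the recommendations; the linear falloff over 8 commenters is an illustrative choice:

```python
def discussion_score(distinct_commenters, max_score=2.5, floor=0.5):
    """Commenter-based discussion score that bottoms out at `floor`, so heavy
    discussion can lower a PR's rank but never bury it (no death spiral)."""
    penalty = min(1.0, distinct_commenters / 8.0)  # illustrative falloff rate
    return max(floor, max_score * (1.0 - penalty))
```

So a solo-authored PR keeps the full 2.5, while even a 20-commenter PR retains 0.5 and stays visible in the sort.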
- Multiple model types agree (OLS, RF, GB, Lasso, Logistic all give consistent rankings)
- Bootstrap analysis shows the discussion estimate is very stable (CV 8%) and the size estimate reasonably so (CV 36%)
- Event-gap analysis captures gate behavior that regression misses
- Robust across outcome definitions (merge within 1d, 7d, 30d all show same feature ranking)
- Survivor bias: Only merged PRs analyzed. Abandoned PRs would better show CI/conflict as blockers.
- Snapshot, not trajectory: We see cumulative state at merge time, not the journey. A PR that went through 10 review rounds looks the same as one that was clean from the start.
- ~65% unexplained: Reviewer timezone, release schedule, PR priority, and dependency chains are likely the dominant factors but aren't measurable from API data.
- Temporal features untestable: staleness/freshness/velocity (3.0 combined current weight) can't be validated from post-merge data since all merged PRs are "fresh" at merge time.
- Conflict untestable: Historical mergeability state not available. The 3.0 weight is an assumption.
- Global weights: Modestly. 980 PRs gives stable estimates for the top features.
- Per-repo weights: Yes, especially for wpf (R-squared=0.05). 200+ per repo would be more reliable.
- Temporal features: Needs a fundamentally different approach (time-series snapshots of open PRs).
- Gate features: Need to include abandoned/closed-without-merge PRs.
Analysis scripts and data collection code available on request.