@srid
Created April 26, 2026 19:37
Comparing 4 LLM agents (GPT 5.5, Opus 4.7, Kimi, GLM) on the same issue (juspay/kolu#712)

Four agents, one issue: comparing implementations of juspay/kolu#712

All four PRs respond to the same task — issue #712:

In Debug → Diagnostic info, add information about the various processes the server is running and holding in memory:

  • Active filesystem watches (agents — sqlite, jsonl — sidebar tree, etc.)
  • Server memory usage & uptime
  • Collapse "Recent events" by default, expand on click
  • Group xterm-related stuff
  • Group server-related stuff separately

Each agent worked independently on the same master baseline. Below is a side-by-side comparison.

At a glance

| PR | Agent | Files | +Lines | -Lines | Commits | Approach |
|---|---|---|---|---|---|---|
| #741 | GPT 5.5 (codex) | 43 | 1348 | 466 | 11 | New runtime-diagnostics package, integrations rewired, e2e test |
| #743 | Opus 4.7 | 15 | 387 | 77 | 12 | Categorical watch RPC; long refactor tail (hickey/lowy/elegance) |
| #746 | Kimi | 5 | 148 | 34 | 1 | Adds memory/uptime + section grouping; does not deliver the watches list |
| #747 | GLM | 13 | 293 | 47 | 4 | New watch-registry.ts, three sections, fact-check fix |

What each one did

#741 — GPT 5.5 (codex): biggest blast radius

Carved out a brand-new packages/runtime-diagnostics workspace, then threaded a register/cleanup resource API into every integration package (anyagent, claude-code, codex, git, github, opencode) so file watches, timers, subscriptions, and SQLite handles all funnel through one registry. The client dialog was split into three component files (BrowserDiagnosticsSection, ServerDiagnosticsSection, XtermDiagnosticsSection) plus a format.ts and a useDiagnosticSnapshot.ts hook. Also added a Cucumber e2e (diagnostic-info.feature + step defs) and patched default.nix so the new package ships.

Trade-offs. The only PR that truly encapsulates the runtime-resource concern across the codebase, and the only one with an e2e. But 1.3k lines and a new package for what was scoped as a debug-dialog enhancement is heavy — and the changes to claude-code/core.ts, wal-subscription.ts, and three session-watchers each carry their own regression surface.
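
To make the register/cleanup idea concrete, here is a minimal sketch of what a registry of this kind could look like. All names (`ResourceRegistry`, `ResourceKind`, `register`, `snapshot`) are hypothetical and chosen for illustration; the actual API in packages/runtime-diagnostics may look quite different.

```typescript
// Hypothetical sketch: names are illustrative, not the PR's actual API.
type ResourceKind = "file-watch" | "timer" | "subscription" | "sqlite-handle";

interface ResourceEntry {
  kind: ResourceKind;
  owner: string; // e.g. the integration package that created the resource
  label: string; // human-readable description, e.g. a watched path
}

class ResourceRegistry {
  private nextId = 0;
  private readonly entries = new Map<number, ResourceEntry>();

  /** Record a live resource; the returned function removes it again. */
  register(entry: ResourceEntry): () => void {
    const id = this.nextId++;
    this.entries.set(id, entry);
    // Map.delete on a missing key is a no-op, so cleanup is safe to call twice.
    return () => {
      this.entries.delete(id);
    };
  }

  /** Point-in-time copy that a diagnostics dialog could render. */
  snapshot(): ResourceEntry[] {
    return Array.from(this.entries.values()).map((e) => ({ ...e }));
  }
}

const registry = new ResourceRegistry();
const unregisterWatch = registry.register({
  kind: "file-watch",
  owner: "git",
  label: ".git/HEAD",
});
registry.register({ kind: "timer", owner: "github", label: "poll" });

console.log(registry.snapshot().length); // two live resources
unregisterWatch();
console.log(registry.snapshot().length); // one left after cleanup
```

Under this shape, every integration that opens a watch or timer has to call `register` and later invoke the returned cleanup, which is where the cross-package churn comes from.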

#743 — Opus 4.7: surgical, with a discipline-heavy refactor tail

Introduced a one-shot server.diagnostics RPC returning a categorical view of watches (git-head per terminal, claude-transcript per active session, shared agent-external:* per provider kind) instead of trying to enumerate every fs.watch call site. The PR description explicitly notes that "instrumenting every fs.watch site would be invasive churn for modest payoff", which is exactly the cost #741 paid. Reorganized the dialog into Browser / Server / Watches / Terminals / WebGL sections, with a native <details> element for the collapsible Recent events and <Switch>/<Match> for explicit error/loading/empty/data branching.

The commit log is the most interesting part: 12 commits, with eight tagged refactor(hickey), refactor(lowy), or refactor(police) — each a single, named structural improvement (atomic snapshot, extract captureMetrics, move pluralization off the server, narrow accessor return shape, drop redundant snapshot().server indirection, remove dead countActiveClaudeSessions). Reads like the /do workflow's quality passes actually firing.

Trade-offs. No new tests. It doesn't enumerate individual watcher handles; it's a count-by-category design. That's a deliberate scoping call, but it does mean a future watcher kind needs a registry update.
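
The count-by-category approach can be sketched as deriving watch counts from state the server already tracks, rather than hooking each fs.watch call. The category names come from the description above; the `ServerState` shape and the `watchCounts` function are assumptions made for illustration, not the PR's code.

```typescript
// Hypothetical sketch: field and function names are assumptions.
interface ServerState {
  terminalIds: string[]; // one git-head watch per terminal
  activeClaudeSessionIds: string[]; // one claude-transcript watch per active session
  externalProviderKinds: string[]; // providers with a shared agent-external watch
}

interface WatchCategoryCount {
  kind: string;
  count: number;
}

function watchCounts(state: ServerState): WatchCategoryCount[] {
  const counts: WatchCategoryCount[] = [
    { kind: "git-head", count: state.terminalIds.length },
    { kind: "claude-transcript", count: state.activeClaudeSessionIds.length },
  ];
  // Shared watchers exist once per distinct provider kind.
  const kinds = Array.from(new Set(state.externalProviderKinds)).sort();
  for (const kind of kinds) {
    counts.push({ kind: `agent-external:${kind}`, count: 1 });
  }
  return counts;
}

console.log(
  watchCounts({
    terminalIds: ["t1", "t2"],
    activeClaudeSessionIds: ["s1"],
    externalProviderKinds: ["codex", "opencode"],
  }),
);
```

The trade-off noted above is visible here: a new kind of watcher shows up in the dialog only if someone adds a line for it to this function.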

#746 — Kimi: smallest, and misses the headline ask

A single squash commit. Adds a Server section with hostname, uptime, RSS + heap. Reorganizes into Server / Browser / Session / Terminals / xterm.js / WebGL groups. Collapses User agent and Recent events.

Trade-offs. Cleanest diff by far (+148/-34 across 5 files, no new packages). But the issue's first bullet, "Active filesystem watches", is not addressed at all. It's a partial solution that nails the cosmetic asks and skips the substantive one.
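
For reference, the memory and uptime half of the issue needs very little code in Node.js. The following is a minimal sketch using the standard `process.memoryUsage()`, `process.uptime()`, and `os.hostname()` calls; the `ServerInfo` shape and the helper names are hypothetical, not what the PR ships.

```typescript
import * as os from "node:os";

// Hypothetical shape: field names are assumptions, not the PR's actual type.
interface ServerInfo {
  hostname: string;
  uptimeSeconds: number;
  rssBytes: number;
  heapUsedBytes: number;
  heapTotalBytes: number;
}

function captureServerInfo(): ServerInfo {
  const mem = process.memoryUsage();
  return {
    hostname: os.hostname(),
    uptimeSeconds: Math.floor(process.uptime()), // seconds since process start
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    heapTotalBytes: mem.heapTotal,
  };
}

// Display helpers, kept separate from the raw numbers.
function formatUptime(totalSeconds: number): string {
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${h}h ${m}m ${s}s`;
}

function formatMiB(bytes: number): string {
  return `${(bytes / (1024 * 1024)).toFixed(1)} MiB`;
}

const info = captureServerInfo();
console.log(info.hostname, formatUptime(info.uptimeSeconds), formatMiB(info.rssBytes));
```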

#747 — GLM: pragmatic middle ground

Adds a server.diagnostics RPC returning memory, uptime, PID, Node version, active watch counts, and session/publisher counts. Introduces a small packages/server/src/watch-registry.ts (24 lines) — a centralized registry that git-HEAD watchers and agent session watchers register/unregister against. Three sections: Client / Server / xterm. Recent events collapsed by default.

Notable commit: fix(police): fact-check — don't unregister process-lifetime watch on terminal cleanup. The agent caught its own bug — external-change watchers are installed once per provider kind and live for the whole process; unregistering when the first installing terminal shut down would have left the registry under-counting. Catching that during self-review is the kind of thing the /do pipeline is supposed to surface.

Trade-offs. Lighter than Opus's refactor tail and lighter than codex's package split, but the watch-registry module only tracks the watchers the author remembered to wire up — no compile-time guarantee that future fs.watch callers will register.
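
The bug class behind the fact-check commit can be illustrated with a small registry that distinguishes terminal-scoped watches from process-lifetime ones. This is a sketch under assumptions: the `WatchScope` type, the key format, and the function names are hypothetical and are not the contents of watch-registry.ts.

```typescript
// Hypothetical sketch of the bug class; not the actual watch-registry.ts.
type WatchScope = "terminal" | "process";

interface WatchRecord {
  kind: string;
  scope: WatchScope;
  terminalId?: string; // set only for terminal-scoped watches
}

const watches = new Map<string, WatchRecord>();

function registerWatch(key: string, record: WatchRecord): void {
  watches.set(key, record);
}

/** Called when a terminal shuts down: drop only that terminal's own watches. */
function cleanupTerminal(terminalId: string): void {
  for (const [key, record] of Array.from(watches.entries())) {
    if (record.scope === "terminal" && record.terminalId === terminalId) {
      watches.delete(key);
    }
  }
}

registerWatch("git-head:t1", { kind: "git-head", scope: "terminal", terminalId: "t1" });
// Installed when some terminal first needed it, but it outlives that terminal.
registerWatch("agent-external:codex", { kind: "agent-external:codex", scope: "process" });

cleanupTerminal("t1");
console.log(Array.from(watches.keys())); // only the process-lifetime watch remains
```

Unregistering the process-scoped entry in `cleanupTerminal` would reproduce the under-count described above: the watcher would still be running while the registry reported it gone.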

Comparison axes

Coverage of the issue checklist

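| Requirement | #741 | #743 | #746 | #747 |
|---|---|---|---|---|
| Active FS watches | ✅ (registry across integrations) | ✅ (categorical) | ❌ | ✅ (small registry, server-only) |
| Memory + uptime | | ✅ | ✅ | ✅ |
| Collapse Recent events | | ✅ (native <details>) | ✅ | ✅ |
| Group xterm | ✅ | ✅ | ✅ | ✅ |
| Group server | ✅ | ✅ | ✅ | ✅ |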

Architectural decisions

  • #741 treats this as "the codebase needs a runtime-resource concept" — the most ambitious read. The new package is reusable; the cost is touching every integration.
  • #743 treats this as "the dialog needs facts the server already knows" — minimal new infrastructure, categorical aggregation, wire shape stays facts-only with rendering on the client (a Lowy-style boundary).
  • #747 lands between: a small registry on the server side only, no integration churn, fewer abstractions.
  • #746 treats this as "polish the dialog" and stops at memory/uptime.

Refactor / self-review discipline

  • #743 is the standout: 8 named refactor commits, each with a one-line why. Several reference Hickey/Lowy frameworks by name (the project ships /hickey and /lowy skills). Looks like the agent actually ran them.
  • #747 has one substantive self-found bug fix (fact-check commit) and one hickey pass (deduplicate type, surface errors).
  • #741 has multiple refactor commits but most have terse one-line bodies; harder to audit why each landed.
  • #746 is one squash commit — no visible self-review trail.

Boundary hygiene (Lowy lens)

  • #743 is the cleanest example: server emits { kind, sharedReconcilers: number }, client decides "shared across N terminal(s)". UI-layer pluralization stays out of the wire.
  • #741 does the moral equivalent by restricting the public server.diagnostics shape and keeping Claude-specific counters in an internal periodic log.
  • #746 ships a string User agent and untyped sections; cosmetic only.
  • #747 mixes counters and identity directly; no obvious wire/UI separation issue but no explicit boundary either.
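
The wire/UI split described for #743 can be sketched as follows. The `{ kind, sharedReconcilers }` shape is quoted from the comparison above; the `describeSharedWatch` helper and its exact wording are hypothetical.

```typescript
// Wire shape as described above; the rendering helper is a hypothetical illustration.
interface SharedWatchFact {
  kind: string;
  sharedReconcilers: number;
}

// Client side: wording and pluralization live here, not on the server.
function describeSharedWatch(fact: SharedWatchFact): string {
  const n = fact.sharedReconcilers;
  const noun = n === 1 ? "terminal" : "terminals";
  return `${fact.kind}: shared across ${n} ${noun}`;
}

console.log(describeSharedWatch({ kind: "agent-external:codex", sharedReconcilers: 1 }));
console.log(describeSharedWatch({ kind: "agent-external:codex", sharedReconcilers: 3 }));
```

Because the server sends only numbers and identifiers, the client can change its wording or localize it without any change to the RPC.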

Risk

  • #741: highest. Touches six integration packages and adds a new workspace member. Any regression in WAL subscription teardown or session-watcher cleanup ships under this PR.
  • #743: low. Confined to DiagnosticInfo.tsx, diagnostics.ts, meta/agent.ts, plus a terminals.ts cleanup of an orphaned counter.
  • #746: lowest. Tiny additive change.
  • #747: low–medium. New watch-registry module is small, but the registration sites are scattered and there's no test.

Verdict (subjective)

For this issue as written, #743 (Opus 4.7) is the best fit: it answers every bullet, picks a defensible aggregation strategy, and the commit log is a model of how the project's /do and /hickey skills are supposed to compose. #747 (GLM) is a strong runner-up — pragmatic, found its own bug, lighter on refactor ceremony.

#741 (codex) would be the right answer to a different, larger question — "how should runtime resource accounting work across this codebase?" — and as a self-contained PR for a debug dialog it overshoots scope.

#746 (Kimi) ships the cosmetic half cleanly but skips the load-bearing requirement; it would need a follow-up to actually close #712.


Generated 2026-04-26 from PR metadata at the time of writing.