Streaming consensus re-evaluation
❯ Refer to @docs/STREAMING_CONSENSUS.md, which contains our architectural improvement design for improving the performance/latency/
efficiency of the protocol. Recall from previous memory our discussion, analysis, trade-offs, decision choices, and rationales. We
will go back to this architectural design and critically re-evaluate the architecture and design decisions from first principles,
evaluating against other approaches in terms of performance, latency, reliability, scalability, throughput, security, fault
tolerance, and (least prioritized) decentralization.
⎿ Read docs/STREAMING_CONSENSUS.md (1772 lines)
⎿ Loaded docs/CLAUDE.md
⏺ I'll invoke the brainstorming skill to properly evaluate this design against alternatives from first principles.
⏺ Skill(superpowers:brainstorming)
⎿ Successfully loaded skill
⏺ Memory check: I searched memory/MEMORY.md and found no entries on NATS / streaming consensus / messaging architecture from prior
conversations. If you're recalling specific prior constraints we agreed on (e.g., "we already ruled out X because Y"), flag them so I
don't re-tread.
Having reread the doc against architecture.md and lessons.md, I want to surface one load-bearing weakness before we commit to an
evaluation frame:
★ Insight ─────────────────────────────────────
The doc's strongest move is picking NATS on complexity+reliability grounds. Its weakest is the diagnosis itself. Line 83 admits the real
cause: "Each operator has a single worker task that processes requests sequentially." That's an operator-internal architecture choice,
not an HTTP/2 transport limit. HTTP/2 already supports multiplexing and bidirectional streams — the gateway could already fan out
non-blockingly today. If we swap transports without fixing operator concurrency, the projected 3-10× gain is aspirational.
The other underweighted issue: adding NATS as a required path makes the gateway+NATS a tighter centralization point. The doc lists
decentralization as "least prioritized", but in an EigenLayer AVS, a central message bus is a new censorship/liveness chokepoint that
the trust model didn't assume.
─────────────────────────────────────────────────
Other weak spots I'll want to evaluate (but don't need to resolve now): (1) gRPC server-streaming and QUIC/HTTP3 server-push are
conspicuously absent from the comparison; (2) interaction with the aggregator_block == taskCreatedBlock invariant (lessons.md) under
early-quorum-exit semantics is unspecified; (3) the 200 vs 1200 LoC comparison is stacked (NATS-with-JetStream vs
WebSocket-from-scratch).
Before I run the full matrix across your eight dimensions, one framing question:
Which mode do you want?
(A) Premise-challenging first — interrogate whether streaming-over-new-transport is even the right lever given Newton's BLS block
invariants, operator-concurrency root cause, and decentralization posture, before comparing transports.
(B) Transport shootout — accept "we need streaming", evaluate NATS vs gRPC server-streaming vs HTTP/2 SSE vs QUIC vs libp2p gossip vs
just-fix-HTTP/2-dispatch.
(C) Both, (A) then (B) — my recommendation. (A) is load-bearing: if the diagnosis is wrong, (B) is rearranging deck chairs.
Which do you want? (Or propose a different frame.)
✻ Sautéed for 2m 18s
❯ "Each operator has a single worker task that processes requests sequentially." --> is this true? Didn't we update this to 8 max
concurrency recently?
⏺ Good catch — let me verify instead of trusting the doc's claim.
Searched for 2 patterns (ctrl+o to expand)
⏺ Let me dig into the actual operator dispatch path and the gateway's per-operator concurrency setting.
Searched for 2 patterns (ctrl+o to expand)
⏺ You're right — the doc is stale. Evidence:
- crates/gateway/src/config.rs:370: max_concurrent_requests: 8 (default)
- crates/gateway/src/config.rs:358-361 comment: "Each request is dispatched concurrently via HTTP/2 multiplexing. With N gateway
replicas, each operator sees up to N * max_concurrent_requests."
- crates/gateway/src/task/operator.rs:624: Semaphore::new(self.max_concurrent_requests) inside each per-operator worker loop
- crates/gateway/README.md:128: "Per-Operator Workers: Dedicated workers with semaphore-bounded concurrent dispatch (configurable
max_concurrent_requests, default 8) utilizing HTTP/2 multiplexing"
- crates/operator/src/builder.rs:701, 1128: register_async_method("newt_fetchPolicyData", ...) / newt_evaluateAndSign — jsonrpsee async
handlers, not serialized
So the gateway already dispatches 8 concurrent in-flight requests per operator over a multiplexed HTTP/2 connection, and the operator
handles each call as an independent async task. STREAMING_CONSENSUS.md was drafted 2026-01-31 (per its own header) — it's describing a
prior architecture.
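The dispatch pattern that evidence describes is roughly the following — a minimal sketch with illustrative names (Request, HttpClient, and dispatch_loop are stand-ins I'm assuming here, not the actual gateway API):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

struct Request;                       // stand-in for the real RPC payload
struct HttpClient;                    // stand-in for the pooled HTTP/2 client
impl HttpClient {
    async fn call(&self, _req: Request) { /* one multiplexed HTTP/2 call */ }
}

// Per-operator worker: a semaphore sized to max_concurrent_requests bounds
// how many calls are in flight at once; each request runs as its own task,
// so up to 8 (the default) share one multiplexed HTTP/2 connection.
async fn dispatch_loop(
    max_concurrent_requests: usize,
    mut requests: mpsc::Receiver<Request>,
    client: Arc<HttpClient>,
) {
    let semaphore = Arc::new(Semaphore::new(max_concurrent_requests));
    while let Some(req) = requests.recv().await {
        let permit = semaphore.clone().acquire_owned().await.expect("semaphore closed");
        let client = client.clone();
        tokio::spawn(async move {
            let _permit = permit;     // held for the call's duration
            client.call(req).await;
        });
    }
}
```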
★ Insight ─────────────────────────────────────
Two of the doc's four named bottlenecks (lines 80-84) are already neutralized by the current code:
1. "Sequential Per-Operator Processing" — FALSE now. Semaphore-bounded 8-wide concurrent dispatch over HTTP/2.
2. "No Pipelining" — PARTIALLY FALSE within a phase. Multiple tasks/phases to the same operator run concurrently.
The remaining two ARE still real:
3. "Two-Phase Serialization" — Prepare still must globally complete before Commit broadcasts. Per-operator pipelining (Commit Op A
while Op B still in Prepare) is not implemented.
4. "Late Quorum Detection" — quorum still checked after response collection in a loop, not on the critical path per-message.
Bottlenecks (3) and (4) are gateway orchestration problems in sync.rs, not transport problems. Fixing them needs: (a) a streaming
response collector that checks quorum on every arrival, (b) per-operator pipelined two-phase, (c) fire-and-forget non-blocking
publishes. All three are achievable on HTTP/2 today — tokio::select! over FuturesUnordered + mpsc — no message bus required.
─────────────────────────────────────────────────
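A streaming collector of kind (a) is small. A sketch under assumed types — Response, Error, and collect_until_quorum are illustrative names, not the real sync.rs API:

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::future::Future;

struct Response;                      // stand-in for an operator's reply
struct Error;                         // stand-in for a transport/RPC error

// Check quorum on every arrival instead of after collecting everything.
// Returning early drops the remaining futures, cancelling slow calls.
async fn collect_until_quorum<F>(calls: Vec<F>, quorum: usize) -> Vec<Response>
where
    F: Future<Output = Result<Response, Error>>,
{
    let mut inflight: FuturesUnordered<F> = calls.into_iter().collect();
    let mut responses = Vec::new();
    while let Some(result) = inflight.next().await {
        if let Ok(resp) = result {
            responses.push(resp);
            if responses.len() >= quorum {
                return responses;     // early exit the moment quorum hits
            }
        }
    }
    responses                         // quorum never reached; caller decides
}
```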
This materially weakens the doc's 3-10× projection. Much of that headline gain is already delivered by the recent concurrency work; what
NATS would actually buy is (3) + (4), which are single-digit-hundred-millisecond wins, not multi-second.
Want me to keep this correction pinned and continue with the first-principles re-evaluation? My suggested path:
(A) Re-baseline: get actual measured p50/p95/p99 of Prepare→Commit gap and Commit→quorum on the current code. Without this, every
projection is conjecture.
(B) Then compare four concrete alternatives on the dimensions you listed:
1. No transport change — just fix gateway orchestration (streaming collector + per-op pipelining)
2. gRPC server-streaming — bidirectional streams over HTTP/2, keeps existing tonic-style bindings
3. NATS (current doc proposal)
4. libp2p gossipsub — operator-to-operator fan-out, minimizes gateway centralization
Proceed with (A) → (B)? Or do you want to go straight to (B) and flag measurement as an open TODO?
✻ Baked for 1m 1s
※ recap: We're critically re-evaluating docs/STREAMING_CONSENSUS.md from first principles; I just showed the doc's "sequential
per-operator" premise is stale since the gateway already runs 8-wide concurrent dispatch. Next: you pick path (A) measure-first or
straight to (B) transport shootout. (disable recaps in /config)
❯ Let's first update the doc to reflect the current state of the architecture/project, as the gateway, operators, etc. have had several
architecture improvements/changes since the doc was drafted. Also, how is "3. Two-Phase Serialization — Prepare still must globally
complete before Commit broadcasts. Per-operator pipelining (Commit Op A while Op B still in Prepare) is not implemented." possible?
Don't operators need to come to consensus on the policy task data before Commit happens?
⏺ You're right — I misstated the pipelining constraint. Let me verify the actual current flow before proposing doc edits.
Searched for 2 patterns (ctrl+o to expand)
⏺ You're right, and my "per-operator pipelining" was sloppy — the protocol invariant you're pointing at IS binding. Let me untangle:
What's impossible (as you intuited):
Commit requires all operators to BLS-sign the same consensus policyTaskData, which is derived from a quorum of Prepare responses
(median-based, crates/aggregator/src/consensus.rs). So Op A can't start Commit before Prepare quorum exists, and Op A + Op B can't sign
different consensus values (BLS aggregation needs one digest). True per-operator interleave is fundamentally incompatible with two-phase
consensus.
What IS possible — and already implemented:
Early phase transition: once Prepare quorum arrives, gateway broadcasts Commit immediately; slow Preparers drop their Prepare work.
Evidence:
- crates/gateway/src/task/operator.rs:1606-1641 — broadcast_fetch_policy_data uses FuturesUnordered with quorum_reached early-exit (line
1613, returns at line 1631)
- crates/gateway/src/task/operator.rs:1870-1910 — same pattern for broadcast_evaluate_and_sign
- crates/gateway/src/rpc/api/sync.rs:859 — broadcast_two_phase_and_aggregate chains these; Commit fires the moment Prepare quorum hits
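Chained together, the only protocol-legal pipelining looks roughly like this — a sketch reusing the collect_until_quorum collector from earlier; Operator, PolicyTaskData, median_policy_task_data, and the method names are illustrative stubs, not the actual sync.rs API:

```rust
#[derive(Clone)]
struct PolicyTaskData;                // stand-in for the consensus payload

struct Operator;
impl Operator {
    async fn fetch_policy_data(&self) -> Result<Response, Error> { Ok(Response) }
    async fn evaluate_and_sign(&self, _d: PolicyTaskData) -> Result<Response, Error> {
        Ok(Response)
    }
}

fn median_policy_task_data(_prepares: &[Response]) -> PolicyTaskData { PolicyTaskData }

// Commit broadcasts the instant Prepare quorum arrives. Operators still in
// Prepare receive Commit, abandon Prepare, and sign the same digest.
async fn two_phase(ops: &[Operator], quorum: usize) -> Vec<Response> {
    // Phase 1 (Prepare): stream responses, return at quorum.
    let prepares = collect_until_quorum(
        ops.iter().map(|op| op.fetch_policy_data()).collect(),
        quorum,
    )
    .await;

    // The consensus digest exists only now — median over the Prepare quorum.
    // Every signer must sign this exact value or BLS aggregation fails.
    let consensus = median_policy_task_data(&prepares);

    // Phase 2 (Commit): broadcast immediately, stream signatures to quorum.
    collect_until_quorum(
        ops.iter()
            .map(|op| op.evaluate_and_sign(consensus.clone()))
            .collect(),
        quorum,
    )
    .await
}
```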
So the doc's four named bottlenecks re-scored against current code:
┌─────┬─────────────────────────┬───────────────────────────────────────────────────────┐
│ # │ Bottleneck │ Status │
├─────┼─────────────────────────┼───────────────────────────────────────────────────────┤
│ 1 │ Sequential per-operator │ Fixed — semaphore + HTTP/2 mux │
├─────┼─────────────────────────┼───────────────────────────────────────────────────────┤
│ 2 │ Two-phase serialization │ Fixed — early phase transition on quorum │
├─────┼─────────────────────────┼───────────────────────────────────────────────────────┤
│ 3 │ No pipelining │ Protocol-constrained; nothing more to win beyond (2) │
├─────┼─────────────────────────┼───────────────────────────────────────────────────────┤
│ 4 │ Late quorum detection │ Fixed — per-response quorum check in FuturesUnordered │
└─────┴─────────────────────────┴───────────────────────────────────────────────────────┘
★ Insight ─────────────────────────────────────
The doc's core thesis — "NATS enables these wins" — is retrospective: all four were addressed in-place on HTTP/2 since 2026-01-31. What
NATS would actually buy today is small: (a) a binary codec instead of JSON-RPC serialization (achievable on HTTP/2 alone), (b)
cold-connection elimination (already handled by pooling), (c) response push before request-response return (already non-blocking via
FuturesUnordered). None justify a new infra dependency + centralization vector.
─────────────────────────────────────────────────
Proposed doc update strategy — three options:
(i) Surgical edits — patch the ~8 load-bearing sections (Executive Summary, Problem Statement, Current Latency Breakdown, Key
Bottlenecks, Expected Latency Improvement, Pipelined Two-Phase advanced section, add "What NATS Actually Buys Today", Changelog).
Preserves structure, reviewer can diff.
(ii) Deprecate + new doc — mark this one Status: Superseded, write a fresh STREAMING_CONSENSUS_V2.md that re-does the first-principles
evaluation with the current baseline. Cleaner but loses the decision-history trail.
(iii) Archive + replace with evaluation doc — rename this to archive/STREAMING_CONSENSUS_2026-01-31.md, write a new
LATENCY_EVALUATION.md that's not pre-committed to "streaming" as the answer — it re-frames the question as "where is remaining latency,
what's the cheapest lever."
My recommendation: (i) first, to set the record straight on the existing doc (other engineers may be reading it and making decisions off
its claims). Then (iii) as a separate follow-up if we want a fresh first-principles evaluation.
Which do you want? Also — before I touch the doc, do you want me to measure the actual current p50/p95 of Prepare→Commit and
Commit→quorum, so the "Current Latency Breakdown" table has real numbers instead of the stale 30s timeout / 15s timeout placeholders?
✻ Cogitated for 1m 58s
※ recap: Goal: critically re-evaluate STREAMING_CONSENSUS.md against the current architecture. Current task: update the doc since the
recent HTTP/2 concurrency work already addresses 3 of its 4 named bottlenecks. Next action: you pick option (i), (ii), or (iii) for the
update strategy. (disable recaps in /config)
❯ First let's update the doc. Instead of archiving, for this case you can just update it inline.
⏺ ★ Insight ─────────────────────────────────────
These edits preserve the original NATS proposal as historical context while adding 2026-04-17 update callouts that correct the record
with file:line evidence. Better than deletion — future readers can see the original thinking and understand what superseded it.
─────────────────────────────────────────────────
Executing 8 surgical edits to STREAMING_CONSENSUS.md in parallel.
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 3 lines, removed 3 lines
1 # Streaming Consensus Architecture
2
3 -> **Status**: Planning
4 -> **Branch**: `bls-block-time` (follow-up implementation)
5 -> **Last Updated**: 2026-01-31
3 +> **Status**: Partially Superseded — HTTP/2 improvements since drafting have addressed three of the four claimed bottlenec
+ks. NATS migration no longer recommended without fresh measurement.
4 +> **Branch**: `bls-block-time` (original follow-up implementation, never executed)
5 +> **Last Updated**: 2026-04-17 (original: 2026-01-31)
6
7 > **Terminology Note**: This document uses "Phase 1" and "Phase 2" terminology in code examples and metric names for brevi
ty. In production code and documentation, these map to:
8 >
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 4 lines, removed 2 lines
11
12 ## Executive Summary
13
14 -This document describes the migration from HTTP-based request-response communication between Gateway and Operators to a s
-treaming architecture using NATS messaging. The goal is to reduce end-to-end task latency by 3-10x through pipelining, ea
-rly quorum exit, and non-blocking message dispatch.
14 +> **2026-04-17 update**: The original 3-10× latency projection was predicated on four bottlenecks in the HTTP/2 codepath.
+ Three of those four have been addressed in-place since this doc was drafted — semaphore-bounded concurrent dispatch (def
+ault 8 in-flight per operator), HTTP/2 multiplexing, `FuturesUnordered`-based streaming response collection with per-resp
+onse quorum check, and early phase transition on Prepare quorum. The residual latency wins attributable to a NATS migrati
+on are now small and do not justify the new infrastructure dependency + centralization vector. This document is preserved
+ for reference but the recommendation to migrate to NATS is withdrawn pending fresh measurement. See [What NATS Actually
+Buys Today](#what-nats-actually-buys-today).
15
16 -**Key Decision**: Use NATS over WebSocket due to superior reliability, built-in reconnection, and lower implementation co
-mplexity.
16 +This document originally described the migration from HTTP-based request-response communication between Gateway and Opera
+tors to a streaming architecture using NATS messaging. The original goal was to reduce end-to-end task latency by 3-10× t
+hrough pipelining, early quorum exit, and non-blocking message dispatch.
17
18 +**Original Key Decision**: Use NATS over WebSocket due to superior reliability, built-in reconnection, and lower implemen
+tation complexity.
19 +
20 ---
21
22 ## Table of Contents
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 2 lines
38
39 ### Current Architecture Limitations
40
41 +> **2026-04-17 update**: This section describes the codebase state as of 2026-01-31. Since then, the gateway has landed s
+emaphore-bounded concurrent dispatch (`max_concurrent_requests: 8`, `crates/gateway/src/config.rs:370`), HTTP/2 multiplex
+ing (`crates/gateway/src/config.rs:358-361`), streaming response collection with per-response quorum check (`crates/gatew
+ay/src/task/operator.rs:1123-1211`, `1606-1641`, `1870-1910`), and early phase transition on Prepare quorum (`broadcast_t
+wo_phase_and_aggregate` in `crates/gateway/src/rpc/api/sync.rs:859`). The sub-sections below are retained for historical
+context; see [Key Bottlenecks](#key-bottlenecks) for the 2026-04-17 re-score.
42 +
43 The current HTTP/2-based communication between Gateway and Operators has several latency bottlenecks:
44
45 ```mermaid
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 21 lines
87 3. **No Pipelining**: Cannot start Evaluate phase for Operator A while Operator B is still in Prepare phase
88 4. **Late Quorum Detection**: Quorum is checked only after collecting all responses
89
90 +> **2026-04-17 re-score**:
91 +>
92 +> | # | Original claim | Current status | Evidence |
93 +> |---|---|---|---|
94 +> | 1 | Sequential Per-Operator Processing | **Addressed** — 8-wide semaphore + HTTP/2 multiplexing dispatches multiple
+in-flight requests per operator | `crates/gateway/src/config.rs:370`, `crates/gateway/src/task/operator.rs:624` |
95 +> | 2 | Two-Phase Serialization | **Addressed** — Commit broadcasts the moment Prepare quorum arrives; slow Preparers ab
+andon their work and start Commit | `crates/gateway/src/task/operator.rs:1606-1641`, `crates/gateway/src/rpc/api/sync.rs
+:859` |
96 +> | 3 | No Pipelining (per-operator Phase 2 interleave) | **Protocol-constrained** — Op A cannot start Commit before Pre
+pare quorum exists; all operators must BLS-sign the same consensus `policyTaskData`. The only valid pipelining is (2), w
+hich is done | See [Protocol Constraint on Per-Operator Pipelining](#protocol-constraint-on-per-operator-pipelining) |
97 +> | 4 | Late Quorum Detection | **Addressed** — quorum checked per-response in `FuturesUnordered` loop, early-exit when
+threshold hit | `crates/gateway/src/task/operator.rs:1123-1200`, `1606-1631`, `1870-1910` |
98 +>
99 +> Bottlenecks 1, 2, 4 have been addressed on the existing HTTP/2 transport. Bottleneck 3 is bounded by the consensus pro
+tocol, not the transport.
100 +
101 +### Protocol Constraint on Per-Operator Pipelining
102 +
103 +The original proposal included an "advanced pipelined mode" where Op A enters Commit while Op B is still in Prepare (see
+ [Pipelined Two-Phase (Advanced)](#22-pipelined-two-phase-advanced)). This is **not implementable** for Newton's two-pha
+se consensus:
104 +
105 +- **Commit requires a single consensus digest** across all signers. BLS aggregation verifies that all operators signed t
+he same `policyTaskData`.
106 +- **Consensus is derived from Prepare quorum** (median-based, `crates/aggregator/src/consensus.rs`). Until a quorum of P
+repare responses arrives, the consensus data does not exist — nothing for Op A to sign.
107 +- **Different consensus inputs → different digests → BLS pairing check fails**. If Op A signs a median computed from 2 P
+repare responses and Op B signs a median computed from 3 different Prepare responses, aggregation fails at `BLSSignature
+Checker.checkSignatures` with `InvalidQuorumApkHash` or a silent pairing mismatch (see `.claude/rules/lessons.md` "BLS a
+ggregation block invariants").
108 +
109 +The only valid pipelining permitted by the protocol is: **once Prepare quorum is reached, broadcast Commit to all operat
+ors — including those still working on Prepare, who abandon Prepare and start Commit**. This is already in production.
110 +
111 ---
112
113 ## Architecture Comparison
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 2 lines
291
292 ### Expected Latency Improvement
293
294 +> **2026-04-17 update**: The table below was computed against the 2026-01-31 baseline. With the in-place HTTP/2 improvem
+ents, the current baseline is closer to the "With NATS Streaming" column for bottlenecks 1, 2, and 4. Actual p50/p95/p99
+ numbers require measurement (none currently on record). The 3-10× projection no longer applies. See [What NATS Actually
+ Buys Today](#what-nats-actually-buys-today) for the current realistic accounting.
295 +
296 | Metric | Current (HTTP/2) | With NATS Streaming | Improvement |
297 |----------------------------|-----------------------|------------------------|--------------------|
298 | **Typical E2E** | 2-8s | 300ms-1.5s | **3-10x faster** |
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 34 lines
301 | **Message dispatch** | Sequential HTTP calls | Single publish | **10-100x faster** |
302 | **Worst case (timeout)** | 45s | 45s | No change |
303
304 +### What NATS Actually Buys Today
305 +
306 +With HTTP/2 multiplexing + semaphore-bounded concurrency + streaming quorum-exit already in place, the residual latency
+wins attributable to a NATS migration are:
307 +
308 +| Candidate win | Realistic impact | Achievable without NATS? |
309 +|-----------------------------------------------------|----------------------------------|------------------------------
+----------------------------------------|
310 +| Binary codec instead of JSON-RPC (~1-5ms per call) | ~1-5ms per phase | **Yes** — swap HTTP/2 body co
+dec to `bincode` / `capnp` / `protobuf` |
311 +| Eliminate cold-connection latency | Negligible after warmup | **Yes** — already handled by
+connection pooling |
312 +| Response push before request-response return | Negligible | **Yes** — already non-blockin
+g via `FuturesUnordered` |
313 +| JetStream persistence / replay | N/A for current use case | N/A |
314 +| Operator-to-operator fan-out | N/A — we don't use P2P messaging | N/A |
315 +
316 +Costs NATS would introduce:
317 +
318 +- **+1-2ms per message hop** (Gateway → NATS → Operator vs direct HTTP/2)
319 +- **New infrastructure SPoF** — NATS cluster availability, TLS, auth, subject authz
320 +- **New attack surface** — subject-based authz is weaker than mTLS-bound endpoints
321 +- **Centralization vector** — in an EigenLayer AVS, a shared message bus becomes a censorship/liveness chokepoint that t
+he trust model did not originally assume. Decentralization was marked "least prioritized" in the original framing, but i
+t remains a real cost
322 +- **State management** — JetStream adds another stateful component to operate, monitor, and alert on
323 +
324 +**Recommendation (2026-04-17)**: Do **not** pursue the NATS migration as originally proposed. Instead, the cheapest next
+ lever for additional latency reduction is, in order:
325 +
326 +1. **Measure first** — record actual p50/p95/p99 of Prepare→Commit gap and Commit→quorum for current production traffic.
+ Without this, every projection remains conjecture.
327 +2. **If measurement shows wire-format is the bottleneck** — swap JSON-RPC body codec to `bincode` / `capnp` / `protobuf`
+ on the existing HTTP/2 transport.
328 +3. **If measurement shows operator-side policy evaluation is the bottleneck** — optimize WASM execution, caching, or bat
+ched policy evaluation.
329 +4. **If measurement shows remaining gateway orchestration overhead** — reconsider messaging architectures, but evaluate
+gRPC server-streaming (HTTP/2-native, no new infra) and HTTP/3/QUIC alongside NATS.
330 +
331 +Alternatives under-considered in the original comparison (NATS vs WebSocket vs Redis) that deserve weight in any future
+evaluation:
332 +
333 +- **gRPC server-streaming** — bidirectional streams over HTTP/2, native Rust support via `tonic`, reuses existing connec
+tion pool
334 +- **HTTP/3 / QUIC** — built-in server push, 0-RTT resumption, UDP-based (no TCP head-of-line blocking)
335 +- **libp2p gossipsub** — operator-to-operator fan-out, aligns with AVS decentralization posture
336 +- **Binary codec on existing HTTP/2** — smallest change surface, preserves all existing reliability properties
337 +
338 ---
339
340 ## Streaming Protocol Design
⏺ Update(docs/STREAMING_CONSENSUS.md)
⎿ Added 2 lines
963
964 #### 2.2 Pipelined Two-Phase (Advanced)
965
966 +> **2026-04-17 update**: The "per-operator pipelining" described in this section (Op A in Phase 2 while Op B in Phase 1
+producing a distinct median) is **not implementable** given Newton's two-phase consensus semantics — see [Protocol Const
+raint on Per-Operator Pipelining](#protocol-constraint-on-per-operator-pipelining). The implementable version — "as soon
+ as Prepare quorum arrives, broadcast Commit immediately" — is already in production via `FuturesUnordered` early-exit i
+n `crates/gateway/src/task/operator.rs:1606-1641` and the chained call pattern in `crates/gateway/src/rpc/api/sync.rs:85
+9`. The code sample below is retained for historical context but does not reflect a viable implementation path.
967 +
968 For even lower latency, we can pipeline Phase 2 per-operator:
969
970 ```rust
⏺ Update(docs/STREAMING_CONSENSUS.md)
| ▐▛███▜▌ Claude Code v2.1.112 | |
| ▝▜█████▛▘ Opus 4.7 (1M context) with max effort · Claude API | |
| ▘▘ ▝▝ ~/projects/newton-prover-avs | |
| ✻ Conversation compacted (ctrl+o for history) | |
| ❯ /compact | |
| ⎿ Compacted (ctrl+o to see full summary) | |
| PreCompact [~/.claude/hooks/save-compact-state.sh] completed successfully | |
| PreCompact [node "${CLAUDE_PLUGIN_ROOT}/scripts/pre-compact.mjs"] completed successfully: {"continue":true,"systemMessage":"# | |
| PreCompact Checkpoint\n\nCreated: 2026-04-17T19:46:51.182Z\nTrigger: manual\n\n## Active Modes\n\n- Ecomode\n Prompt: this is really | |
| important: we should NOT guess what...\n\n---\nNote: This checkpoint preserves critical state before compaction.\nReview active | |
| modes to ensure continuity after compaction."} | |
| PreCompact [node "${CLAUDE_PLUGIN_ROOT}/scripts/project-memory-precompact.mjs"] completed successfully: | |
| {"continue":true,"systemMessage":"# Project Memory (Post-Compaction Recovery)\n\nThe following project context and user directives | |
| must be preserved after compaction:\n\n[Project Environment] Rust | using cargo | Build: cargo build | Test: cargo test\n\nKey | |
| Directories:\n- bin: Executable scripts\n- docs: Documentation\n- scripts: Build/utility scripts\n- circuits/scripts: Build/utility | |
| scripts\n- integration-tests/src: Source code\n\nIMPORTANT: These user directives must be followed throughout the session, even after | |
| compaction."} | |
| ⎿ Referenced file docs/STREAMING_CONSENSUS.md | |
| ⎿ Read docs/CLAUDE.md (7 lines) | |
| ⎿ Read .claude/rules/CLAUDE.md (7 lines) | |
| ⎿ Read .claude/rules/agent-guide.md (137 lines) | |
| ⎿ Read .claude/rules/git-hygiene.md (52 lines) | |
| ⎿ Skills restored (superpowers:brainstorming) | |
| ❯ continue | |
| ⏺ The 8 inline edits to docs/STREAMING_CONSENSUS.md are complete. The doc now reflects the current architecture with 2026-04-17 callouts, | |
| file:line evidence, and honest accounting of what NATS would still buy. | |
| ★ Insight ───────────────────────────────────── | |
| - The original doc's "3-10× latency reduction" claim rested on bottlenecks that no longer exist: sequential per-operator dispatch (now | |
| 8-concurrent via HTTP/2 multiplexing) and late quorum detection (now streaming via FuturesUnordered with early exit at | |
| crates/gateway/src/task/operator.rs:1123). | |
| - The remaining "bottleneck" — per-operator Commit/Prepare interleave — is protocol-impossible, not transport-impossible: BLS | |
| aggregation requires all signers to sign the same consensus digest, which can only be computed after Prepare quorum. | |
| - Residual NATS wins (~1-5ms binary codec savings) are achievable on HTTP/2 via bincode/capnp without the new infra. | |
| ───────────────────────────────────────────────── | |
| Where do you want to go next? | |
| (a) Verify the 8 edits look right (I can show the diff) | |
| (b) Return to the original first-principles re-eval, now with a corrected baseline | |
| (c) Measure real p50/p95/p99 of Prepare→Commit and Commit→quorum first, so any next design decision has data behind it | |
| My recommendation: (c) — the stale "30s timeout / 15s timeout" numbers in the old doc were projections, not measurements. Real numbers | |
| would either validate "architecture is fine, stop optimizing" or point at a specific bottleneck worth targeting. | |
| ※ recap: We're re-evaluating the STREAMING_CONSENSUS.md architecture doc; I finished 8 inline edits marking it partially superseded with | |
| current state evidence. Next: decide whether to measure real p50/p95/p99 latency before further design work, or return to the | |
| first-principles re-eval. | |
| ❯ (b) | |
| ⏺ Let me do this properly — first-principles means starting from protocol invariants and working outward, not starting from "NATS vs | |
| HTTP/2." | |
| 1. Protocol invariants (what cannot change) | |
| These are load-bearing and define the lower bound on any design: | |
| ┌─────────────────────────────────┬───────────────────────────────────────────────────────────┬─────────────────────────────────────┐ | |
| │ Invariant │ Source │ Implication │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Two phases are sequential │ Consensus requires Prepare quorum before Commit digest is │ ≥2 RTT floor │ | |
| │ │ computable │ │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ BLS aggregation needs identical │ eigensdk BLS math │ No per-operator Commit interleave │ | |
| │ digest │ │ │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Signatures bound to │ BLSSignatureChecker.checkSignatures │ Block mismatch → │ | |
| │ taskCreatedBlock │ │ InvalidQuorumApkHash │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Median consensus computed at │ crates/aggregator/src/consensus.rs │ Gateway is load-bearing │ | |
| │ gateway │ │ orchestrator │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Policy bytes content-addressed │ NewtonPolicy.policyCodeHash │ Cache-friendly, but first fetch │ | |
| │ via IPFS │ │ unavoidable │ | |
| └─────────────────────────────────┴───────────────────────────────────────────────────────────┴─────────────────────────────────────┘ | |
| Everything else — transport, codec, topology — is a choice, not a requirement. | |
| 2. Latency budget (where time actually goes) | |
| Without real measurements this is speculative, but based on code reading: | |
| ┌──────────────────────────────┬───────────────────────────────────────┬───────────────────────────────────────────────────────────┐ | |
| │ Stage │ Likely dominant cost │ Controllable? │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ createTask on-chain │ Block time + gas estimation + nonce │ Mostly no (chain-bound); BatchTaskManager batches to │ | |
| │ submission │ │ amortize │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Prepare broadcast │ 1 RTT + JSON encode/decode │ Yes — codec, connection reuse │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Operator: IPFS fetch │ Network + multihash verify │ Yes — cache hit rate │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Operator: data-provider │ External API latency │ No (external) │ | |
| │ fetch │ │ │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Gateway: median consensus │ Pure compute, <1ms │ Already cheap │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Commit broadcast │ 1 RTT + Rego eval + BLS sign │ Rego eval dominates per my read of rego/mod.rs │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Aggregator: BLS aggregation │ FuturesUnordered with quorum │ Already streaming │ | |
| │ │ early-exit │ │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ respondToTask on-chain │ Block time │ No │ | |
| └──────────────────────────────┴───────────────────────────────────────┴───────────────────────────────────────────────────────────┘ | |
| ★ Insight ───────────────────────────────────── | |
| The "transport layer is the bottleneck" framing in the original NATS doc conflated two different things: message delivery time (truly | |
| small on modern HTTP/2) and phase serialization time (bounded below by the protocol itself). No transport choice can reduce the | |
| protocol's RTT floor — it's 2 × max(network_rtt, operator_eval_time) + chain_confirmation. | |
| The real latency levers are: (1) IPFS cache hit rate, (2) Rego eval speed, (3) on-chain block time — none of which NATS addresses. | |
| ───────────────────────────────────────────────── | |
| 3. Score current HTTP/2 architecture | |
| Dimension-by-dimension, using the current code (operator.rs, sync.rs, operator_pool): | |
| ┌──────────────────┬───────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐ | |
| │ Dimension │ Score │ Rationale │ | |
| ├──────────────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Latency │ B+ │ RTT-bound, streaming quorum at operator.rs:1080, early phase transition on quorum. ~10-20% from │ | |
| │ │ │ lower bound on transport. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Performance │ B │ JSON encoding is hottest non-crypto path. Binary codec would help. │ | |
| │ (CPU/mem) │ │ │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Reliability │ A- │ Per-operator circuit breakers (circuit_breaker.rs), semaphore backpressure, retry with │ | |
| │ │ │ classification. Receipt timeout (60s L1 / 30s L2) plugs the get_receipt() hang. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Scalability │ A │ Stateless gateways; horizontal scale path is clean. Batch pipeline (batch_submitter.rs) separates TX │ | |
| │ │ │ broadcast from receipt polling. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Throughput │ B+ │ Per-gateway limited by BLS aggregation + chain submission, not transport. BatchTaskManager amortizes │ | |
| │ │ │ chain costs. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Security │ A- │ TLS on HTTP/2, selector-gated revert classification, content-addressed policy bytes, no wildcard │ | |
| │ │ │ error matching. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ A- │ Streaming quorum tolerates 1+ slow operator; collect_non_transient_errors surfaces failures; │ | |
| │ │ │ spawn_monitored survives panics. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Decentralization │ C+ │ Gateway is a centralized orchestrator. Operator pool is decentralized, but consensus computation is │ | |
| │ │ │ gateway-local. │ | |
| └────────────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘ | |
The current architecture is a solid B+ to A- overall — not a strong candidate for replacement.
| 4. Alternatives scored honestly | |
| Let me evaluate each realistic alternative on the same axes, relative to current: | |
| Option A: Stay on HTTP/2, add binary codec (bincode/capnp/postcard) | |
| ┌─────────────────┬────────────────────┬─────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
│ Latency │ saves ~1-5ms per hop │ Measurable but small │
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Performance │ Better │ Less CPU on encode/decode, less GC pressure │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Reliability │ No change │ Same transport │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Scalability │ No change │ Same connection model │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Security │ No change │ Same TLS │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Fault tolerance │ No change │ │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Complexity │ Low │ Contained to serde impls on RPC types │ | |
| └─────────────────┴────────────────────┴─────────────────────────────────────────────┘ | |
| Verdict: Clear win if measurement shows JSON is a hot spot. Low risk. | |
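For scale, the Option A codec swap looks like this on the wire — a hedged sketch where PrepareRequest is an illustrative payload I'm assuming, not the real RPC type (uses the bincode 1.x and serde_json APIs):

```rust
use serde::{Deserialize, Serialize};

// Illustrative payload only — the real request types live in the gateway
// and operator crates and carry more fields.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct PrepareRequest {
    task_id: [u8; 32],
    task_created_block: u64,
    policy_code_hash: [u8; 32],
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let req = PrepareRequest {
        task_id: [0u8; 32],
        task_created_block: 12_345,
        policy_code_hash: [1u8; 32],
    };
    // bincode writes fixed-width fields with no field names on the wire;
    // JSON spells out names and base-10 digits. Same transport, same TLS —
    // only the body codec changes.
    let binary = bincode::serialize(&req)?;
    let json = serde_json::to_vec(&req)?;
    println!("bincode: {} bytes, JSON: {} bytes", binary.len(), json.len());
    let back: PrepareRequest = bincode::deserialize(&binary)?;
    assert_eq!(back, req);
    Ok(())
}
```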
| Option B: gRPC (tonic) with server-streaming | |
| ┌─────────────────┬───────────────────┬──────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Latency │ Small improvement │ Protobuf + native server streaming │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Performance │ Better than JSON │ Similar to binary codec on HTTP/2 │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Reliability │ Same │ Still HTTP/2 │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Scalability │ Same │ │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Security │ Same │ TLS + mTLS native │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ Similar │ │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Complexity │ Medium │ Need .proto schema, new codegen, migrate clients │ | |
| └─────────────────┴───────────────────┴──────────────────────────────────────────────────┘ | |
| Verdict: Marginal latency gain over "Option A" at much higher migration cost. Only justified if server-streaming (single request, many | |
| responses) fits the model — which it doesn't, since quorum requires per-operator responses anyway. | |
| Option C: NATS/JetStream (original proposal) | |
| ┌──────────────────┬───────────────────────────────────────┬─────────────────────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
│ Latency │ adds ~1-2ms per hop, maybe │ Two hops (client→broker→operator), but potentially faster codec │
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Performance │ Similar or worse │ Extra hop; broker CPU │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Reliability │ Worse (clustered NATS), same (single) │ SPoF or +ops overhead │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Scalability │ Similar │ Scales with cluster │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Security │ Worse │ New auth surface; broker becomes target │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ Worse │ Broker outage = entire protocol down │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Decentralization │ Worse │ Centralizes on broker infra │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Complexity │ High │ New infra, new ops, new failure modes │ | |
| └──────────────────┴───────────────────────────────────────┴─────────────────────────────────────────────────────────────────┘ | |
| Verdict: Not justified given the protocol-constrained invariants. Original NATS case rested on bottlenecks that are now addressed. | |
| Option D: HTTP/3 / QUIC | |
| ┌─────────────────┬────────────────────┬───────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Latency │ 0-RTT on reconnect │ Only helps if connections drop frequently │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Performance │ Slightly better │ No head-of-line blocking │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Reliability │ Similar │ │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Scalability │ Similar │ │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Security │ Same │ QUIC native TLS │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Fault tolerance │ Slightly better │ Connection migration │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Complexity │ Medium-High │ Rust QUIC stack still maturing │ | |
| └─────────────────┴────────────────────┴───────────────────────────────────────────┘ | |
| Verdict: Too early. Revisit in 12-18 months when ecosystem matures. | |
| Option E: libp2p gossipsub | |
| ┌──────────────────┬─────────────────────────┬───────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Latency │ Worse │ Gossip fanout adds ms; not designed for consensus │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Performance │ Worse │ P2P overhead │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Reliability │ Different failure modes │ Eventual delivery, not synchronous │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Scalability │ Better for broadcast │ But consensus isn't broadcast, it's solicited │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Security │ Worse │ Peer discovery attack surface │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Decentralization │ Best │ Operators could gossip directly │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Complexity │ Very high │ │ | |
| └──────────────────┴─────────────────────────┴───────────────────────────────────────────────────┘ | |
| Verdict: Wrong protocol fit. Orchestrated request/response ≠ gossip. | |
| Option F: Optimize non-transport hotspots | |
| ┌──────────────────────────────────────────────────────┬─────────────────────────┬────────────┐ | |
| │ Target │ Estimated gain │ Complexity │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| ▐▛███▜▌ Claude Code v2.1.112 | |
| ▝▜█████▛▘ Opus 4.7 (1M context) with max effort · Claude API | |
| ▘▘ ▝▝ ~/projects/newton-prover-avs | |
| ✻ Conversation compacted (ctrl+o for history) | |
| ❯ /compact | |
| ⎿ Compacted (ctrl+o to see full summary) | |
| PreCompact [~/.claude/hooks/save-compact-state.sh] completed successfully | |
| PreCompact [node "${CLAUDE_PLUGIN_ROOT}/scripts/pre-compact.mjs"] completed successfully: {"continue":true,"systemMessage":"# | |
| PreCompact Checkpoint\n\nCreated: 2026-04-17T19:46:51.182Z\nTrigger: manual\n\n## Active Modes\n\n- Ecomode\n Prompt: this is really | |
| important: we should NOT guess what...\n\n---\nNote: This checkpoint preserves critical state before compaction.\nReview active | |
| modes to ensure continuity after compaction."} | |
| PreCompact [node "${CLAUDE_PLUGIN_ROOT}/scripts/project-memory-precompact.mjs"] completed successfully: | |
| {"continue":true,"systemMessage":"# Project Memory (Post-Compaction Recovery)\n\nThe following project context and user directives | |
| must be preserved after compaction:\n\n[Project Environment] Rust | using cargo | Build: cargo build | Test: cargo test\n\nKey | |
| Directories:\n- bin: Executable scripts\n- docs: Documentation\n- scripts: Build/utility scripts\n- circuits/scripts: Build/utility | |
| scripts\n- integration-tests/src: Source code\n\nIMPORTANT: These user directives must be followed throughout the session, even after | |
| compaction."} | |
| ⎿ Referenced file docs/STREAMING_CONSENSUS.md | |
| ⎿ Read docs/CLAUDE.md (7 lines) | |
| ⎿ Read .claude/rules/CLAUDE.md (7 lines) | |
| ⎿ Read .claude/rules/agent-guide.md (137 lines) | |
| ⎿ Read .claude/rules/git-hygiene.md (52 lines) | |
| ⎿ Skills restored (superpowers:brainstorming) | |
| ❯ continue | |
| ⏺ The 8 inline edits to docs/STREAMING_CONSENSUS.md are complete. The doc now reflects the current architecture with 2026-04-17 callouts, | |
| file:line evidence, and honest accounting of what NATS would still buy. | |
| ★ Insight ───────────────────────────────────── | |
| - The original doc's "3-10× latency reduction" claim rested on bottlenecks that no longer exist: sequential per-operator dispatch (now | |
| 8-concurrent via HTTP/2 multiplexing) and late quorum detection (now streaming via FuturesUnordered with early exit at | |
| crates/gateway/src/task/operator.rs:1123). | |
| - The remaining "bottleneck" — per-operator Commit/Prepare interleave — is protocol-impossible, not transport-impossible: BLS | |
| aggregation requires all signers to sign the same consensus digest, which can only be computed after Prepare quorum. | |
| - Residual NATS wins (~1-5ms binary codec savings) are achievable on HTTP/2 via bincode/capnp without the new infra. | |
| ───────────────────────────────────────────────── | |
| Where do you want to go next? | |
| (a) Verify the 8 edits look right (I can show the diff) | |
| (b) Return to the original first-principles re-eval, now with a corrected baseline | |
| (c) Measure real p50/p95/p99 of Prepare→Commit and Commit→quorum first, so any next design decision has data behind it | |
| My recommendation: (c) — the stale "30s timeout / 15s timeout" numbers in the old doc were projections, not measurements. Real numbers | |
| would either validate "architecture is fine, stop optimizing" or point at a specific bottleneck worth targeting. | |
| ※ recap: We're re-evaluating the STREAMING_CONSENSUS.md architecture doc; I finished 8 inline edits marking it partially superseded with | |
| current state evidence. Next: decide whether to measure real p50/p95/p99 latency before further design work, or return to the | |
| first-principles re-eval. | |
| ❯ (b) | |
| ⏺ Let me do this properly — first-principles means starting from protocol invariants and working outward, not starting from "NATS vs | |
| HTTP/2." | |
| 1. Protocol invariants (what cannot change) | |
| These are load-bearing and define the lower bound on any design: | |
| ┌─────────────────────────────────┬───────────────────────────────────────────────────────────┬─────────────────────────────────────┐ | |
| │ Invariant │ Source │ Implication │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Two phases are sequential │ Consensus requires Prepare quorum before Commit digest is │ ≥2 RTT floor │ | |
| │ │ computable │ │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ BLS aggregation needs identical │ eigensdk BLS math │ No per-operator Commit interleave │ | |
| │ digest │ │ │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Signatures bound to │ BLSSignatureChecker.checkSignatures │ Block mismatch → │ | |
| │ taskCreatedBlock │ │ InvalidQuorumApkHash │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Median consensus computed at │ crates/aggregator/src/consensus.rs │ Gateway is load-bearing │ | |
| │ gateway │ │ orchestrator │ | |
| ├─────────────────────────────────┼───────────────────────────────────────────────────────────┼─────────────────────────────────────┤ | |
| │ Policy bytes content-addressed │ NewtonPolicy.policyCodeHash │ Cache-friendly, but first fetch │ | |
| │ via IPFS │ │ unavoidable │ | |
| └─────────────────────────────────┴───────────────────────────────────────────────────────────┴─────────────────────────────────────┘ | |
| Everything else — transport, codec, topology — is a choice, not a requirement. | |
| 2. Latency budget (where time actually goes) | |
| Without real measurements this is speculative, but based on code reading: | |
| ┌──────────────────────────────┬───────────────────────────────────────┬───────────────────────────────────────────────────────────┐ | |
| │ Stage │ Likely dominant cost │ Controllable? │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ createTask on-chain │ Block time + gas estimation + nonce │ Mostly no (chain-bound); BatchTaskManager batches to │ | |
| │ submission │ │ amortize │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Prepare broadcast │ 1 RTT + JSON encode/decode │ Yes — codec, connection reuse │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Operator: IPFS fetch │ Network + multihash verify │ Yes — cache hit rate │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Operator: data-provider │ External API latency │ No (external) │ | |
| │ fetch │ │ │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Gateway: median consensus │ Pure compute, <1ms │ Already cheap │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Commit broadcast │ 1 RTT + Rego eval + BLS sign │ Rego eval dominates per my read of rego/mod.rs │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Aggregator: BLS aggregation │ FuturesUnordered with quorum │ Already streaming │ | |
| │ │ early-exit │ │ | |
| ├──────────────────────────────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ respondToTask on-chain │ Block time │ No │ | |
| └──────────────────────────────┴───────────────────────────────────────┴───────────────────────────────────────────────────────────┘ | |
| ★ Insight ───────────────────────────────────── | |
| The "transport layer is the bottleneck" framing in the original NATS doc conflated two different things: message delivery time (truly | |
| small on modern HTTP/2) and phase serialization time (bounded below by the protocol itself). No transport choice can reduce the | |
| protocol's RTT floor — it's 2 × max(network_rtt, operator_eval_time) + chain_confirmation. | |
| The real latency levers are: (1) IPFS cache hit rate, (2) Rego eval speed, (3) on-chain block time — none of which NATS addresses. | |
| ───────────────────────────────────────────────── | |
| 3. Score current HTTP/2 architecture | |
| Dimension-by-dimension, using the current code (operator.rs, sync.rs, operator_pool): | |
| ┌────────────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐ | |
| │ Dimension │ Score │ Rationale │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Latency │ B+ │ RTT-bound, streaming quorum at operator.rs:1080, early phase transition on quorum. ~10-20% from │ | |
| │ │ │ lower bound on transport. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Performance │ B │ JSON encoding is hottest non-crypto path. Binary codec would help. │ | |
| │ (CPU/mem) │ │ │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Reliability │ A- │ Per-operator circuit breakers (circuit_breaker.rs), semaphore backpressure, retry with │ | |
| │ │ │ classification. Receipt timeout (60s L1 / 30s L2) plugs the get_receipt() hang. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Scalability │ A │ Stateless gateways; horizontal scale path is clean. Batch pipeline (batch_submitter.rs) separates TX │ | |
| │ │ │ broadcast from receipt polling. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Throughput │ B+ │ Per-gateway limited by BLS aggregation + chain submission, not transport. BatchTaskManager amortizes │ | |
| │ │ │ chain costs. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Security │ A- │ TLS on HTTP/2, selector-gated revert classification, content-addressed policy bytes, no wildcard │ | |
| │ │ │ error matching. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ A- │ Streaming quorum tolerates 1+ slow operator; collect_non_transient_errors surfaces failures; │ | |
| │ │ │ spawn_monitored survives panics. │ | |
| ├────────────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Decentralization │ C+ │ Gateway is a centralized orchestrator. Operator pool is decentralized, but consensus computation is │ | |
| │ │ │ gateway-local. │ | |
| └────────────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘ | |
| The current architecture is a solid B+ to A- overall. It is not a great candidate for replacement. | |
| 4. Alternatives scored honestly | |
| Let me evaluate each realistic alternative on the same axes, relative to current: | |
| Option A: Stay on HTTP/2, add binary codec (bincode/capnp/postcard) | |
| ┌─────────────────┬────────────────────┬─────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Latency         │ -1 to -5ms per hop │ Measurable but small                        │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Performance │ Better │ Less CPU on encode/decode, less GC pressure │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Reliability │ No change │ Same transport │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Scalability │ No change │ Same connection model │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Security │ No change │ Same TLS │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Fault tolerance │ No change │ │ | |
| ├─────────────────┼────────────────────┼─────────────────────────────────────────────┤ | |
| │ Complexity │ Low │ Contained to serde impls on RPC types │ | |
| └─────────────────┴────────────────────┴─────────────────────────────────────────────┘ | |
| Verdict: Clear win if measurement shows JSON is a hot spot. Low risk. | |
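| A minimal sketch of the feature-flag shape, assuming postcard as the binary codec. PrepareRequest and its fields are | |
| hypothetical stand-ins for the real RPC types: | |
| ```rust | |
| use serde::{Deserialize, Serialize}; | |
|  | |
| // Hypothetical RPC type standing in for the real Prepare message. | |
| #[derive(Serialize, Deserialize)] | |
| struct PrepareRequest { | |
|     task_id: u64, | |
|     policy_cid: String, | |
|     payload: Vec<u8>, | |
| } | |
|  | |
| // Behind a Cargo feature so both codecs can be A/B-measured in place. | |
| #[cfg(feature = "binary-codec")] | |
| fn encode(req: &PrepareRequest) -> Vec<u8> { | |
|     postcard::to_allocvec(req).expect("postcard encode") | |
| } | |
|  | |
| #[cfg(not(feature = "binary-codec"))] | |
| fn encode(req: &PrepareRequest) -> Vec<u8> { | |
|     serde_json::to_vec(req).expect("json encode") | |
| } | |
| ``` | |
| The flag keeps rollback trivial: flip the feature off and you're back on JSON with no code churn. | |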
| Option B: gRPC (tonic) with server-streaming | |
| ┌─────────────────┬───────────────────┬──────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Latency │ Small improvement │ Protobuf + native server streaming │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Performance │ Better than JSON │ Similar to binary codec on HTTP/2 │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Reliability │ Same │ Still HTTP/2 │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Scalability │ Same │ │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Security │ Same │ TLS + mTLS native │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ Similar │ │ | |
| ├─────────────────┼───────────────────┼──────────────────────────────────────────────────┤ | |
| │ Complexity │ Medium │ Need .proto schema, new codegen, migrate clients │ | |
| └─────────────────┴───────────────────┴──────────────────────────────────────────────────┘ | |
| Verdict: Marginal latency gain over "Option A" at much higher migration cost. Only justified if server-streaming (single request, many | |
| responses) fits the model — which it doesn't, since quorum requires per-operator responses anyway. | |
| Option C: NATS/JetStream (original proposal) | |
| ┌──────────────────┬───────────────────────────────────────┬─────────────────────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Latency │ -1 to -2ms, maybe │ Two hops (client→broker→operator), but potentially faster codec │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Performance │ Similar or worse │ Extra hop; broker CPU │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Reliability      │ Worse (single=SPoF) or ops-heavy (HA) │ SPoF or +ops overhead                                           │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Scalability │ Similar │ Scales with cluster │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Security │ Worse │ New auth surface; broker becomes target │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Fault tolerance │ Worse │ Broker outage = entire protocol down │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Decentralization │ Worse │ Centralizes on broker infra │ | |
| ├──────────────────┼───────────────────────────────────────┼─────────────────────────────────────────────────────────────────┤ | |
| │ Complexity │ High │ New infra, new ops, new failure modes │ | |
| └──────────────────┴───────────────────────────────────────┴─────────────────────────────────────────────────────────────────┘ | |
| Verdict: Not justified given the protocol-constrained invariants. Original NATS case rested on bottlenecks that are now addressed. | |
| Option D: HTTP/3 / QUIC | |
| ┌─────────────────┬────────────────────┬───────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Latency │ 0-RTT on reconnect │ Only helps if connections drop frequently │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Performance │ Slightly better │ No head-of-line blocking │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Reliability │ Similar │ │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Scalability │ Similar │ │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Security │ Same │ QUIC native TLS │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Fault tolerance │ Slightly better │ Connection migration │ | |
| ├─────────────────┼────────────────────┼───────────────────────────────────────────┤ | |
| │ Complexity │ Medium-High │ Rust QUIC stack still maturing │ | |
| └─────────────────┴────────────────────┴───────────────────────────────────────────┘ | |
| Verdict: Too early. Revisit in 12-18 months when ecosystem matures. | |
| Option E: libp2p gossipsub | |
| ┌──────────────────┬─────────────────────────┬───────────────────────────────────────────────────┐ | |
| │ Dimension │ Δ vs current │ Notes │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Latency │ Worse │ Gossip fanout adds ms; not designed for consensus │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Performance │ Worse │ P2P overhead │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Reliability │ Different failure modes │ Eventual delivery, not synchronous │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Scalability │ Better for broadcast │ But consensus isn't broadcast, it's solicited │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Security │ Worse │ Peer discovery attack surface │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Decentralization │ Best │ Operators could gossip directly │ | |
| ├──────────────────┼─────────────────────────┼───────────────────────────────────────────────────┤ | |
| │ Complexity │ Very high │ │ | |
| └──────────────────┴─────────────────────────┴───────────────────────────────────────────────────┘ | |
| Verdict: Wrong protocol fit. Orchestrated request/response ≠ gossip. | |
| Option F: Optimize non-transport hotspots | |
| ┌──────────────────────────────────────────────────────┬─────────────────────────┬────────────┐ | |
| │ Target │ Estimated gain │ Complexity │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| │ IPFS cache warming (pre-fetch known policies) │ High │ Low │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| │ Rego eval profiling (hot policies) │ Medium-High │ Medium │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| │ Policy-bytes L1/L2 cache TTL tuning │ Medium │ Low │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| │ BatchTaskManager tuning (batch size, flush interval) │ Medium │ Low │ | |
| ├──────────────────────────────────────────────────────┼─────────────────────────┼────────────┤ | |
| │ taskCreationBufferWindow tuning per chain │ Already done 2026-04-08 │ — │ | |
| └──────────────────────────────────────────────────────┴─────────────────────────┴────────────┘ | |
| Verdict: Most likely to move the needle. Transport is ~1-5% of end-to-end latency; Rego eval + IPFS + chain confirmation are probably | |
| 90%+. | |
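| For the highest-leverage row, the warming loop is structurally simple. prefetch and the CID source are hypothetical; the | |
| real version would pull known policy CIDs from the registry and write into the existing L1/L2 cache: | |
| ```rust | |
| use std::time::Duration; | |
|  | |
| // Hypothetical warming loop: keep known policy bytes hot so the | |
| // Prepare phase never pays a cold IPFS fetch. | |
| async fn warm_policy_cache(known_cids: Vec<String>) { | |
|     loop { | |
|         for cid in &known_cids { | |
|             if let Err(e) = prefetch(cid).await { | |
|                 tracing::warn!(%cid, error = %e, "policy prefetch failed"); | |
|             } | |
|         } | |
|         tokio::time::sleep(Duration::from_secs(300)).await; // re-warm every 5 min | |
|     } | |
| } | |
|  | |
| async fn prefetch(cid: &str) -> anyhow::Result<()> { | |
|     // fetch bytes, verify multihash, insert into the L1/L2 cache | |
|     Ok(()) | |
| } | |
| ``` | |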
| 5. First-principles recommendation | |
| Based on the analysis: | |
| Keep the HTTP/2 + tokio + FuturesUnordered stack. It's the right fit for orchestrated two-phase consensus. The transport is not the | |
| bottleneck. | |
| Do three things in order: | |
| 1. Measure first. Instrument sync.rs with spans around: Prepare broadcast start → first response → quorum reached → Commit broadcast | |
| start → first response → quorum reached → BLS aggregation complete → on-chain receipt. Get p50/p95/p99 over 24h on stagef. Without this, | |
| every further decision is speculation. (A span sketch follows this list.) | |
| 2. Harvest the cheap wins. Binary codec (postcard/bincode) behind a feature flag, measure again. This is ~1 week of work, zero | |
| architectural risk. | |
| 3. Then decide. If measurement shows Rego eval dominates, invest in regorus performance. If IPFS dominates, invest in cache warming. If | |
| BLS aggregation dominates, profile eigensdk. If the chain submission dominates, tune batcher flush interval. | |
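| Here's roughly what the step-1 instrumentation looks like with tracing. The function shape and span names are illustrative, | |
| not the real sync.rs structure: | |
| ```rust | |
| use tracing::{info_span, Instrument}; | |
|  | |
| // Illustrative: wrap each phase in a span so p50/p95/p99 fall out of | |
| // span durations in whatever backend stagef exports to. | |
| async fn run_task(task_id: u64) { | |
|     async { | |
|         // broadcast start → first response → quorum reached | |
|     } | |
|     .instrument(info_span!("prepare_phase", task_id)) | |
|     .await; | |
|  | |
|     async { | |
|         // broadcast start → first response → quorum → BLS aggregation | |
|     } | |
|     .instrument(info_span!("commit_phase", task_id)) | |
|     .await; | |
|  | |
|     async { | |
|         // submission → on-chain receipt | |
|     } | |
|     .instrument(info_span!("chain_submit", task_id)) | |
|     .await; | |
| } | |
| ``` | |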
| Do not migrate to NATS. The original case was built on bottlenecks that already got fixed (semaphore concurrency, streaming quorum). | |
| What remains is either (a) not a transport problem at all, or (b) achievable cheaper on the existing stack. | |
| Revisit decentralization separately. Moving consensus computation off-gateway is a different project (probably involves a threshold BLS | |
| scheme or operator-led aggregation). Don't bundle it with transport. | |
| ★ Insight ───────────────────────────────────── | |
| The most valuable output of this re-eval isn't the NATS verdict — it's the measurement plan. Designs without numbers optimize the wrong | |
| thing. The original doc's "3-10×" figure was a projection; the new 2026-04-17 callouts were corrections; the next step (p50/p95/p99 | |
| spans) is where ground truth lives. | |
| A classic staff-engineer move: when the premise of a proposal changes, don't just update the doc — update the question. The question | |
| isn't "should we migrate to NATS?" anymore. It's "where does our latency budget actually go?" | |
| ───────────────────────────────────────────────── | |
| Want me to write the measurement plan as a follow-up doc (what spans to add, where, what thresholds define "hot"), or is this evaluation | |
| sufficient for now? | |
| ✻ Baked for 1m 36s | |
| ❯ yes, so let's decide to not migrate to NATS. But in the future, would we ever consider p2p? | |
| ⏺ Good question. Let me think about this in terms of triggers rather than "someday maybe" — P2P is a tool for specific jobs, and the | |
| question is whether Newton will ever face those jobs. | |
| What P2P is actually good at | |
| P2P transports (libp2p, gossipsub, mesh) earn their complexity budget when you have: | |
| ┌──────────────────────────────┬────────────────────────────────────────────┬────────────────────────────────────────┐ | |
| │ Need │ Why HTTP/2 fails │ Why P2P wins │ | |
| ├──────────────────────────────┼────────────────────────────────────────────┼────────────────────────────────────────┤ | |
| │ Large N broadcast (N > ~500) │ O(N) sockets and message copies per gateway │ Log(N) fanout via mesh                  │ | |
| ├──────────────────────────────┼────────────────────────────────────────────┼────────────────────────────────────────┤ | |
| │ Decentralized discovery │ Static operator registry │ DHT + peer exchange │ | |
| ├──────────────────────────────┼────────────────────────────────────────────┼────────────────────────────────────────┤ | |
| │ Censorship resistance │ SPoF at orchestrator │ Routes around coercion │ | |
| ├──────────────────────────────┼────────────────────────────────────────────┼────────────────────────────────────────┤ | |
| │ Network partition tolerance │ Split-brain requires manual reconciliation │ Eventual consistency is native │ | |
| ├──────────────────────────────┼────────────────────────────────────────────┼────────────────────────────────────────┤ | |
| │ Permissionless entry │ Gateway gatekeeps access │ Peers self-admit via capability proofs │ | |
| └──────────────────────────────┴────────────────────────────────────────────┴────────────────────────────────────────┘ | |
| P2P is bad at: | |
| ┌───────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────┐ | |
| │ Need │ Why P2P fails │ | |
| ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤ | |
| │ Synchronous consensus with tight deadlines │ Gossip convergence adds 50-500ms per hop │ | |
| ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤ | |
| │ Orchestrator-computed aggregation (like median consensus) │ There is no orchestrator — someone still has to compute │ | |
| ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤ | |
| │ Deterministic latency │ Gossipsub latency has long tails │ | |
| ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────┤ | |
| │ Small operator sets (<50) │ Direct connections are strictly simpler and faster │ | |
| └───────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────┘ | |
| The three triggers that would force the conversation | |
| Trigger 1: Gateway decentralization becomes load-bearing | |
| Today, Newton's gateway is a single operational entity per deployment. That's a centralization risk most users tolerate because the | |
| gateway can't forge BLS signatures — operators still sign independently, and the gateway just orchestrates and submits. But: | |
| - If regulators treat gateway-operator as "the protocol" and impose liability | |
| - If economic attacks exploit gateway availability (DoS the gateway → protocol down) | |
| - If Newton commits to "no privileged actors" as a product promise | |
| ...then Newton needs multiple gateways that coordinate. That's the first realistic P2P case. Not gateway-to-operator P2P — that doesn't | |
| help. Gateway-to-gateway P2P for things like: | |
| - Deduplicating tasks submitted to multiple gateways (avoid double BLS aggregation) | |
| - Sharing operator health state | |
| - Leader election for on-chain submission (avoid duplicate TXs and nonce races) | |
| - Consensus on which gateway submits a given task | |
| This could be done with Redis + Postgres + leader election (today's tools), or with libp2p gossipsub between gateway peers. The P2P | |
| choice wins if you want operators to be able to run their own gateway instance without trusting the Newton team's infrastructure. | |
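| For scale, the Redis route is tiny. A sketch with redis-rs using the standard SET NX PX lease pattern (the key name, TTL, | |
| and gateway_id are illustrative): | |
| ```rust | |
| // Leader election via a Redis lease. Whoever holds the key submits | |
| // on-chain; the lease expires if the leader dies. | |
| fn try_acquire_leadership( | |
|     con: &mut redis::Connection, | |
|     gateway_id: &str, | |
| ) -> redis::RedisResult<bool> { | |
|     let reply: Option<String> = redis::cmd("SET") | |
|         .arg("newton:gateway-leader") | |
|         .arg(gateway_id) | |
|         .arg("NX") // only succeeds if no current leader | |
|         .arg("PX") | |
|         .arg(10_000) // lease expires in 10s unless renewed | |
|         .query(con)?; | |
|     Ok(reply.is_some()) // Some("OK") = we lead; None = someone else does | |
| } | |
| ``` | |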
| Trigger 2: Operator count grows 10-100× | |
| The current operator model scales fine to ~50-100 operators per chain. operator_pool maintains an HTTP/2 connection per operator with | |
| up to 8 concurrent streams. At 50 operators × 2 phases × 1 connection = 100 streams across 50 sockets. Fine. | |
| At 1000 operators, it breaks: | |
| - Per-gateway outbound socket pressure (1000+ live TLS connections) | |
| - Per-operator inbound connection fanout from multiple gateways | |
| - Broadcast amplification (each gateway sends the same task to 1000 operators) | |
| Here, hierarchical relays or regional gossip meshes become compelling. Operators in a region subscribe to a gossip topic; gateway | |
| publishes once, regional relays fan out. libp2p gossipsub is purpose-built for this. | |
| But — this is an EigenLayer AVS. Restaked operator economics limit how many operators you actually want. High-quorum sets with 500+ | |
| operators are rare. So this trigger probably fires only if Newton's TVL and operator base grow roughly 20-50× from today. | |
| Trigger 3: Censorship resistance becomes a primary invariant | |
| If Newton's authorization layer must work even when a state-level adversary targets the Newton team, gateway infrastructure, or specific | |
| cloud providers — P2P gossip for task distribution becomes necessary. Operators gossip tasks to each other; anyone can inject a valid | |
| task; the orchestration role becomes a rotating duty rather than a fixed role. | |
| This is a very different protocol. It's not a transport upgrade; it's a full redesign of who can create tasks and who aggregates | |
| responses. The P2P layer is the smallest piece — the bigger questions are: | |
| - Can anyone create a task? Then how do you prevent DoS? | |
| - Who aggregates when there's no gateway? Rotating aggregator? Threshold aggregator? | |
| - How do on-chain submissions happen without a gateway wallet? Operator-submitted with reimbursement? | |
| What P2P fundamentally can't change for Newton | |
| ┌──────────────────────────────────────────────────┬───────────────────────────────────────────────────────────┐ | |
| │ Invariant │ Why P2P doesn't help │ | |
| ├──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Two-phase consensus is sequential │ Protocol requirement, not transport │ | |
| ├──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ BLS aggregation needs identical digest │ Cryptography, not networking │ | |
| ├──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Median consensus needs a single compute location │ Math, not topology (unless you use threshold aggregation) │ | |
| ├──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ On-chain submission needs a signer │ Someone pays gas; gossip doesn't mint tokens │ | |
| ├──────────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤ | |
| │ Tasks are latency-sensitive (sub-second target) │ Gossipsub is not deterministic-latency │ | |
| └──────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘ | |
| So even in a P2P world, the structure of Newton's consensus doesn't change — you just move the orchestrator role around. That's useful | |
| for decentralization, useless for latency. | |
| The hybrid that probably wins | |
| If any trigger fires, my bet is a layered design rather than full P2P: | |
| ┌───────────────────────────────────────────────┐ | |
| │ Gateway peer ←─ libp2p gossip ─→ Gateway peer │   ← gateways coordinate | |
| └──────┬─────────────────────────────────┬──────┘ | |
|        │ HTTP/2                          │ HTTP/2 | |
|        ▼                                 ▼ | |
|  ┌──────────┐                      ┌──────────┐ | |
|  │ Operator │                      │ Operator │   ← consensus path stays HTTP/2 | |
|  └──────────┘                      └──────────┘ | |
| - Gateway↔Gateway: libp2p for coordination (leader election, task dedup, operator set convergence) | |
| - Gateway↔Operator: HTTP/2 stays, because consensus is orchestrated and synchronous | |
| - Operator↔Operator: optional libp2p for liveness heartbeats, not consensus | |
| This way, P2P earns its complexity exactly where it's strongest (broadcast coordination among a dynamic peer set) and avoids the places | |
| it's weakest (synchronous tight-deadline consensus). | |
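| For scale, the gateway↔gateway layer is roughly this much rust-libp2p surface. Hedged sketch: gossipsub API names drift | |
| across libp2p releases (this assumes the 0.5x-era module layout), and the topic names are made up: | |
| ```rust | |
| use libp2p::gossipsub::{self, IdentTopic, MessageAuthenticity}; | |
| use libp2p::identity::Keypair; | |
|  | |
| fn build_gateway_mesh() -> Result<gossipsub::Behaviour, Box<dyn std::error::Error>> { | |
|     let keypair = Keypair::generate_ed25519(); | |
|     let config = gossipsub::Config::default(); | |
|     // Signed messages so peers can reject forged coordination traffic. | |
|     let mut behaviour = | |
|         gossipsub::Behaviour::new(MessageAuthenticity::Signed(keypair), config)?; | |
|     // One topic per coordination concern, never per task. | |
|     behaviour.subscribe(&IdentTopic::new("newton/gateway/task-dedup"))?; | |
|     behaviour.subscribe(&IdentTopic::new("newton/gateway/operator-health"))?; | |
|     Ok(behaviour) | |
| } | |
| ``` | |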
| When to revisit | |
| Concrete triggers, not calendar-based: | |
| ┌─────────────────────────────────────┬─────────────────────────────────────────────┬──────────────────────────────────────────────┐ | |
| │ Trigger │ Signal │ Action │ | |
| ├─────────────────────────────────────┼─────────────────────────────────────────────┼──────────────────────────────────────────────┤ | |
| │ Gateway decentralization becomes a │ Product/legal mandates multi-operator │ Design gateway-to-gateway P2P; HTTP/2 │ | |
| │ product requirement │ gateway tier │ down-path stays │ | |
| ├─────────────────────────────────────┼─────────────────────────────────────────────┼──────────────────────────────────────────────┤ | |
| │ Operator count crosses ~500 per │ Prometheus shows socket pressure on gateway │ Introduce regional relays or gossip mesh │ | |
| │ chain │ or fanout latency climbing │ │ | |
| ├─────────────────────────────────────┼─────────────────────────────────────────────┼──────────────────────────────────────────────┤ | |
| │ Censorship resistance becomes a │ Regulatory / adversary threat model change │ Full protocol redesign — P2P is one of many │ | |
| │ primary invariant │ │ changes, not the main one │ | |
| ├─────────────────────────────────────┼─────────────────────────────────────────────┼──────────────────────────────────────────────┤ | |
| │ Threshold aggregation lands │ Phase 2D+ work removes gateway from │ Operator↔Operator P2P for partial signature │ | |
| │ │ aggregation path │ exchange │ | |
| └─────────────────────────────────────┴─────────────────────────────────────────────┴──────────────────────────────────────────────┘ | |
| The last one is interesting — threshold BLS/FROST work (already planned in the DKG Phase 2 work per memory) naturally pulls the | |
| aggregation role away from a single gateway, and at that point operator↔operator partial-signature exchange could use libp2p. That's the | |
| most probable near-term P2P entry point in Newton's roadmap, not "migrate the gateway broadcast." | |
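| To make that shape concrete: each operator gossips a partial signature, and any peer past the threshold can finish the | |
| aggregate. Everything below is a hypothetical stand-in for whatever FROST/threshold-BLS library Phase 2D lands on (real | |
| aggregation is Lagrange interpolation over shares, not what this stub does): | |
| ```rust | |
| // Hypothetical types, not a real threshold-signature API. | |
| struct PartialSig { | |
|     signer_index: u32, | |
|     bytes: [u8; 96], // BLS signature share size, for illustration | |
| } | |
|  | |
| // Collect partials from operator↔operator gossip; finish once t-of-n is met. | |
| fn try_finish(parts: &[PartialSig], threshold: usize) -> Option<Vec<u8>> { | |
|     if parts.len() < threshold { | |
|         return None; // keep listening for more shares | |
|     } | |
|     // Placeholder: a real scheme interpolates the shares; this only | |
|     // marks where that call goes. | |
|     Some(parts.iter().flat_map(|p| p.bytes).collect()) | |
| } | |
| ``` | |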
| ★ Insight ───────────────────────────────────── | |
| The "P2P vs centralized" question often gets framed as transport, but it's really organizational. The technical transport is almost | |
| never the reason to adopt libp2p — the reason is usually "we can't have a single entity control this." When that constraint bites, P2P | |
| is one of several tools (alongside threshold crypto, rotating leaders, and content-addressed storage). Treating P2P as a latency or | |
| scaling decision usually leads to regret. | |
| Newton's decentralization story is already heading toward cryptographic decentralization (threshold DKG, FROST, PSS) before network | |
| decentralization. That's the right order — you can't meaningfully decentralize the network layer while a single entity holds the | |
| aggregate private key. Once threshold signing lands, network-layer P2P becomes coherent; before that, it's premature. | |
| ───────────────────────────────────────────────── | |
| Short answer | |
| Yes, under specific triggers. Most likely entry point: operator-to-operator P2P for threshold partial-signature exchange after FROST/PSS | |
| lands (that's Phase 2D territory). Second most likely: gateway-to-gateway P2P when multi-gateway decentralization becomes a product | |
| requirement. Least likely and probably never: full gateway→operator P2P — it doesn't fit two-phase synchronous consensus. | |
| Want me to save this as a project memory or a future-work note in docs/ROADMAP.md? This feels like the kind of analysis that's worth | |
| having on file for when someone re-opens the "should we use libp2p?" question in 12-18 months. |