-
-
Save denniswon/8a55e803f87d96eca546ac7aeea8297a to your computer and use it in GitHub Desktop.
| --- | |
| Gateway Scalability & Throughput Analysis | |
| End-to-End Flow Timeline | |
| Here's the create_task critical path with approximate timings: | |
| Client Request | |
| │ | |
| ├── Phase 0: Request Setup ..................... ~1-5ms | |
| │ Permission check, task_id, chain resolution, parse intent | |
| │ | |
| ├── Phase 1: Block & Config Fetch ............. ~10-100ms | |
| │ policy_config RPC + block_cache (parallel tokio::try_join!) | |
| │ | |
| ├── Phase 2: Task Creation Queue .............. ~0-1ms | |
| │ mpsc::send to batch submitter (non-blocking) | |
| │ | |
| ├── Phase 3: BLS Aggregation Init ............. ~1-5ms | |
| │ DashMap insert, oneshot channel setup | |
| │ | |
| ├── Phase 4: PREPARE Phase .................... ~100-30000ms ⭐ PRIMARY LATENCY | |
| │ EIP712 hash + ECDSA sign (CPU) | |
| │ broadcast_fetch_policy_data_unified → FuturesUnordered | |
| │ Per-response: 2× RwLock reads for stake data | |
| │ Exit on stake-weighted quorum | |
| │ | |
| ├── Phase 5: CONSENSUS ........................ ~1-10ms | |
| │ Median normalization + tolerance check (CPU) | |
| │ | |
| ├── Phase 6: COMMIT Phase ..................... ~100-30000ms ⭐ PRIMARY LATENCY | |
| │ broadcast_evaluate_and_sign → FuturesUnordered | |
| │ Waits for ALL operators (NO early exit on quorum!) | |
| │ | |
| ├── Phase 7: BLS Aggregation .................. ~5-50ms | |
| │ Sequential verify + aggregate per response (CPU) | |
| │ | |
| ├── Phase 8: Response Construction ............ ~1-10ms | |
| │ signature_data, certificate, validate_calldata | |
| │ | |
| └── Return to Client | |
| │ | |
| └── Phase 9: Background On-Chain .......... (async, decoupled) | |
| Batch submitter accumulates → pipeline_depth=3 → receipt tracker | |
| ★ Insight ───────────────────────────────────── | |
| The client-facing latency is dominated by the two sequential network broadcast rounds (Prepare + Commit). Everything else is sub-10ms. | |
| The key insight is that the Commit phase waits for ALL operators, not quorum — a single slow operator delays the entire response even | |
| when quorum stake is already met. This is the single biggest latency win available. | |
| ───────────────────────────────────────────────── | |
| --- | |
| Bottleneck Inventory | |
| CRITICAL — Directly impacts per-request latency | |
| ┌─────┬────────────────────────────────────────────────────────┬───────────────────────┬─────────────────────────────────────────────┐ | |
| │ ID │ Bottleneck │ Location │ Impact │ | |
| ├─────┼────────────────────────────────────────────────────────┼───────────────────────┼─────────────────────────────────────────────┤ | |
| │ B1 │ Commit phase waits for ALL operators, no quorum early │ operator.rs:1770-1802 │ p100 latency instead of p50-p75 │ | |
| │ │ exit │ │ │ | |
| ├─────┼────────────────────────────────────────────────────────┼───────────────────────┼─────────────────────────────────────────────┤ | |
| │ B2 │ ECDSA signer re-loaded from config on every task │ common.rs:312-313 │ File I/O + key derivation per task │ | |
| ├─────┼────────────────────────────────────────────────────────┼───────────────────────┼─────────────────────────────────────────────┤ | |
| │ B3 │ Policy config RPC call per task (no caching) │ common.rs:121-124 │ 10-100ms RPC per task, even for same │ | |
| │ │ │ │ policy_client │ | |
| ├─────┼────────────────────────────────────────────────────────┼───────────────────────┼─────────────────────────────────────────────┤ | |
| │ B4 │ 2N RwLock acquisitions per broadcast (stake reads per │ operator.rs:1072,1252 │ Lock contention scales with operator count │ | |
| │ │ response) │ │ │ | |
| └─────┴────────────────────────────────────────────────────────┴───────────────────────┴─────────────────────────────────────────────┘ | |
| HIGH — Impacts throughput under load | |
| ┌─────┬──────────────────────────────────────────────┬────────────────────────┬──────────────────────────────────────────────────────┐ | |
| │ ID │ Bottleneck │ Location │ Impact │ | |
| ├─────┼──────────────────────────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B5 │ gateway_tasks RwLock write in hot path │ sync.rs:98-103 │ Serializes all concurrent task creations │ | |
| ├─────┼──────────────────────────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B6 │ Batch submitter pipeline semaphore blocks │ batch_submitter.rs:551 │ Back-pressure cascades to RPC handler when 3 TXs │ | |
| │ │ accumulator │ │ in-flight │ | |
| ├─────┼──────────────────────────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B7 │ RPC response cloned for every operator │ operator.rs:1245 │ ConsensusPrepareResponse can be large; N clones per │ | |
| │ │ response │ │ broadcast │ | |
| ├─────┼──────────────────────────────────────────────┼────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B8 │ Cold block cache on first request after │ block_cache.rs:77 │ 100-800ms + 3 retries on cache miss │ | |
| │ │ startup │ │ │ | |
| └─────┴──────────────────────────────────────────────┴────────────────────────┴──────────────────────────────────────────────────────┘ | |
| MEDIUM — Reliability and resource usage | |
| ┌─────┬────────────────────────────────────────────┬──────────────────────────┬──────────────────────────────────────────────────────┐ | |
| │ ID │ Bottleneck │ Location │ Impact │ | |
| ├─────┼────────────────────────────────────────────┼──────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B9 │ No startup validation of operator │ handler/mod.rs:1050-1130 │ Misconfigured sockets fail at first request, not │ | |
| │ │ reachability │ │ startup │ | |
| ├─────┼────────────────────────────────────────────┼──────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B10 │ No startup validation of chain RPC health │ handler/mod.rs:340-718 │ Unhealthy RPC discovered at request time │ | |
| ├─────┼────────────────────────────────────────────┼──────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B11 │ No circuit breaker on chain RPC endpoints │ Throughout │ Cascading failures when RPC degrades │ | |
| ├─────┼────────────────────────────────────────────┼──────────────────────────┼──────────────────────────────────────────────────────┤ | |
| │ B12 │ No backpressure signaling to clients │ sync.rs, channel.rs │ Silent queueing → eventual timeout │ | |
| └─────┴────────────────────────────────────────────┴──────────────────────────┴──────────────────────────────────────────────────────┘ | |
| --- | |
| Proposed Improvements | |
| Tier 1: High-Impact, Low-Risk (implement first) | |
| 1. Quorum-based early exit for Commit phase | |
| The Commit phase (broadcast_evaluate_and_sign at operator.rs:1676) collects ALL responses from eligible operators with no quorum check. | |
| Compare with Prepare phase which exits as soon as stake threshold is met. | |
| Change: Refactor Commit broadcast to use broadcast_with_unified_quorum pattern — exit as soon as stake-weighted quorum is met. The BLS | |
| aggregation only needs quorum anyway; extra signatures are discarded. | |
| Expected improvement: With 5 operators at equal stake and 67% quorum, latency drops from p100 (slowest operator) to ~p60 (3rd fastest). | |
| 2. Pass pre-loaded ECDSA signer instead of re-loading per task | |
| build_fetch_policy_data_request calls load_ecdsa(signer) on every invocation (common.rs:312). The signer is already loaded at | |
| GatewayServiceImpl construction (handler/mod.rs:407-408) as task_generator_signer. | |
| Change: Modify build_fetch_policy_data_request signature to accept &PrivateKeySigner directly. Pass self.task_generator_signer from | |
| sync.rs and channel.rs. | |
| 3. Moka cache for PolicyConfig per (chain_id, policy_client) | |
| get_policy_config_for_client makes an on-chain RPC call for every task (common.rs:121-124). Policy configs rarely change — typically only | |
| on setPolicy calls. | |
| Change: Add moka::future::Cache<(u64, Address), PolicyConfig> to ChainService with 10-30s TTL. Check cache before RPC. Moka is already a | |
| workspace dependency. | |
| Expected improvement: Eliminates ~10-100ms RPC for repeat policy clients (the common case in production). | |
| 4. Snapshot stake data once per broadcast, not per response | |
| Every operator response in the quorum loop triggers operator_stake.read().await and total_stake.read().await (operator.rs:1072,1252). | |
| With N operators, that's 2N async RwLock acquisitions. | |
| Change: At the start of each broadcast, snapshot operator_stake and total_stake into local HashMaps. Use the snapshots for all quorum | |
| calculations within that broadcast. Stake data changes only on OperatorAdded/Removed events — negligible during a single 30s broadcast | |
| window. | |
| Expected improvement: Reduces async lock contention from O(N) to O(1) per broadcast. Significant at high operator counts. | |
| 5. Replace gateway_tasks with DashMap | |
| gateway_tasks: Arc<tokio::sync::RwLock<HashMap<String, HashSet<TaskId>>>> requires a write lock for every task creation (sync.rs:98-103). | |
| Under concurrent load, this serializes all task initiations. | |
| Change: DashMap<String, DashSet<TaskId>> — lock-free per-API-key granularity. DashMap is already used for aggregator.task_states. | |
| Tier 2: Medium-Impact, Medium-Risk | |
| 6. Pre-warm block caches on startup | |
| First task after gateway start hits a cold block cache → cache miss → RPC call with up to 3 retries at 100-800ms backoff | |
| (block_cache.rs:21-26). | |
| Change: In Gateway::from_multi_chain_config(), call block_cache.refresh() for all chains before returning. Fail startup if RPC | |
| unreachable. | |
| 7. Avoid response clones in normal (quorum success) path | |
| all_responses.push((operator_id, response_result.clone())) at operator.rs:1245 clones the entire RPC response for every operator. | |
| ConsensusPrepareResponse includes policyTaskData which can be large. | |
| Change: Only store (OperatorId, is_success) in the tracking vec. Keep the full response_result only in results (move, not clone). On | |
| quorum failure, reconstruct errors from the tracked failures. | |
| 8. Startup validation: operator reachability + contract existence | |
| Changes: | |
| - After operator pool init, send a lightweight JSON-RPC ping (e.g., newt_getPublicKey or custom newt_health) to each operator. Log | |
| warnings for unreachable operators. Pre-open circuit breakers. | |
| - During ChainService init, call provider.get_code(contract_addr) for TaskManager, IdentityRegistry, ConfidentialDataRegistry. Fail | |
| startup if code is empty (zero-address check is not enough — the address could be non-zero but point to an EOA or wrong contract). | |
| 9. Chain-level RPC circuit breaker | |
| Currently circuit breakers exist per-operator but not per-chain-RPC. If the Ethereum Sepolia RPC goes down, every task attempting block | |
| cache refresh or policy config fetch hangs for the full timeout. | |
| Change: Add a failsafe circuit breaker wrapping the ChainRpcProviderConfig HTTP provider. Open after N consecutive failures, fail-fast | |
| subsequent requests, half-open after cooldown. | |
| Tier 3: Architectural Improvements (higher effort, plan for later) | |
| 10. Streaming consensus: incremental median during Prepare | |
| Instead of waiting for all Prepare responses, maintain a running median as responses arrive. When quorum is met, consensus is already | |
| computed — the 1-10ms compute_consensus_from_unsigned call becomes near-zero. | |
| 11. Request deduplication for identical tasks | |
| Hash (policy_client, intent, wasm_args) → check if an in-flight task with the same hash exists → return the pending result. Under burst | |
| traffic with duplicates, eliminates redundant operator broadcasts. | |
| 12. Concurrent operator requests via HTTP/2 multiplexing | |
| Each OperatorWorker processes requests sequentially (one at a time from its mpsc channel). HTTP/2 already supports multiplexing at the | |
| transport layer. Allow N concurrent in-flight requests per operator via semaphore-bounded processing, not strict sequential. | |
| 13. Backpressure signaling | |
| When batch submitter queue depth exceeds a threshold, surface this to the RPC handler and return HTTP 503 (Service Unavailable) with a | |
| Retry-After header. Currently, tasks silently queue and may hit TaskCreatedBlockTooOld on-chain if they sit too long. | |
| --- | |
| Fail-Fast & Reliability Checklist | |
| Startup | |
| ┌──────────────────────────┬────────────────────────────────┬──────────────────────────────────────────────────┐ | |
| │ Check │ Current State │ Proposed │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Operator reachable │ Not checked │ Ping each operator, warn + pre-open CB │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Chain RPC healthy │ Implicit (first request fails) │ Eager block cache refresh, fail startup on error │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Contract deployed │ Zero-address check only │ get_code() check for critical contracts │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Database connectivity │ Checked │ Already good │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Redis connectivity │ Silently degrades │ Log prominently, mark feature as disabled │ | |
| ├──────────────────────────┼────────────────────────────────┼──────────────────────────────────────────────────┤ | |
| │ Batch submitter contract │ Zero-address fallback │ Already good │ | |
| └──────────────────────────┴────────────────────────────────┴──────────────────────────────────────────────────┘ | |
| Runtime | |
| ┌──────────────────────────┬─────────────────────────────┬─────────────────────────────────────────────┐ | |
| │ Check │ Current State │ Proposed │ | |
| ├──────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────┤ | |
| │ Chain RPC degradation │ No circuit breaker │ Add chain-level circuit breaker │ | |
| ├──────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────┤ | |
| │ Batch queue backpressure │ Silent queueing │ 503 when queue > threshold │ | |
| ├──────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────┤ | |
| │ Operator pool empty │ Returns error │ Already good, but add metric for visibility │ | |
| ├──────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────┤ | |
| │ Block cache staleness │ TTL-based refresh │ Add staleness metric, alert if > 2×TTL │ | |
| ├──────────────────────────┼─────────────────────────────┼─────────────────────────────────────────────┤ | |
| │ Aggregator zombie slots │ Periodic cleanup (30s-6min) │ Already good (cancel_task in all paths) │ | |
| └──────────────────────────┴─────────────────────────────┴─────────────────────────────────────────────┘ | |
| --- | |
| Execution Order | |
| I'd suggest: | |
| 1. B1 (Commit quorum early exit) + B2 (signer caching) + B4 (stake snapshot) — these are all in the hot path and can be done in parallel | |
| across files | |
| 2. B3 (policy config cache) — needs Moka cache struct added to ChainService | |
| 3. B5 (gateway_tasks DashMap) — mechanical type replacement | |
| 4. B7 (response clone avoidance) + B8 (block cache pre-warm) — cleanup | |
| 5. B9+B10 (startup validation) — reliability, not latency | |
| 6. B11 (chain RPC circuit breaker) — requires design discussion on failure modes | |
| The most impactful single change is B1 — making the Commit phase quorum-aware. Combined with B4 (stake snapshot), this removes the two | |
| biggest latency contributors in the normal path. | |
| --- | |
| What's your priority? Want me to start with the Tier 1 changes (B1-B5), or do you want to discuss the approach for any of these first? |
Now I have the full picture. Let me trace the exact timeline and identify every blocking point.
Gateway Two-Phase Consensus: Streaming Pipeline Analysis
Current Sequential Timeline
┌─── Client Request ───┐
│ │
t=0 │ Block fetch (cache) │ ~0ms (lock-free ArcSwap read)
│ Build Prepare RPC │ ~1ms (EIP712 hash + ECDSA sign)
│ │
t=1 ├─── PREPARE PHASE ────┤
│ │
│ → broadcast to N ops │ ~0ms (non-blocking try_send)
│ ← wait for quorum │ 100-30000ms ⬛ NETWORK + OPERATOR WORK
│ │
t=2 ├─── CONSENSUS ────────┤
│ │
│ Median computation │ 1-10ms (CPU, fast)
│ Collect partials │ <1ms (iterate prepare responses)
│ │
t=3 ├─── COMMIT PHASE ─────┤
│ │
│ Build Commit RPC │ <1ms (serialization)
│ → broadcast to ops │ ~0ms (non-blocking try_send)
│ ← wait for quorum │ 100-30000ms ⬛ NETWORK + OPERATOR WORK
│ │
t=4 ├─── BLS AGGREGATE ────┤
│ │
│ Compute digests │ <1ms (hash per response)
│ Fetch operator state │ 10-100ms ⬛ RPC CALL (BLOCKING!)
│ Fetch quorum state │ 10-100ms ⬛ RPC CALL (BLOCKING!)
│ Aggregate signatures │ <1ms (BLS point addition)
│ │
t=5 ├─── RETURN ───────────┤
│ Build response │ 1-10ms
└──────────────────────┘
Client-facing latency = t1 + t2 + t3 + t4 + t5
Where:
- t1 (Prepare) = dominant — operator WASM execution + network
- t2 (Consensus) = negligible (1-10ms)
- t3 (Commit) = dominant — operator policy eval + BLS sign + network
- t4 (BLS Aggregate) = 10-100ms wasted on two sequential RPC calls that could be pre-fetched
- t5 (Response) = negligible
★ Insight ─────────────────────────────────────
Three concrete streaming opportunities, in order of impact:
-
Pre-fetch operator/quorum registry state during Commit broadcast (~10-100ms saved)
The two RPC calls in aggregate_responses (get_operators_avs_state_at_block and get_quorums_avs_state_at_block) depend only on
task_created_block and quorum_numbers — both known BEFORE Commit starts. These can run in parallel with the Commit broadcast, so by the
time Commit responses arrive, the state is already fetched. -
Streaming BLS aggregation during Commit (merges t3+t4 into one phase)
Instead of collecting all Commit responses into a Vec then calling aggregate_responses, process each response as it arrives — compute
digest and accumulate BLS signature immediately. When quorum is met, the aggregation result is ready instantly. This eliminates the
sequential aggregation step entirely. -
Policy-aware single-phase fast path (halves latency for deterministic policies)
The two-phase protocol exists because operators independently fetch time-varying data (oracle prices) that may differ. But for policies
that use only deterministic data sources (identity, confidential, allowlist — no WASM data fetching), all operators will produce
identical results. For these policies, skip Prepare entirely and go straight to Commit — operators evaluate + BLS sign in one round-trip.
─────────────────────────────────────────────────
Proposed Streaming Architecture
Level 1: Pre-fetch operator state (easy, ~10-100ms)
CURRENT: PROPOSED:
Commit broadcast ────────── wait Commit broadcast ────────── wait
│ │
↓ Pre-fetch state ────────────┤
Fetch operator state ────── wait │
Fetch quorum state ──────── wait ↓
│ Aggregate (state ready) ─── done
Aggregate ────────────────── done
In broadcast_two_phase_and_aggregate, spawn the state fetch before starting the Commit broadcast:
// Start pre-fetching operator/quorum state — runs in parallel with Commit broadcast.
// These depend on aggregator_block + quorum_numbers, both known before Commit starts.
let prefetch_handle = {
let aggregator = cs.aggregator.clone();
let quorum_nums = quorum_numbers.to_vec();
tokio::spawn(async move {
aggregator.prefetch_operator_state(aggregator_block, &quorum_nums).await
})
};
// Commit broadcast (runs in parallel with prefetch)
let commit_responses = cs.operator_pool.broadcast_evaluate_and_sign(...).await?;
// State already fetched (or nearly done) — no RPC wait
let prefetched_state = prefetch_handle.await??;
let response = cs.aggregator.aggregate_with_prefetched_state(
task_id, commit_responses, prefetched_state
).await?;
Level 2: Streaming BLS during Commit (medium, eliminates aggregate step)
Instead of the current pattern:
broadcast_evaluate_and_sign → Vec
aggregate_responses(Vec) → BlsAggregationServiceResponse
Merge them into a single streaming pipeline:
broadcast_and_aggregate → BlsAggregationServiceResponse
Each Commit response is immediately fed into the BLS aggregation:
loop {
response = next_commit_response().await;
// Immediately compute digest + accumulate BLS signature
aggregation_state.process_response(response, &operator_state)?;
// Check if quorum met with current aggregate
if aggregation_state.is_quorum_met() {
break; // Aggregation result ready NOW
}
}
return aggregation_state.finalize();
This means the BLS aggregation result is available the exact moment quorum is met — zero additional processing time.
Level 3: Single-phase fast path for deterministic policies (halves latency)
For policies with no time-varying data sources, the Prepare phase is redundant — operators will produce identical policyTaskData. The
gateway could detect this and skip straight to a combined Prepare+Commit call:
CURRENT (two-phase, all policies):
Prepare (5-30s) → Consensus → Commit (5-30s) = 10-60s
PROPOSED (fast path, deterministic policies):
CombinedEvaluateAndSign (5-30s) = 5-30s (50% reduction!)
Detection: add a deterministic: bool flag to the policy metadata. When true, operators fetch + evaluate + sign in a single RPC call. The
gateway skips Prepare entirely.
This is a protocol-level change that requires a new operator RPC method (newt_evaluateTask — combined fetch + evaluate + sign). It should
be a follow-up after Levels 1 and 2.
Recommendation
┌───────────────────────────┬───────────────────────────┬────────────────────────┬──────────────┐
│ Level │ Impact │ Complexity │ Priority │
├───────────────────────────┼───────────────────────────┼────────────────────────┼──────────────┤
│ 1: Pre-fetch state │ 10-100ms saved │ Low │ Do now │
├───────────────────────────┼───────────────────────────┼────────────────────────┼──────────────┤
│ 2: Streaming BLS │ Eliminates aggregate step │ Medium │ Do now │
├───────────────────────────┼───────────────────────────┼────────────────────────┼──────────────┤
│ 3: Single-phase fast path │ 50% latency reduction │ High (protocol change) │ Follow-up PR │
└───────────────────────────┴───────────────────────────┴────────────────────────┴──────────────┘
Levels 1+2 can be done on the current branch. Level 3 requires a new operator RPC method and policy metadata — it's the biggest win but
needs design discussion.
Want me to implement Levels 1 and 2?
✻ Baked for 2m 24s
❯ Pre-fetch operator/quorum registry state during Commit broadcast (~10-100ms saved) --> for this, why not just fetch once in Operator
registry every epoch, and use that? what are risks? I thought we already do this
Searched for 2 patterns (ctrl+o to expand)
⏺ Let me check the EigenSDK AvsRegistryServiceChainCaller cache:
Read 1 file (ctrl+o to expand)
⏺ ★ Insight ─────────────────────────────────────
You're right — we already have a cache. The AvsRegistryServiceChainCaller has an operator_state_cache with:
- Per-quorum caching (operator_state_cache.get(quorum_num))
- Block-based validity: entry.block_num >= block_num (serves cached data for same or newer blocks)
- TTL-based expiry: CACHE_TTL_SECS
So get_operators_avs_state_at_block is already fast on cache hit. The 10-100ms I estimated was the cold/miss case. In production, the
cache is warm from the previous task, so these calls are likely <1ms.
─────────────────────────────────────────────────
Let me verify the TTL and check if it's the same for quorum state:
Searched for 1 pattern (ctrl+o to expand)
⏺ 300 seconds (5 minutes) TTL. Both operator state and quorum state use the same cache with the same TTL.
So here's the reality:
Level 1 (pre-fetch) is unnecessary. The operator/quorum registry state is already cached with a 5-minute TTL. After the first task warms
the cache, subsequent tasks hit the cache in <1ms. The 10-100ms RPC call only happens once every 5 minutes or on the very first request
(which is already handled by the startup pre-warm in Tier 1).
Risks of relying on the cache:
- Stale operator set: If an operator registers or deregisters mid-epoch, the cache serves stale data for up to 5 minutes. BLS aggregation
could use wrong APK → InvalidQuorumApkHash. This is already documented in lessons.md as a known issue ("Operator registry cache is
static... restart the gateway after deregistering an operator"). - Block number mismatch: The cache uses entry.block_num >= block_num, meaning it can serve data from a NEWER block when asked for a
historical block. This is a known gotcha from lessons.md.
These risks already exist and are accepted. The pre-fetch optimization would have the same risks since it would hit the same cache.
Bottom line: Level 1 is already solved by the existing cache. Skip it entirely.
That leaves:
- Level 2 (Streaming BLS during Commit) — merges Commit + Aggregate, saves the sequential aggregation step
- Level 3 (Single-phase fast path for deterministic policies) — 50% latency reduction, protocol change
For Level 2, given that aggregate_responses is mostly cache hits + fast CPU work (<5ms total with warm cache), the win is also marginal.
The real dominance is the two network-bound phases.
Level 3 is by far the biggest win. Halving latency for deterministic policies (identity, confidential, allowlist — the majority of
production use cases) from 2 round-trips to 1 is a fundamentally larger improvement than shaving milliseconds off the aggregation step.
Want me to design Level 3 (single-phase fast path) as the next optimization? Or do you still want Level 2 first?
✻ Worked for 1m 29s
no Level 3 (single-phase fast path) is too unreliable because we have to trust the developer that they correctly set that param on the
poilcy or policy data contract. also, policy can always change what policy data(s) to use
⏺ You're right — trusting a developer-set flag is fragile. Policy clients can change their policyData references at any time, and a
previously-deterministic policy could start using time-varying data sources without updating the flag. The gateway would skip consensus
when it shouldn't, producing mismatched BLS signatures.
So the honest assessment:
The two-phase sequential pipeline is fundamentally correct for the problem it solves — you can't skip Prepare when you don't know if
operators will agree. And the remaining optimizations (Level 1 pre-fetch, Level 2 streaming BLS) save at most 5-10ms on a 5-30 second
pipeline — not meaningful.
The real latency bottleneck is operator processing time during both phases (WASM execution, on-chain data fetching, Rego evaluation, BLS
signing), not gateway architecture. The gateway is already doing the right thing: non-blocking broadcast, quorum-based early exit, stake
snapshots, concurrent per-operator requests.
At this point, the gateway optimizations across all three tiers are solid:
┌─────────────────────────────────┬────────────────────────────────────────────────────┐
│ What we optimized │ Where the time goes now │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Gateway overhead between phases │ <10ms (negligible) │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Lock contention │ Eliminated (DashMap, stake snapshots) │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Redundant RPC calls │ Eliminated (PolicyConfig to operator, block cache) │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Tail latency │ Reduced (quorum early exit in both phases) │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Head-of-line blocking │ Eliminated (concurrent operator requests) │
├─────────────────────────────────┼────────────────────────────────────────────────────┤
│ Remaining: operator processing │ 5-30 seconds per phase — this is the next frontier │
└─────────────────────────────────┴────────────────────────────────────────────────────┘
The next meaningful improvement would be operator-side optimization — faster WASM execution, faster Rego evaluation, operator-side
caching of policy data/rules, or reducing the work operators do per task. That's a different domain from gateway architecture.
Want to shift focus to operator-side performance, or is there anything else on the gateway side you want to explore?
❯ what about concurrency in terms of multiple tasks being processed? does gateway already support concurrent tasks in parallel?
⏺ Yes — let me trace exactly how concurrent tasks flow through the gateway to verify there are no hidden serialization points.
Searched for 1 pattern (ctrl+o to expand)
⏺ Let me trace the concurrency model from HTTP request arrival through to completion:
Read 1 file (ctrl+o to expand)
⏺ ★ Insight ─────────────────────────────────────
Yes, the gateway is fully concurrent for multiple tasks. Here's why:
Axum's concurrency model: Each HTTP request to /rpc is handled by its own tokio task. Axum spawns a task per connection, and within each
connection, each request is processed concurrently. There's no global mutex or serialization point at the HTTP layer.
The full concurrency trace for 3 simultaneous tasks:
Task A (t=0): rpc_handler → create_task → broadcast_two_phase_and_aggregate
↓
Prepare broadcast (async, waits on operators)
↓
Task B (t=0): rpc_handler → create_task → broadcast_two_phase_and_aggregate
↓
Prepare broadcast (async, waits on operators)
↓
Task C (t=0): rpc_handler → create_task → broadcast_two_phase_and_aggregate
↓
Prepare broadcast (async, waits on operators)
All three tasks proceed through their two-phase pipeline independently and concurrently. While Task A is waiting for Prepare responses,
Tasks B and C are also broadcasting and waiting — they don't block each other.
Every shared resource is concurrent-safe:
- DashMap for gateway_tasks — lock-free per-key
- DashMap for aggregator.task_states — lock-free per-task
- ArcSwap for block_cache — lock-free reads
- Arc with per-operator workers — concurrent dispatch (Tier 3)
- Arc with mpsc channels — non-blocking send
- Operator stake snapshots — local per-broadcast, no contention
─────────────────────────────────────────────────
There are zero serialization points between concurrent tasks. The only shared contention point was gateway_tasks with RwLock, which we
already replaced with DashMap in Tier 1.
The sendTask async path (channel.rs) is even more concurrent — it spawns the entire two-phase pipeline into a background tokio::spawn and
returns immediately with a task ID, so the HTTP handler is freed instantly.
One area worth examining: the batch submitter's mpsc channel. When many tasks complete BLS aggregation simultaneously, they all send to
the same combined_tx channel. But mpsc::send is non-blocking until the channel is full (capacity = 240), and we added backpressure 503 in
Tier 2 to handle that. So this is already covered.
The gateway's concurrent task processing is solid. No hidden bottlenecks.
From our original analysis, the Tier 3 architectural improvements are:
┌─────┬───────────────────────────────────────────────────────────────┬───────────────────────────────────┬────────────┐
│ # │ Change │ Impact │ Complexity │
├─────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────┼────────────┤
│ T10 │ Streaming consensus: incremental median during Prepare │ Shaves 1-10ms from consensus step │ Medium │
├─────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────┼────────────┤
│ T11 │ Request deduplication for identical tasks │ Burst traffic reduction │ Medium │
├─────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────┼────────────┤
│ T12 │ Concurrent operator requests via HTTP/2 multiplexing │ Higher per-operator throughput │ Medium │
├─────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────┼────────────┤
│ T13 │ Parallelize BLS signature verification in aggregate_responses │ CPU-bound aggregation speedup │ Low-medium │
└─────┴───────────────────────────────────────────────────────────────┴───────────────────────────────────┴────────────┘
★ Insight ─────────────────────────────────────
Tier 3 reality check: These are diminishing returns compared to Tier 1+2. The consensus computation (T10) takes 1-10ms — negligible vs.
the 100-30000ms broadcast phases. Request dedup (T11) helps under burst but adds complexity. HTTP/2 concurrent requests (T12) requires
rethinking the per-operator sequential worker. BLS parallelization (T13) is the simplest win — spawn_blocking for CPU-bound verification.
The highest ROI for Tier 3 is T13 (BLS parallelization) followed by T11 (request dedup), then T12 (concurrent operator requests). T10 is
barely worth the complexity.
─────────────────────────────────────────────────
Do you want to proceed with all four, or focus on the highest-impact ones (T13 + T11)?
we don;'t need T11 (request dedup) because task request id is created always by gateway. LEt's do Parallelize BLS signature verification
in aggregate_responses and then look into Concurrent operator requests via HTTP/2 multiplexing
⏺ You're right — since we always auto-generate unique request IDs (H2 from Tier 1), dedup is unnecessary. Let me start with T13 (BLS
parallelization), then T12 (concurrent operator requests).
Let me create a new branch for Tier 3 off the current tier 2:
⏺ Bash(git checkout -b perf/gateway-tier3-parallelism)
⎿ Switched to a new branch 'perf/gateway-tier3-parallelism'
⏺ Let me explore the BLS aggregation code to understand what's CPU-bound and where parallelization helps:
Searched for 2 patterns (ctrl+o to expand)
⏺ Let me read the core aggregate_responses implementation and process_signed_response to understand the sequential flow:
Read 2 files (ctrl+o to expand)
⏺ Now let me read the aggregate_responses function in the core that processes multiple responses:
Read 1 file (ctrl+o to expand)
⏺ ★ Insight ─────────────────────────────────────
Now I understand the full aggregate_responses flow. The CPU-bound work is:
state (aggregated_operators_map), so it can't easily be parallelized
The first loop (digest computation) is embarrassingly parallel — each compute_consensus_digest is independent. The second loop (BLS
aggregation) is inherently sequential because each call mutates the running aggregate.
However, looking more carefully: the real bottleneck is the two RPC calls at lines 1894-1907 (get_operators_avs_state_at_block and
get_quorums_avs_state_at_block). These are async I/O, not CPU-bound. The digest computation loop with 5-10 operators takes <1ms. The BLS
aggregation loop is also fast.
T13 has diminishing returns for the current operator count. The digest computation is O(N) where N is typically 5-10 operators —
sub-millisecond. Parallelizing it would save microseconds.
─────────────────────────────────────────────────
The real opportunity is actually in the first loop (lines 1785-1794) — we can compute all digests in parallel using rayon or
tokio::task::spawn_blocking. But with 5-10 operators, the overhead of spawning tasks exceeds the savings.
Let me look at T12 (concurrent operator requests) instead — that has much higher impact because it affects the 100-30000ms broadcast
phases:
Read 1 file (ctrl+o to expand)
⏺ ★ Insight ─────────────────────────────────────
The sequential bottleneck is real but not where I expected. Each OperatorWorker processes messages one at a time (while let Some(msg) =
self.rx.recv().await at line 608). This means if operator A is handling a Prepare request (which takes 5-30 seconds), a Commit request
for a DIFFERENT task must wait in the mpsc queue until Prepare finishes.
However, this only matters when multiple tasks are in-flight concurrently to the same operator. HTTP/2 multiplexing is already enabled at
the transport layer (http2_prior_knowledge() on the reqwest client). The issue is the OperatorWorker serializes at the application
layer.
But wait — looking at the broadcast pattern: each broadcast creates fresh oneshot channels per operator. The worker processes them
sequentially, but since Prepare and Commit for the SAME task are always sequential (Prepare finishes before Commit starts), the
single-worker model doesn't bottleneck a single task flow. It only bottlenecks when multiple concurrent tasks compete for the same
operator.
For the current architecture (one gateway serving 10+ tps), concurrent operator requests would help. The fix: replace the sequential
while let loop with a semaphore-bounded spawn per request.
─────────────────────────────────────────────────
This is a meaningful change that affects the core worker model. Let me think about the approach and whether it's safe:
Read 1 file (ctrl+o to expand)
⏺ Now I have the full picture. The change is: replace the sequential while let loop with a semaphore-bounded concurrent model. Each
incoming request gets its own spawned task, but a semaphore limits concurrency per operator (e.g., max 4 concurrent requests).
Key safety considerations:
quorum), not by the worker.
Add the max_concurrent_requests field to the worker struct and thread it through spawn:
Now to answer your earlier question:
★ Insight ─────────────────────────────────────
Why default 4, and how to set it smartly?
The default of 4 is based on:
operator
throughput gain
Smart sizing formula: max_concurrent_requests = ceil(target_tps / num_operators) * 2
~8-10
For production with 10 replicas, 4 is reasonable. If operators have headroom, bump to 8.
★ Insight ─────────────────────────────────────
The "right" default depends on what bottleneck you're solving for.
The two-phase protocol means each task generates exactly 2 sequential operator requests (Prepare, then Commit). The question is: how many
different tasks should overlap on the same operator from a single gateway?
Key variables:
Per-operator concurrent requests from one gateway:
At steady state, each gateway has T/N tasks per second per operator. Each task occupies the operator for P + C seconds total (though
Prepare and Commit don't overlap for the same task). So the average in-flight requests per operator per gateway is:
in_flight = (T / N) * (P + C)
Example: 10 tps per gateway, 5 operators, 4s average (P+C):
in_flight = (10 / 5) * 4 = 8
So max_concurrent_requests = 8 would be right for this scenario.
With 10 replicas, each gateway handles total_tps / 10:
The default 4 is conservative — it assumes ~5 tps per gateway with fast operators (~2s P+C). For production with higher throughput
targets, it should be 8-12.
─────────────────────────────────────────────────
My recommendation: change the default to 8. It handles up to 10 tps/gateway with 5 operators at 4s latency, which matches your production
topology (10 replicas, ~100 tps aggregate target). 4 is too conservative and would cause head-of-line blocking under your actual load.
Want me to bump the default to 8?