RFC drafts: causal scoped replication and DWN delivery/sync design

Proposal: DWN-to-DWN Sync and Multi-Party Record Delivery

Status

Draft — capturing research and design direction for future implementation.

Problem Statement

DWN synchronization today is entirely agent-initiated. The user's agent (running on their device) holds a local DWN and orchestrates bidirectional sync with each remote DWN listed in their DID document's #dwn service endpoint:

Agent (local DWN)  <--pull/push-->  Remote DWN A (provider 1)
Agent (local DWN)  <--pull/push-->  Remote DWN B (provider 2)

This creates two related problems:

Problem 1: Cross-Provider Sync

If a user has DWN endpoints at two providers, those providers cannot sync with each other. All data must flow through the agent:

Provider A  <---agent--->  Agent  <---agent--->  Provider B
  • The agent must be online for data to propagate between providers
  • Every byte flows through the user's device twice
  • Mobile/constrained agents become a bottleneck
  • If the agent is offline, providers diverge until the next agent sync

Problem 2: Multi-Party Record Delivery

Multi-party protocols use an owner-centric model. A chat thread with 5,000 participants:

Alice creates the thread in HER DWN
Bob writes a chat message to ALICE'S DWN
Carol writes a chat message to ALICE'S DWN
...all 5,000 participants read/write to Alice's DWN

There is no delivery mechanism. When Bob writes a chat message to Alice's DWN, Carol doesn't get notified unless she's actively subscribed or her agent is polling. There's nothing pushing that message to Carol's DWN.

Both problems involve a DWN server proactively sending messages to another DWN server. The difference is:

  • Sync: messages flow between servers that host the same DID's data
  • Delivery: messages flow between servers that host different DIDs who participate in the same protocol context

For a large provider, both can be handled by the same outbound delivery infrastructure.


Part 1: Provider-to-Provider Sync (Same-Tenant, Cross-Provider)

Approach: Delegated Sync Grants

The agent grants each provider DWN a permission to sync on its behalf with the tenant's other DWN endpoints. The provider runs a ServerSyncEngine that uses the existing MessagesSync/MessagesRead protocol, authenticated via delegated grants.

How It Works

  1. During registration (or via a new protocol message), the agent creates permission grants for the provider's server DID:

    • Messages/Sync -- compare tree state on peer DWNs
    • Messages/Read -- fetch missing messages from peer DWNs
  2. The provider resolves the tenant's DID document #dwn service endpoints to discover peer DWN URLs.

  3. The provider runs a ServerSyncEngine per tenant that:

    • Performs SMT tree-walk against each peer DWN (same algorithm as SyncEngineLevel.walkTreeDiff())
    • Pulls missing messages from peers via MessagesRead and writes them locally via processMessage()
    • Pushes local messages to peers — the messages are already signed by their original authors, so the receiving DWN authorizes based on the original signature, not the provider's identity
    • Uses MessagesSubscribe on peers for real-time sync (live mode)

Provider A  <--MessagesSync/Read (delegated grant)-->  Provider B
     |                                                      |
     +-- Agent grants sync delegation to both providers ----+
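
As a non-normative illustration of the pull half of this loop, here is a minimal per-tenant sketch. The PeerDwnClient and LocalDwn interfaces and their method names are assumptions for illustration, not the actual dwn-server API; a real engine would reuse the existing walkTreeDiff and MessagesRead machinery described above.

interface PeerDwnClient {
  /** SMT tree-walk: return messageCids present on the peer but missing locally. */
  walkTreeDiff(tenantDid: string): Promise<string[]>;
  /** Fetch a full signed message (+ data) by messageCid via MessagesRead. */
  readMessage(tenantDid: string, messageCid: string): Promise<unknown>;
}

interface LocalDwn {
  /** Idempotent apply; duplicates return 409 and are a no-op. */
  processMessage(tenantDid: string, message: unknown): Promise<void>;
}

async function syncTenantWithPeers(
  tenantDid: string,
  peers: PeerDwnClient[],
  local: LocalDwn
): Promise<void> {
  for (const peer of peers) {
    // 1. Compare SMT state (authorized by the delegated Messages/Sync grant).
    const missingCids = await peer.walkTreeDiff(tenantDid);

    // 2. Pull each missing message (Messages/Read grant). The original
    //    author's signature travels with it, so the receiving DWN authorizes
    //    on that signature, not on the provider's identity.
    for (const cid of missingCids) {
      const message = await peer.readMessage(tenantDid, cid);
      await local.processMessage(tenantDid, message);
    }
  }
}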

Why No RecordsWrite Grant Is Needed

The messages being synced are already signed by their original author. When Provider A pushes a message to Provider B, it sends the original RecordsWrite (with the author's signature intact). Provider B's processMessage() validates the original author's signature and authorization. The provider only needs:

  • Messages/Sync -- to run the SMT tree-walk against the peer
  • Messages/Read -- to fetch the full message + data from the peer by messageCid

This means the delegation is read-only sync authority -- the provider can read and relay messages between peers but cannot fabricate new writes. This is a much smaller trust surface.

Fan-Out on Receipt

When a provider receives a message via sync from a peer, it SHOULD NOT trigger outbound delivery/fan-out for that message. Only messages received via direct author writes trigger delivery. This prevents amplification loops. A convention or flag distinguishes "written by an author" from "arrived via sync."

What Needs to Change

  • Agent: New registerSyncDelegation() method that creates and distributes the grants
  • DWN Server: New ServerSyncEngine class (reuses walkTreeDiff and pullMessages/pushMessages logic from the agent's SyncEngineLevel)
  • DWN Server: Tenant registry extended with peer DWN URLs and grant references
  • Spec: No changes -- everything uses existing message types and authorization

Part 2: Multi-Party Record Delivery

Foundational Properties

Two properties of DWN make delivery tractable:

  1. Messages are self-certifying. The original author's signature travels with the message. Any intermediary can relay it. The receiving DWN verifies the author's signature and protocol $actions independently -- no trust in the relay required.

  2. processMessage() is idempotent. Duplicates return 409. Over-delivery is free (just wasted bandwidth), eliminating the need for exactly-once coordination and allowing layered strategies where fast-path and slow-path overlap without correctness risk.

Context Establishment (The Bootstrap)

Before any delivery strategy works, the participant's DWN needs the minimum viable context:

  • The thread/channel root record
  • The participant's own role record
  • The participant's context encryption key (if the protocol uses encryption)

This initial fan-out happens at participant-add time using direct delivery, regardless of the ongoing delivery strategy. When Alice writes a thread/participant role record with recipient: bob.did:

  1. Alice's provider delivers to Bob's DWN:
    • The thread root record
    • Bob's participant role record
    • The context key record (for decryption)
  2. Bob's DWN now has everything it needs to authorize subsequent messages via standard protocol $actions

This is the "join handshake" that sets up everything else.

Determining Delivery Targets

A provider can determine who needs a message entirely from local data:

Given a new RecordsWrite for (protocol, protocolPath, contextId):

1. Explicit recipient: descriptor.recipient (if present)

2. Role-based discovery:
   - Get $actions rules at this protocolPath from the protocol definition
   - For each rule with a role granting "read":
     Query locally for role records:
       { protocol, protocolPath: <rolePath>,
         isLatestBaseState: true,
         contextId: <prefix matching parent context> }
   - Each matching record's descriptor.recipient is a delivery target

3. Actor-based discovery:
   - For rules with who: "author"/"recipient" of: <path>:
     Walk the record chain via parentId to find the ancestor
     Extract the author or recipient of that ancestor

4. Group targets by DWN service endpoint (resolved from DID documents)

All routing metadata (protocol, protocolPath, contextId, recipient, role records) is in cleartext on the message. Encrypted data payloads are opaque bytes. Relays never need to decrypt. All three delivery strategies work identically for encrypted and unencrypted protocols.
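
As an illustration of step 2 (role-based discovery), a minimal TypeScript sketch of the local role-record query. The store interface and field names approximate the filter shape in the pseudocode above and are assumptions, not the exact dwn-sdk-js API.

interface RoleRecord {
  descriptor: { recipient: string };
}

interface LocalRecordStore {
  query(filter: {
    protocol: string;
    protocolPath: string;          // the role path named by the $actions rule
    contextId: string;             // prefix matching the parent context
    isLatestBaseState: boolean;
  }): Promise<RoleRecord[]>;
}

async function roleBasedTargets(
  store: LocalRecordStore,
  protocol: string,
  rolePath: string,
  contextIdPrefix: string
): Promise<string[]> {
  const roleRecords = await store.query({
    protocol,
    protocolPath: rolePath,
    contextId: contextIdPrefix,
    isLatestBaseState: true,
  });
  // Each matching role record's recipient is a delivery target DID.
  return roleRecords.map((record) => record.descriptor.recipient);
}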

Delivery Strategy: $delivery

Protocol authors declare a delivery strategy at the protocol path level using a new $delivery directive. Different paths in the same protocol can use different strategies.


Strategy 1: Direct Delivery ($delivery: "direct")

For: Small participant sets (2-50 participants) across a handful of providers.

The provider hosting the DWN where a write lands resolves the participant set, groups them by provider, and pushes the message directly to each provider.

Flow

1. RecordsWrite arrives at Alice's DWN on Provider A
2. Provider A queries the participant set for this context
3. Groups participants by DWN service endpoint
4. For each distinct provider endpoint:
   - Sends the original signed message (+ data) once
   - Includes a delivery manifest: tenant DIDs at that provider who should receive it
5. Receiving provider calls processMessage(tenantDid, message) for each listed tenant
   - 202: stored
   - 409: already have it (no-op)
6. Co-located participants (also on Provider A): local processMessage()
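
As a rough illustration only (the manifest's exact format is listed as an open spec question later), a hypothetical shape for the delivery manifest referenced in step 4:

// Hypothetical delivery manifest: a lightweight envelope around the original
// signed message. Field names are assumptions for illustration, not spec.
type DeliveryManifest = {
  /** DIDs of tenants at the receiving provider who should get this message. */
  recipientTenants: string[];
  /** The original, author-signed RecordsWrite message, untouched. */
  message: unknown;
  /** Optional inlined data payload, when small enough to ship alongside. */
  data?: Uint8Array;
};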

Cost

O(P) cross-network messages where P = number of distinct providers. For 30 participants across 4 providers, that's 3 outbound requests (the 4th is local).

Best For

DMs, small group chats, request/response workflows (credential issuance, purchase flows), any protocol with a small, explicit participant set.


Strategy 2: Provider-Relay Delivery ($delivery: "relay")

For: Medium to large participant sets (50-10,000+) spread across many providers (10-100+).

Delivery responsibility is shared across participating providers using deterministic relay assignment, so no single provider bears the full fan-out cost.

Core Mechanism: Rendezvous Hashing

When a message needs delivery across P providers:

1. Compute the set of participating providers for this context
   (derived from participant DIDs -> DWN service endpoints, cached)
2. For each provider Pi, compute:
   score_i = HMAC-SHA256(contextId || messageCid, Pi.endpoint)
3. Sort providers by score descending
4. Top-k providers are "relay coordinators" (k = ceil(log2(P)), min 2, max 5)
5. Origin sends message to the k relay coordinators
6. Each coordinator forwards to a deterministic subset of remaining providers
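
A minimal sketch of the coordinator assignment, assuming Node's crypto HMAC with the keying shown above (contextId || messageCid as the message, the provider endpoint as the key). Parameter choices mirror the description and are illustrative, not a final specification.

import { createHmac } from 'node:crypto';

function relayCoordinators(
  providerEndpoints: string[],   // participating provider endpoints for this context
  contextId: string,
  messageCid: string
): string[] {
  const scored = providerEndpoints.map((endpoint) => ({
    endpoint,
    // score_i = HMAC-SHA256(contextId || messageCid, endpoint)
    score: createHmac('sha256', endpoint).update(contextId + messageCid).digest('hex'),
  }));

  // Sort by score descending; every provider computes the same order independently.
  scored.sort((a, b) => (a.score < b.score ? 1 : a.score > b.score ? -1 : 0));

  // k = ceil(log2(P)), clamped to [2, 5].
  const k = Math.min(5, Math.max(2, Math.ceil(Math.log2(providerEndpoints.length))));
  return scored.slice(0, k).map((entry) => entry.endpoint);
}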

Properties of rendezvous hashing:

  • Deterministic: every provider independently computes the same assignment with no coordination
  • Resilient: when a provider goes down, its responsibilities redistribute automatically
  • Stable: adding/removing providers only affects 1/P of the assignments
  • No shared state: requires only a shared list of participating providers

Flow

1. RecordsWrite arrives at Alice's DWN on Provider A
2. Provider A computes relay assignment:
   - Relay coordinators: [Provider A (self), Provider C, Provider F]
   - Provider A's relay targets: [B, D, E]
   - Provider C's relay targets: [G, H, I, J]
   - Provider F's relay targets: [K, L, M]
3. Provider A delivers to:
   - Local tenants (processMessage)
   - Providers B, D, E (direct, as relay coordinator)
   - Providers C and F (with relay manifest -- they forward to their targets)
4. Provider C receives, delivers locally, forwards to G, H, I, J
5. Provider F receives, delivers locally, forwards to K, L, M

Overlap for Reliability

Each relay coordinator also delivers to one target from an adjacent coordinator's list. Since processMessage() is idempotent, this means a few duplicates in exchange for fault tolerance -- if a coordinator is down, the overlap ensures its targets still get the message within one hop.

Cost

  • Origin load: O(k) where k = ceil(log2(P)), typically 2-5
  • Total cross-network messages: P - 1 (same as direct, but load distributed)
  • Max latency: 2 hops (origin -> coordinator -> final provider)

Provider Set Caching

The mapping from contextId -> participant DIDs -> provider endpoints changes only when participants are added/removed. Providers cache this mapping and invalidate on role record writes/deletes for the context.

Topology Considerations

  • 3-4 big providers: k=2 coordinators, minimal overhead -- barely different from direct
  • 100 small providers: k=5 coordinators each forward to ~20 -- manageable, well-distributed
  • The rendezvous hashing distributes evenly regardless of provider size asymmetry

Best For

Group chats, community channels, collaborative workspaces, any protocol with a medium-to-large participant set where the participant set is enumerable via role records.


Strategy 3: Subscribed Delivery ($delivery: "subscribe")

For: Asymmetric fan-out (few writers, many readers), public/broadcast contexts, or contexts where enumerating participants is impractical.

This inverts the push model. Participant providers subscribe to the origin DWN for new records in the protocol context, using DWN's existing RecordsSubscribe infrastructure.

When This Makes Sense

  • A social media feed: one user posts, thousands follow
  • A protocol with $actions: [{ who: "anyone", can: ["read"] }] -- you can't enumerate "everyone"
  • A bulletin board / announcement channel where readers vastly outnumber writers
  • Any context where published: true records are the norm

Flow

1. Carol's provider wants records from Alice's DWN for context X
   (Carol has a role granting read, or records are published)
2. Carol's provider opens a RecordsSubscribe WebSocket to Alice's DWN:
   - filter: { protocol, contextId, protocolPath }
   - protocolRole: "thread/participant" (or omitted for published records)
   - cursor: last known position (for catch-up)
3. Alice's DWN streams matching events:
   - Catch-up: replay stored events since cursor
   - EOSE marker
   - Live: new events as they occur
4. Carol's provider calls processMessage(carol.did, message) locally
5. On disconnect, Carol's provider reconnects with last cursor (gapless resume)
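
A condensed sketch of the subscribing provider's side of this flow. The SubscriptionClient interface is hypothetical; a real implementation would ride the existing RecordsSubscribe WebSocket machinery, and the filter values shown are placeholders.

interface SubscriptionEvent {
  cursor: string;       // resumable position in the origin's event stream
  message: unknown;     // the original, author-signed message
}

interface SubscriptionClient {
  subscribe(options: {
    filter: { protocol: string; contextId: string; protocolPath?: string };
    protocolRole?: string;
    cursor?: string;                                   // last known position for catch-up
    onEvent(event: SubscriptionEvent): Promise<void>;
  }): Promise<{ close(): void }>;
}

// One subscription per provider, fanned out to all locally hosted followers.
async function followContext(
  client: SubscriptionClient,
  localTenants: string[],
  processMessage: (tenantDid: string, message: unknown) => Promise<void>,
  saveCursor: (cursor: string) => Promise<void>,
  lastCursor?: string
) {
  return client.subscribe({
    filter: { protocol: 'https://example.com/community', contextId: '<context-x>' },
    protocolRole: 'thread/participant',
    cursor: lastCursor,
    async onEvent(event) {
      // Local fan-out: processMessage is idempotent, so over-delivery is harmless.
      for (const tenantDid of localTenants) {
        await processMessage(tenantDid, event.message);
      }
      await saveCursor(event.cursor);                  // gapless resume on reconnect
    },
  });
}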

Provider-Grouped Subscriptions

If Provider B hosts 500 users who follow Alice, Provider B opens one subscription to Alice's DWN (not 500). Provider B distributes locally. This mirrors the Matrix "send once per server" and ActivityPub "shared inbox" optimizations.

Hybrid With Push

For latency-sensitive records (mentions, direct replies with an explicit recipient), the origin provider MAY additionally perform direct delivery -- even in a subscribe-mode context. Subscribe handles bulk fan-out; direct delivery handles notifications.

Cost

  • O(1) from origin's perspective per subscribing provider (WebSocket + event stream)
  • The cost is borne by the subscriber's provider, not the origin
  • Origin's work is constant regardless of how many providers subscribe

Best For

Social feeds, broadcast announcements, public bulletin boards, any protocol with asymmetric write/read ratios or who: "anyone" read rules.


Strategy Comparison

Dimension               | Direct              | Relay                           | Subscribe
Participants            | 2-50                | 50-10,000+                      | Unbounded
Providers (P)           | 2-10                | 10-100+                         | Any
Origin load per message | O(P)                | O(k), k ~ log(P)                | O(1) per subscriber
Delivery latency        | 1 hop               | 2 hops max                      | Subscription latency
Coordination needed     | None                | Deterministic (rendezvous hash) | Subscription management
Works offline?          | Queue + retry + SMT | Queue + retry + SMT             | Cursor-based catch-up
Push/Pull               | Push                | Push (distributed)              | Pull (with push option)
Best for                | DMs, small groups   | Group chats, communities        | Feeds, broadcasts

Mixing Strategies Within a Protocol

Different paths in the same protocol can use different strategies:

{
  "protocol": "https://example.com/community",
  "structure": {
    "community": {
      "member": { "$role": true },
      "announcement": {
        "$delivery": "subscribe",
        "$actions": [{ "role": "community/member", "can": ["read"] }]
      },
      "channel": {
        "participant": { "$role": true },
        "message": {
          "$delivery": "relay",
          "$actions": [{ "role": "channel/participant", "can": ["create", "read"] }]
        }
      },
      "dm": {
        "$delivery": "direct",
        "$actions": [
          { "who": "author", "of": "community/dm", "can": ["read"] },
          { "who": "recipient", "of": "community/dm", "can": ["read", "create"] }
        ]
      }
    }
  }
}
  • Announcements: subscribe -- one admin writes, thousands of members read
  • Channel messages: relay -- hundreds of active participants, dozens of providers
  • DMs: direct -- two participants, two providers

The Universal Backstop: SMT Reconciliation

Regardless of delivery strategy, all providers periodically perform SMT reconciliation per protocol context:

Every 30-60s (configurable, with backoff when converged):
  For each (remoteProvider, protocol, contextId) tuple:
    Compare SMT roots
    If divergent: tree walk -> identify missing CIDs -> pull

This means:

  • If direct delivery fails -> SMT catches it within 30-60s
  • If a relay coordinator is down -> the overlap + SMT catches it
  • If a subscribe connection drops -> cursor-based reconnect + SMT catches it
  • If a provider doesn't support any delivery strategy -> agent-level pull + SMT still works

The delivery strategies are optimizations for latency. The SMT guarantees correctness.

Spec Language Considerations

All delivery behavior is provider-level optimization, not a protocol-level requirement:

  • Providers that support delivery SHOULD implement the strategy declared by $delivery
  • Providers that do not support delivery MAY rely on participant agents performing pull-based sync
  • Receiving providers MUST accept delivery of records that pass standard processMessage() authorization
  • The $delivery directive is advisory -- the protocol functions correctly without it, just with higher latency

The key spec-level additions would be:

  1. $delivery directive on ProtocolRuleSet -- "direct" | "relay" | "subscribe"
  2. Delivery manifest format -- how a provider communicates "deliver this message to these tenant DIDs" to a peer provider (a lightweight envelope around the original signed message)
  3. Relay assignment algorithm -- the rendezvous hashing specification so all providers independently compute the same relay topology
  4. Context establishment -- the "join handshake" that delivers root record + role record + context key to a new participant's DWN

Design Principles

  1. Self-certifying messages eliminate relay trust. Any node can relay a message because the recipient verifies the author's signature. Relays can't forge or tamper.

  2. Idempotent processing eliminates exactly-once complexity. Deliver as many times as you want. Duplicates return 409 and are ignored. Use redundant delivery paths aggressively.

  3. SMT provides convergence verification without consensus. A single 32-byte root hash proves whether two nodes have the same message set. No Paxos, no Raft, no vector clocks.

  4. Provider-grouping exploits the natural topology. Participants cluster by provider. Send one copy per provider, fan out locally. If average cluster size is 50 participants/provider, you send 98% fewer cross-network messages.

  5. Encryption is transparent to delivery. All routing metadata is cleartext. Encrypted payloads are opaque. Relays never decrypt.

Prior Art Considered

System      | Delivery Model                            | Lesson Taken
Matrix      | Flat push, one transaction per server     | Transaction batching, send-once-per-server grouping
Matrix      | DAG-based ordering + state resolution     | Deterministic convergence without central coordinator
ActivityPub | Origin pushes to all followers' inboxes   | Shared inbox optimization (= provider-grouped delivery)
ActivityPub | O(servers) origin load                    | Why pure push doesn't scale -- motivated the relay strategy
AT Protocol | Relay/firehose model                      | Decoupling origin cost from audience size -- motivated subscribe strategy
AT Protocol | Self-certifying data repos                | Validation that untrusted intermediaries work when data is signed
XMPP MIX    | PubSub nodes + MAM catch-up               | Persistent participation + cursor-based history -- already in DWN
Dynamo/Riak | Merkle tree anti-entropy                  | SMT reconciliation as universal consistency backstop
Plumtree    | Push-lazy-push multicast                  | Eager delivery + lazy SMT reconciliation = the same pattern
NATS        | Subject-based routing with server fan-out | Provider-grouped delivery mirrors NATS cluster routing

Open Questions

  1. Should $delivery be on the protocol definition or configurable per-context at runtime? A large community might want relay for a 10,000-person general channel but direct for a 5-person subgroup -- both using the same protocol path.

  2. How should the delivery manifest be transported? Options: a new JSON-RPC method (dwn.deliverMessages), an extension to the existing dwn.processMessage envelope, or a separate HTTP endpoint.

  3. Should providers advertise delivery capability in /info? This would let protocol authors and agents know which providers support which delivery strategies.

  4. How should relay coordinator failures be detected and recovered? The overlap provides immediate coverage, but should there be an explicit heartbeat or health-check protocol between coordinators?

  5. What is the right granularity for SMT reconciliation in multi-party contexts? Per-protocol? Per-context? Per-provider-pair? The per-protocol tree already exists in the StateIndex; per-context would require additional tree management.

RFC: Causal Scoped Replication for Multi-Master DWNs

Status

Draft - replaces the earlier dual-watermark-focused draft.

Tracks: enboxorg/enbox#761

Why this RFC exists

This project needs a sync architecture that is not just "cursor resume + polling fallback", but a complete replication model for:

  • multi-master operation
  • protocol and protocolPath scoped sync
  • strict dependency correctness
  • realtime first, anti-entropy backed
  • low-memory source servers

The central idea: treat DWN sync as causal operation replication with explicit scope and closure rules.

Dual watermarks still exist, but they are an implementation detail inside a larger causal replication design.

Executive Summary

  1. DWN sync is modeled as op-based eventual consistency over signed operations.
  2. Replication operates on scopes (global/protocol/path subsets), but every scope is expanded into an effective closure scope that includes dependencies.
  3. Realtime stream is primary; digest/SMT anti-entropy is repair.
  4. Each link tracks directional progression with replication checkpoints, not only a single cursor.
  5. Source servers remain mostly stateless in memory; consumers hold durable replication state.

Problem

Current behavior works, but key architecture gaps remain:

  1. Scope is mostly protocol-level; protocolPath subset replication is not first-class.
  2. No full formal closure contract for scoped sync.
  3. Cursor progression is too linear for causal blocking cases.
  4. Repair/gap handling is not a first-class, explicit protocol contract.
  5. Sync architecture framing is transport-centric, not causality-centric.

Design Principles

  1. Causality first: operations apply only when required predecessors are satisfied.
  2. Scope is intent, closure is correctness: requested scope is expanded to required dependencies.
  3. Realtime first, repair always available: stream for latency, anti-entropy for convergence.
  4. Consumer-owned progress: per-link replication state lives with the consumer.
  5. Bounded server memory: source servers do not own per-remote replay checkpoints.
  6. Hard-fail correctness: closure misses are repair triggers, not silently tolerated.

Terminology

  • Operation: a signed DWN message (+ optional data payload).
  • Link: one replication relation (tenant, remoteEndpoint, scopeId).
  • Intent Scope: user/app requested filter (global/protocol/path subset).
  • Effective Scope: intent scope plus all required dependency classes.
  • Closure: complete transitive dependency set required for correct apply.
  • Progress Token: source-log replay position descriptor.
  • Replication Checkpoint: durable progress state for a direction on a link.
  • Gap: requested progress token is no longer replayable from source.

Consistency Model

This RFC defines DWN replication as an op-based eventual consistency model with deterministic conflict and authorization checks:

  • transport: at-least-once
  • apply: idempotent
  • ordering: causal constraints before dependent apply
  • conflict semantics: existing DWN logic (newest write, delete rules, squash rules)
  • convergence: realtime stream + anti-entropy repair

This is CRDT-like at system level: convergence emerges from immutable operations, deterministic merge/apply rules, and replay safety.

The normative closure rules for scoped subset sync are defined in:

  • proposals/rfc-scoped-sync-closure.md

Replication Architecture

Three planes

  1. Operation Plane (realtime)
    • event stream delivery (subscribe/read) with cursors
  2. Digest Plane (repair)
    • set reconciliation (MessagesSync diff / future replacement)
  3. Dependency Plane (closure)
    • dependency discovery and transitive fetch for scoped correctness

Core components

  1. Replication Ledger (consumer-side durable store)
  2. Stream Replicator (realtime ingress/egress)
  3. Closure Resolver (dependency completion)
  4. Repair Engine (gap + anti-entropy repair)
  5. Flow Controller (bounded in-flight and backpressure)

Progress Token Contract (Normative)

This RFC standardizes a single resume/progression mechanism inspired by the pre-SMT messageCid + watermark model, generalized for distributed backends.

Token shape

type ProgressToken = {
  streamId: string;   // stable identity of the source stream domain
  epoch: string;      // stream generation/version, changes on non-compatible reset
  position: string;   // monotonic decimal string within (streamId, epoch)
  messageCid: string; // tie-break identity and replay idempotency anchor
};

Normative rules

  1. Progress comparisons are valid only when streamId and epoch are equal.
  2. position MUST be monotonic and strictly increasing for emitted events in a given (streamId, epoch).
  3. Consumers MUST store/transmit position as an opaque decimal string, and MUST compare positions numerically (BigInt-safe), never lexicographically.
  4. Resume requests MUST pass the exact previously observed token.
  5. Source MUST reject tokens from a different streamId or stale epoch with explicit gap metadata.
  6. Replication checkpoint progression uses token order only; it MUST NOT assume numeric adjacency (+1).
  7. Progress tokens are source-local resume artifacts: they are valid only for the specific remote/source stream that issued them and MUST NOT be compared or reused across different remote providers.
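
A minimal (non-normative) comparison helper illustrating rules 1, 3, and 5, using the ProgressToken shape above; error handling is simplified.

function compareProgress(a: ProgressToken, b: ProgressToken): number {
  if (a.streamId !== b.streamId || a.epoch !== b.epoch) {
    // Mixed domains are never ordered; callers should treat this as a gap signal.
    throw new Error('ProgressToken comparison across streamId/epoch is invalid');
  }
  // Positions are opaque decimal strings; compare numerically, never lexicographically.
  const pa = BigInt(a.position);
  const pb = BigInt(b.position);
  return pa < pb ? -1 : pa > pb ? 1 : 0;
}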

Why not opaque cursor-only

Opaque cursor strings hide ordering semantics needed by replication checkpoint logic. This RFC requires explicit monotonic ordering semantics while preserving backend portability through streamId and epoch boundaries.

Stream identity and epoch management

  1. streamId MUST remain stable across process restarts, horizontal scaling, and failover for the same logical source stream.
  2. epoch MUST change when continuity cannot be guaranteed (destructive trim reset, backend swap without deterministic replay continuity, stream recreation).
  3. If epoch changes, source MUST return ProgressGap for old tokens; silent remapping is forbidden.
  4. Backend implementations MAY internally use ULID, JetStream seq, SQL bigint, etc., but MUST emit ProgressToken at the interface boundary.
  5. streamId MUST NOT include process-specific identifiers (PID, hostname, port, container id).
  6. streamId SHOULD identify a stable tenant-partitioned logical stream domain, not a sync scope or individual process instance.
  7. epoch is equality-checked only and SHOULD be generated as UUID v4 when continuity must be broken.

Scope Model

type SyncScope = {
  kind: 'global' | 'protocol' | 'subset';
  protocol?: string;
  protocolPathPrefixes?: string[];
  contextIdPrefixes?: string[];
};

Rules:

  1. global has no filter constraints.
  2. protocol requires protocol.
  3. subset requires protocol and may include path/context prefixes.
  4. Prefix semantics are strict prefix matches (no globs in this RFC).
  5. Scope is canonicalized and hashed to scopeId.
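
A non-normative sketch of rule 5 (canonicalization and hashing), assuming sorted prefixes and SHA-256 over canonical JSON; the exact canonical form is an implementation choice, not mandated by this RFC.

import { createHash } from 'node:crypto';

function computeScopeId(scope: SyncScope): string {
  // Fixed key order plus sorted prefixes yields a deterministic canonical form.
  const canonical = {
    kind: scope.kind,
    protocol: scope.protocol ?? null,
    protocolPathPrefixes: [...(scope.protocolPathPrefixes ?? [])].sort(),
    contextIdPrefixes: [...(scope.contextIdPrefixes ?? [])].sort(),
  };
  return createHash('sha256').update(JSON.stringify(canonical)).digest('hex');
}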

Intent scope vs effective scope

Requested filters define intent scope. Replication executes effective scope:

effectiveScope = intentScope + requiredDependencies(intentScope)

This is non-optional for subset modes.

Dependency Closure Contract

This section is intentionally high-level. The normative dependency classes, traversal rules, stop conditions, and failure semantics are specified in:

  • proposals/rfc-scoped-sync-closure.md

Scoped replication MUST include transitive dependencies in these classes:

  1. Protocol metadata

    • ProtocolsConfigure lineage for involved protocol.
  2. Authorization lineage

    • grants/revocations and referenced permission chain required to validate replay.
  3. Record ancestry

    • initial write, parent chain, context chain required for apply validity.
  4. State semantics

    • deletes/tombstones/squash floors that affect resulting visible state.
  5. Encryption dependencies

    • key-delivery/context-key records required for decryptable state.

Hard-fail policy

If closure cannot be satisfied for an operation at a given progress token:

  1. operation is not considered committed for progression
  2. link transitions to repairing
  3. diagnostics are emitted with dependency class and missing identifiers

No best-effort partial closure mode is allowed in this RFC.

Link State and Replication Checkpoints

Directional state per link is persisted consumer-side:

type DirectionCheckpoint = {
  // stream progression
  receivedToken?: ProgressToken;          // latest received from source
  contiguousAppliedToken?: ProgressToken; // highest durably applied replay-safe token

  // sparse progression when out-of-order closure is resolving
  pendingTokens?: ProgressToken[]; // ordered ascending by numeric position comparison
};

type ReplicationLinkState = {
  tenantDid: string;
  remoteEndpoint: string;
  scopeId: string;
  scope: SyncScope;

  pull: DirectionCheckpoint;
  push: DirectionCheckpoint;

  state: 'initial' | 'live' | 'degraded_poll' | 'repairing' | 'paused';
  failureCount: number;
  lastSuccessAt?: string;
  lastFailureAt?: string;
};

Why this matters:

  • A single scalar cursor is not enough when some events are received but blocked on missing dependencies.
  • contiguousAppliedToken is the true resume watermark.
  • receivedToken and pending sets preserve observability and catch-up control.

Checkpoint invariants:

  1. contiguousAppliedToken is the only legal resume watermark.
  2. receivedToken MAY be ahead of contiguousAppliedToken while closure dependencies are pending.
  3. Token comparisons MUST reject mixed (streamId, epoch).
  4. Checkpoint commits MUST be atomic and monotonic.
  5. pendingTokens MUST NOT exceed 100 entries; on overflow, the link MUST transition to repairing.
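
A small guard illustrating invariants 1 through 4, building on the compareProgress sketch from the Progress Token section; store.put stands in for an atomic, durable write and is an assumption, not an existing API.

async function commitPullCheckpoint(
  link: ReplicationLinkState,
  candidate: ProgressToken,
  store: { put(key: string, value: ReplicationLinkState): Promise<void> }
): Promise<void> {
  const current = link.pull.contiguousAppliedToken;
  // compareProgress throws on mixed (streamId, epoch); skip non-monotonic movement.
  if (current && compareProgress(candidate, current) <= 0) return;
  link.pull.contiguousAppliedToken = candidate;
  await store.put(`${link.tenantDid}|${link.remoteEndpoint}|${link.scopeId}|pull`, link);
}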

Realtime Pull Algorithm (Remote -> Local)

  1. Start stream from pull.contiguousAppliedToken.
  2. For each event:
    • validate token domain (streamId, epoch)
    • mark pull.receivedToken
    • check closure prerequisites
    • if ready: apply idempotently
    • if blocked: queue in pending set and trigger dependency fetch
  3. After each apply, advance contiguous checkpoint when earlier pending dependencies are resolved.
  4. Persist checkpoint updates durably in monotonic fashion.
  5. On transport drop, reconnect from contiguousAppliedToken.
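
A condensed per-event sketch of this pull path, using the ProgressToken and checkpoint shapes defined above. Helper names (closureSatisfied, applyIdempotently, fetchDependencies, persistCheckpoint) are placeholders, and the contiguous-advance logic is deliberately simplified.

type PullDeps = {
  closureSatisfied(message: unknown): Promise<boolean>;
  applyIdempotently(message: unknown): Promise<void>;
  fetchDependencies(message: unknown): Promise<void>;
  persistCheckpoint(link: ReplicationLinkState): Promise<void>;
};

async function onPullEvent(
  link: ReplicationLinkState,
  event: { token: ProgressToken; message: unknown },
  deps: PullDeps
): Promise<void> {
  link.pull.receivedToken = event.token;

  if (await deps.closureSatisfied(event.message)) {
    await deps.applyIdempotently(event.message);
    // Simplified: advance the contiguous watermark only when nothing earlier
    // is still blocked; a full implementation drains resolved pending tokens.
    if ((link.pull.pendingTokens ?? []).length === 0) {
      link.pull.contiguousAppliedToken = event.token;
    }
  } else {
    // Blocked on a missing dependency: park the token and fetch the closure.
    (link.pull.pendingTokens ??= []).push(event.token);
    await deps.fetchDependencies(event.message);
  }

  await deps.persistCheckpoint(link); // durable, monotonic commit
}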

Realtime Push Algorithm (Local -> Remote)

  1. Read local stream from push.contiguousAppliedToken.
  2. Build dependency-safe batches (existing topological ordering primitives can seed this).
  3. Send to remote, interpreting responses:
    • success or idempotent-equivalent conflict advances apply eligibility
    • retryable errors remain pending
    • hard errors move link toward repair/paused according to policy
  4. Advance push.contiguousAppliedToken only when contiguous acked progression holds.

Gap and Repair Contract

Progress gap reply

Source MUST return explicit gap metadata:

{
  "status": { "code": 410, "detail": "Progress token gap" },
  "error": {
    "code": "ProgressGap",
    "requested": {
      "streamId": "...",
      "epoch": "...",
      "position": "...",
      "messageCid": "..."
    },
    "oldestAvailable": {
      "streamId": "...",
      "epoch": "...",
      "position": "...",
      "messageCid": "..."
    },
    "latestAvailable": {
      "streamId": "...",
      "epoch": "...",
      "position": "...",
      "messageCid": "..."
    },
    "reason": "token_too_old | epoch_mismatch | stream_mismatch"
  }
}

Repair flow

  1. Link enters repairing.
  2. Run scope-aware digest reconciliation.
  3. Resolve and apply missing operations with closure enforcement.
  4. Recompute contiguous checkpoint.
  5. Re-enter live.

Consumers MUST NOT jump to latestAvailable without repair.

rpc.ack flow-control acknowledgments and durable checkpoint commits are distinct:

  1. rpc.ack advances source-side transport windows only.
  2. Checkpoint commits happen only after successful local apply/closure checks.

Polling Fallback

Polling is fallback, not primary:

  1. Enter degraded_poll after realtime reconnect exhaustion.
  2. Run short-interval anti-entropy loops (e.g. 5 to 15 seconds with jitter).
  3. Keep trying to restore realtime stream.
  4. Return to live only after successful catch-up and healthy stream.

Transport and Memory Model

Source server requirements:

  1. Keep only connection-local flow state in memory.
  2. Do not maintain per-remote durable checkpoints.
  3. Replay authority remains event log backend.
  4. Keep in-flight windows bounded.

Consumer requirements:

  1. Persist replication ledger.
  2. Bound pending operation queues per link.
  3. Persist checkpoint commits atomically.

Protocol and Wire Evolution

Required filter evolution

Current message filters are too narrow (protocol but not path/context for Messages*).

Evolve request descriptors to support:

type ScopeFilter = {
  protocol?: string;
  protocolPathPrefixes?: string[];
  contextIdPrefixes?: string[];
};

Applies to realtime and repair requests.

Lifecycle and diagnostics events

Expose state transitions and progress-token/checkpoint metadata through API/client layers:

  • connected
  • resubscribed
  • degraded_poll
  • repairing
  • live
  • paused

with scopeId, direction, and checkpoint token details.

Security and Correctness

  1. Scoped sync MUST never bypass protocol authorization checks.
  2. Closure dependency fetches must honor existing grant constraints.
  3. Progress commits must be tamper-resistant and monotonic within (streamId, epoch).
  4. Duplicate delivery is acceptable; divergence is not.
  5. Echo-loop suppression should be explicit in push logic (origin tagging or equivalent).

Migration Strategy

No backward compatibility requirement is imposed.

Migration plan:

  1. Introduce new replication ledger keys and schema.
  2. Migrate existing single-cursor values into contiguousAppliedToken where possible.
  3. Enable new engine behind syncV2 flag.
  4. Run shadow metrics mode before active progression.
  5. Remove legacy-only paths after validation.

Implementation Mapping

@enbox/agent

  • packages/agent/src/sync-engine-level.ts
    • replace mode internals with link state machine + checkpoint model
  • packages/agent/src/sync-messages.ts
    • enforce closure-aware apply/ack path
  • packages/agent/src/sync-topological-sort.ts
    • extend dependency graph sources beyond current minimal set
  • packages/agent/src/types/sync.ts
    • move from protocol list to scope list

@enbox/dwn-sdk-js

  • packages/dwn-sdk-js/src/types/messages-types.ts
    • extend MessagesFilter and descriptor scope fields
  • packages/dwn-sdk-js/src/interfaces/messages-subscribe.ts
  • packages/dwn-sdk-js/src/interfaces/messages-sync.ts
  • packages/dwn-sdk-js/src/handlers/messages-subscribe.ts
  • packages/dwn-sdk-js/src/handlers/messages-sync.ts
    • scope-aware filtering + explicit gap replies + closure-aware diff behavior

@enbox/dwn-server

  • packages/dwn-server/src/connection/flow-controller.ts
  • packages/dwn-server/src/connection/socket-connection.ts
  • packages/dwn-server/src/json-rpc-handlers/subscription/*
    • propagate gap metadata and keep bounded memory under scoped replay

@enbox/dwn-clients

  • packages/dwn-clients/src/web-socket-clients.ts
  • packages/dwn-clients/src/dwn-rpc-types.ts
    • endpoint keying correctness + lifecycle/checkpoint diagnostics

@enbox/api

  • packages/api/src/live-query.ts
  • packages/api/src/dwn-api.ts
    • expose scope and checkpoint lifecycle to app callers

Phased Plan

Phase 0 - contracts and schemas

  1. finalize scope filter schema
  2. finalize checkpoint/ledger schema
  3. finalize progress-token gap/error contract
  4. finalize closure dependency matrix

Phase 1 - replication checkpoint core

  1. implement link ledger and directional checkpoints
  2. wire pull/push progression to contiguous checkpoint semantics
  3. add monotonic atomic commit helpers

Phase 2 - gap and repair engine

  1. implement explicit gap transitions
  2. implement repair loops with digest + targeted fetch
  3. add degraded polling policy

Phase 3 - scoped subset + strict closure

  1. add protocolPath/context prefix filters to wire contracts
  2. implement closure resolver
  3. enforce hard-fail closure policy

Phase 4 - observability and hardening

  1. lifecycle and checkpoint telemetry across sdk/clients/api
  2. memory and backpressure soak tests
  3. fault injection convergence tests

Testing Requirements

Unit

  1. scope canonicalization and scopeId determinism
  2. checkpoint monotonicity under concurrent updates
  3. closure traversal correctness and cycle handling
  4. progress-gap state transition behavior

Integration

  1. bidirectional live sync with intermittent disconnects
  2. restart-resume from contiguous checkpoint token
  3. subset scope replication with full closure correctness
  4. gap -> repair -> live transitions

E2E and performance

  1. burst write workloads with mixed payload sizes
  2. long-haul replication with packet loss and retries
  3. bounded source memory under many active links
  4. eventual convergence assertions after partitions

Observability Requirements

Metrics:

  1. sync_link_state{tenant,remote,scope}
  2. sync_checkpoint_pull_received
  3. sync_checkpoint_pull_contiguous
  4. sync_checkpoint_push_received
  5. sync_checkpoint_push_contiguous
  6. sync_gap_repairs_total
  7. sync_closure_missing_total{class}
  8. sync_apply_latency_ms
  9. sync_inflight_events

Logs:

  • tenantDid, remoteEndpoint, scopeId, direction, stateFrom, stateTo, token, contiguousToken, errorCode.

Resolved Decisions

  1. Scope matching uses strict prefix semantics (no globs).
  2. Closure completeness is mandatory hard-fail.
  3. Realtime is primary; polling is degraded fallback.
  4. Source server remains checkpoint-stateless per remote.
  5. Compatibility with pre-release sync contracts is not required.
  6. Progression uses ProgressToken (streamId, epoch, position, messageCid) as the single canonical mechanism.
  7. Transport and DWN message descriptors keep the field name cursor, but its value type is ProgressToken (object), not string.
  8. ProgressToken.position comparisons are numeric (BigInt-safe), never lexicographic.
  9. Resume gaps use 410 + ProgressGap (not 409).
  10. rpc.ack is flow-control only; durable replication checkpoint commits are separate and happen after successful local apply.
  11. pendingTokens is ordered by numeric token position and capped at 100 entries; overflow transitions link to repairing.
  12. streamId must be stable across restart/scale/failover and must not include process-specific identifiers.
  13. epoch is equality-checked only and SHOULD be UUID v4; generate a new epoch when replay continuity cannot be guaranteed.
  14. MessagesSync action: 'diff' must be formalized in spec before shipping the new engine.
  15. Progress tokens are source-local and must never be reused across different providers/remotes.
  16. Scoped subset sync closure rules are governed by rfc-scoped-sync-closure.md.

Implementation Gates

Gate A (Required before Phase 1 coding)

  1. Propagate ProgressToken object typing across dwn-sdk-js, agent, dwn-clients, and dwn-server surfaces that currently use cursor: string.
  2. Implement ProgressGap handling end-to-end (410, requested, oldestAvailable, latestAvailable, reason).
  3. Add EventLog replay bounds capability (oldest/latest) per tenant stream for deterministic gap metadata.
  4. Separate and document flow-control ack (rpc.ack) from durable checkpoint persistence in implementation code paths.
  5. Formalize MessagesSync action: 'diff' in spec and align current implementation behavior.

Gate B (Required before Phase 3 scoped subset coding)

  1. Extend MessagesFilter and related descriptors with subset scope fields (protocolPathPrefix, contextIdPrefix) and schema/docs updates.
  2. Apply rfc-scoped-sync-closure.md in implementation, including:
    • squash floors
    • implicit purgeOldest deletion visibility
    • cross-protocol $ref composition dependencies
    • grant/revocation lineage ordering semantics
  3. Define subset-scope closure validation checkpoints and explicit failure codes.

Gate C (Required before GA)

  1. Define retention SLA profiles and /info capability advertisement for replay windows.
  2. Complete soak tests for bounded memory under high fan-in/out and slow consumers.
  3. Complete convergence fault-injection suite (disconnects, duplicates, replays, gap repair).

Deferred (Non-Blocking for Phase 1)

  1. Best structure for dependency hints in stream envelopes (explicit list vs derivation on read).
  2. Exact batching thresholds for payload fetch and checkpoint commits.
  3. Final observability schema and dashboard conventions.

Acceptance Criteria

  1. Replication uses checkpoint progression in both directions.
  2. Subset scopes replicate with strict closure completeness.
  3. Gap handling is explicit, deterministic, and observable.
  4. Realtime-first path is stable with automatic degraded fallback.
  5. Source memory remains bounded under sustained load.
  6. Convergence suite passes under injected faults and partitions.

RFC: Strict Closure Rules for Scoped Sync

Status

Draft addendum to rfc-causal-scoped-replication.md.

Tracks: enboxorg/enbox#761

Why this RFC exists

rfc-causal-scoped-replication.md establishes that protocolPath/context-scoped replication requires strict closure completeness, but it intentionally stops short of defining the exact closure algorithm.

This addendum defines the dependency classes, traversal rules, stop conditions, and failure semantics required before Phase 3 scoped subset sync can be implemented safely.

Without these rules, a subset replica can appear healthy while still being unable to:

  • authorize writes
  • decrypt data
  • compute visible state correctly
  • enforce squash / purge semantics
  • converge under multi-master replay

Scope

This RFC covers closure requirements for scoped subset sync only:

  • protocol
  • protocolPathPrefix[]
  • contextIdPrefix[]

It does not redefine the realtime transport, progress token, or repair-state-machine contracts from the main sync RFC.

Definitions

  • Intent-matching operation: an operation whose protocol/path/context matches the user-requested sync scope.
  • Closure operation: an operation outside the intent-matching scope that must still replicate so intent-matching operations remain correct.
  • Closure root: the original intent-matching operation being evaluated.
  • Closure graph: the directed graph of dependency edges explored for a closure root.
  • Closure-complete: every hard dependency edge for the replicated operation set is either locally present or included in the replicated set.
  • Visibility floor: an operation that suppresses older state from being valid/visible (for example squash or purgeOldest).

Non-negotiable rule

Scoped subset replication MUST replicate a closed operation set, not merely a set of leaf records that match the requested scope.

If closure cannot be proven, the replica is not correct.

Closure model

For every intent-matching operation m, construct:

closure(m) = m + transitiveHardDependencies(m)

For a scoped batch S, the required replicated set is:

closure(S) = union(closure(m)) for all intent-matching m in S

Dependency classes

The closure resolver MUST evaluate the following dependency classes.

1. Protocol metadata closure

Include:

  • the relevant ProtocolsConfigure for the protocol
  • any composed / referenced protocol definitions needed to interpret the rule set
  • any protocol-level metadata required to determine $actions, $squash, $recordLimit, encryption rules, or refs

Rationale:

  • protocolPath subset sync is meaningless if the receiver cannot interpret the path's rules
  • cross-protocol composition means a path can depend on metadata from another protocol definition

2. Record ancestry closure

For each RecordsWrite / RecordsDelete / record-scoped event, include:

  • initialWrite
  • parent chain referenced by parentId
  • context ancestry implied by contextId
  • any newest-state predecessors required by existing DWN conflict rules

Rules:

  1. A non-initial write MUST include its initialWrite.
  2. A child record MUST include every ancestor record needed to validate its protocolPath placement.
  3. If a delete acts on a record lineage, the lineage root and relevant newest visible predecessor MUST be available.

3. Authorization closure

Include every operation needed to validate authorization of the closure root:

  • permission grant record referenced by permissionGrantId
  • grant lineage / parent grant where applicable
  • permission revocation records that affect the grant
  • role records or actor-binding records required by protocol rules
  • ancestor records needed by $actions rules with of, who, or role-based checks

Rules:

  1. If an operation uses delegated authorization, the corresponding grant MUST be replicated.
  2. If a grant has been revoked, the relevant revocation MUST be replicated.
  3. Grant validity is evaluated as of the closure root's causal point, not based solely on current latest state.

4. Visibility and state-floor closure

Include every operation needed to determine whether an intent-matching operation is still valid or visible:

  • tombstone / delete operations affecting the record lineage
  • latest squash record at the same protocolPath and parent-context scope
  • operations needed to interpret purgeOldest-style record-limit pruning, if and when that strategy is implemented
  • any visibility-floor metadata required by the protocol definition

Rules:

  1. A squash record is a hard dependency for all records whose validity depends on that squash floor.
  2. If a protocol path uses record limits with implicit purging, and that strategy is implemented by the runtime, the closure resolver MUST include the operations needed to explain why older records are absent.
  3. A replica MUST NOT present older records as valid if they are logically hidden by squash floors. The same rule applies to purge floors if and when purge-based record-limit strategies are implemented.

Current implementation note:

  • squash is implemented today and is part of the active closure contract
  • purgeOldest / implicit record-limit purging is not implemented today and remains a forward-looking extension to this closure class

5. Encryption closure

Include every operation needed for decryptability of scoped data:

  • key-delivery records
  • context-key records
  • protocol encryption metadata
  • ancestor/context records needed to resolve encryption scope

Rules:

  1. If a record is encrypted and the replica is expected to read it, the required key material MUST be in closure.
  2. If a record is intentionally unreadable to this replica, the closure resolver MAY omit decryptability dependencies only when the scoped-sync contract explicitly permits opaque delivery for that scope. This RFC assumes the default is readable-sync, so omission is NOT allowed.

6. Cross-protocol composition closure

When a protocol definition uses $ref / composition features:

  • include the referenced protocol definitions
  • include role/grant records from referenced protocols when they participate in authorization for the closure root
  • include ancestor paths needed to interpret refs transitively

Rules:

  1. Cross-protocol references are hard dependencies, not optional enrichments.
  2. If a composed protocol cannot be resolved, closure is incomplete.

Hard vs soft dependencies

This RFC defines hard dependencies only.

Hard dependency:

  • required for authorization, decryptability, visibility, or causal correctness

Soft dependency:

  • useful for optimization, UX, or indexing, but not required to make the replicated set correct

Scoped sync MUST fail on missing hard dependencies. Scoped sync MAY ignore soft dependencies.

Closure traversal rules

The resolver MUST use deterministic traversal.

Inputs

  • tenant DID
  • remote/source endpoint
  • sync scope
  • candidate operation set from live stream or repair diff

Algorithm

For each closure root:

  1. Initialize queue with the closure root.
  2. Pop next operation.
  3. Extract dependency edges for every dependency class in this RFC.
  4. For each referenced dependency:
    • if already satisfied locally and still valid for this closure evaluation, mark satisfied
    • else add to fetch queue / closure graph
  5. Continue until no unsatisfied hard dependencies remain.
  6. Validate closure-complete before:
    • advancing replication checkpoint
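
A non-normative visited-set traversal sketch of this algorithm; extractHardDependencies and isSatisfiedLocally stand in for the dependency-class rules defined in this RFC.

interface Operation {
  messageCid: string;   // stable identity used for dedupe and cycle detection
}

interface ClosureDeps {
  extractHardDependencies(op: Operation): Promise<Operation[]>;
  isSatisfiedLocally(op: Operation): Promise<boolean>;
}

// Returns the missing hard dependencies that must be fetched before the
// closure root can be considered closure-complete.
async function resolveClosure(root: Operation, deps: ClosureDeps): Promise<Operation[]> {
  const visited = new Set<string>([root.messageCid]);   // defensive cycle guard
  const toFetch: Operation[] = [];
  const queue: Operation[] = [root];

  while (queue.length > 0) {
    const op = queue.shift()!;
    for (const dep of await deps.extractHardDependencies(op)) {
      if (visited.has(dep.messageCid)) continue;         // dedupe by stable identity
      visited.add(dep.messageCid);
      if (await deps.isSatisfiedLocally(dep)) continue;  // already present and valid
      toFetch.push(dep);                                 // must be replicated
      queue.push(dep);                                   // and its own hard deps checked
    }
  }
  return toFetch;
}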

Apply timing model:

  1. Operations MAY be applied to local state before closure validation completes.
  2. The replication checkpoint MUST NOT advance until closure is validated.
  3. If closure fails after local apply, the link MUST transition to repairing.
  4. This is safe because replicated apply is idempotent and repair re-evaluates the operation set.

Determinism requirements

  1. Dependency extraction order MUST be deterministic.
  2. Duplicate nodes MUST be deduped by stable operation identity (messageCid / equivalent).
  3. Cycles MUST be detected and terminated deterministically.
  4. Closure results MUST be independent of traversal order.

Cycle note:

In well-formed DWN data, true semantic cycles should be rare or structurally impossible in most dependency classes:

  • record ancestry is acyclic by construction
  • grant lineage should form a DAG
  • protocol composition should reject invalid circular references

Cycle handling is therefore a defensive requirement. A visited-set keyed by stable operation identity is sufficient.

Stop conditions and failure semantics

Closure resolution fails when any hard dependency is:

  • missing and not fetchable
  • forbidden by auth rules for the syncing principal
  • unresolved due to protocol composition ambiguity
  • inconsistent with already replicated causal state

When closure fails:

  1. the closure root MUST NOT advance the replication checkpoint
  2. the link MUST transition to repairing
  3. diagnostics MUST identify:
    • closure root
    • dependency class
    • missing/failed dependency identity
    • failure code

Suggested failure codes:

  • ClosureProtocolMetadataMissing
  • ClosureInitialWriteMissing
  • ClosureParentChainMissing
  • ClosureContextChainMissing
  • ClosureGrantMissing
  • ClosureGrantRevocationMissing
  • ClosureVisibilityFloorMissing
  • ClosureEncryptionDependencyMissing
  • ClosureCrossProtocolReferenceMissing
  • ClosureDependencyForbidden

Squash rules

Squash is not a normal delete.

For a path with $squash: true:

  1. The most recent squash record at the same protocolPath and parent-context scope is a visibility floor.
  2. Any older record whose validity depends on being newer than that floor MUST NOT be treated as visible.
  3. Scoped subset sync MUST include the relevant squash record when syncing records under that floor.
  4. If a squash record is outside the requested path filter but still governs the requested records, it is a hard closure dependency.
  5. Squash is not merely visibility suppression in the current runtime: applying a squash causes older sibling records in scope to be fully purged.
  6. Therefore, when a subset consumer applies a squash record, it MUST locally perform the same deterministic purge behavior as the source for records at the same protocolPath and parent-context scope whose newest state predates the squash timestamp.
  7. This local purge is a side effect of applying the squash record; it is not modeled as a separate replicated delete event.

Implementation alignment note:

The source runtime currently uses performRecordsSquash() / purgeRecordMessages() semantics rather than emitting tombstones for purged records. Scoped sync consumers must mirror that behavior to converge.
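
A consumer-side sketch of that mirrored purge behavior (rules 5 through 7 above). listSiblingRecords and purgeLocalRecords are placeholders, not existing APIs; the point is the deterministic filter on records whose newest state predates the squash timestamp, assuming ISO-8601 timestamps so string comparison matches chronological order.

interface LocalRecordState {
  recordId: string;
  newestTimestamp: string;   // ISO-8601, so string order matches time order
}

interface SquashStore {
  listSiblingRecords(protocolPath: string, parentContextId: string): Promise<LocalRecordState[]>;
  purgeLocalRecords(recordIds: string[]): Promise<void>;
}

async function applySquashLocally(
  squash: { protocolPath: string; parentContextId: string; messageTimestamp: string },
  store: SquashStore
): Promise<void> {
  const siblings = await store.listSiblingRecords(squash.protocolPath, squash.parentContextId);
  // Mirror the source's deterministic purge: drop siblings whose newest state
  // predates the squash timestamp. This is a local side effect of applying the
  // squash record, not a separate replicated delete event.
  const toPurge = siblings
    .filter((record) => record.newestTimestamp < squash.messageTimestamp)
    .map((record) => record.recordId);
  if (toPurge.length > 0) {
    await store.purgeLocalRecords(toPurge);
  }
}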

purgeOldest / record-limit rules

If a protocol path uses record limits or implicit purging:

  1. The replica MUST be able to explain why an older record is absent.
  2. If the newest visible state depends on older records having been purged, the operations establishing that visibility floor are hard dependencies.
  3. Scoped sync MUST NOT silently treat omitted older records as “never existed” if they were logically superseded by a purge policy.

This means subset sync must carry enough state-floor metadata to distinguish:

  • never replicated
  • deleted
  • squashed
  • purged by retention / record-limit semantics

Current implementation note:

purgeOldest is not currently enforced by the runtime. The rules in this section are therefore normative only for future record-limit purge strategies. Phase 3 subset sync may ship with squash closure first, with purge-based closure activated when the runtime implements it.

Grant and revocation ordering rules

Authorization closure is causal, not merely latest-state lookup.

Rules:

  1. If operation m is authorized by grant g, then g is a hard dependency of m.
  2. If revocation r affects g, then r is a hard dependency of any closure evaluation where r changes validity of g.
  3. The resolver MUST preserve enough ordering information to know whether m was valid at its own commit point.
  4. A later revocation MUST NOT retroactively invalidate the historical existence of m, but it MAY affect later dependent operations and visible state.

Implementation note:

The closure resolver may need both:

  • the authorizing grant record, and
  • the latest relevant revocation state,

plus causal ordering information already carried by message timestamps / stream positions.

Checkpoint rules for scoped closure

The replication checkpoint MAY advance only when all closure roots up to that delivery point are closure-complete.

This means:

  1. receiving an operation is not enough
  2. parsing an operation is not enough
  3. even applying a subset of its fields is not enough

The checkpoint is advanced only after:

  • operation apply succeeds
  • closure validation succeeds
  • no unresolved hard dependencies remain for earlier delivered operations in the same link order

This rule composes with the existing delivery-ordinal checkpoint implementation from Phase 1/2:

  • delivery order is tracked by per-link ordinals
  • closure validation becomes an additional gate on when an ordinal is considered committed
  • this RFC does not replace ordinal-based checkpointing; it extends the commit condition

Performance requirements

Closure correctness is mandatory, but the resolver must also be viable in multi-tenant deployments.

Normative performance requirements:

  1. ProtocolsConfigure lookups MUST be cached per (tenant, protocol) within an evaluation pass, and SHOULD be cached across passes with version-aware invalidation.
  2. Grant lookups SHOULD be cached per (tenant, grantId) within an evaluation pass, with invalidation on revocation or grant updates.
  3. Batch closure evaluation MUST share dependency lookups across all closure roots in the same batch.
  4. Full-tenant scope (kind: 'global') bypasses scoped closure evaluation entirely; closure is trivially complete because no subset omission is being performed.
  5. Closure traversal depth MUST NOT exceed 32 dependency hops by default. Implementations MAY make this configurable, but exceeding the limit MUST fail deterministically with ClosureDepthExceeded.
  6. Closure evaluation MUST bound per-root in-memory graph state; implementations SHOULD dedupe by stable operation identity and reuse graph nodes across a batch.

These requirements do not mandate a specific cache implementation, only the observable behavior and bounds.

Repair interaction

Repair mode MUST use the same closure rules as live replication.

Rules:

  1. MessagesSync diff / targeted fetch is only the candidate set generator.
  2. Closure resolution still runs before repaired operations are considered complete.
  3. Repair that converges the digest but not the closure graph is NOT successful repair.

What this RFC explicitly does not define

Still deferred beyond this addendum:

  • exact wire/schema shape for subset filters in MessagesSubscribe / MessagesSync
  • transport encoding of closure hints in stream envelopes
  • performance optimizations (prefetch batching, dependency hint indexes, bloom filters, etc.)

Those can be implementation details as long as they preserve the normative closure rules here.

Decisions locked by this addendum

  1. Scoped subset sync is defined over closure-complete operation sets, not raw path-matching records.
  2. Missing hard dependencies are repair-triggering failures, never best-effort omissions.
  3. Squash floors are hard dependencies when they govern visible state.
  4. purgeOldest / record-limit visibility floors are hard dependencies when they explain absence/visibility.
  5. Cross-protocol $ref composition dependencies are hard dependencies.
  6. Grant and revocation lineage is causal and must be evaluated at the closure root's commit point.
  7. Replication checkpoint advancement is blocked until closure is complete.

Gate B completion checklist

Phase 3 subset sync MUST NOT begin until all of the following are complete:

  • subset scope filters are formalized in types/specs (protocolPathPrefix, contextIdPrefix)
  • closure resolver implements every hard dependency class in this RFC
  • closure failure codes are typed and surfaced through repair diagnostics
  • squash floor handling is covered by tests
  • purgeOldest / record-limit visibility handling is covered by tests if and when purge-based record-limit strategies are implemented
  • cross-protocol $ref closure is covered by tests
  • grant/revocation ordering closure is covered by tests

Sync Implementation Final Review (Single Source)

This file is the final pre-implementation review checklist and points to the authoritative decision source.

Authority Order

  1. proposals/rfc-causal-scoped-replication.md (normative architecture + gates)
  2. proposals/rfc-scoped-sync-closure.md (normative scoped-subset closure rules)
  3. dwn-spec/spec/spec.md (DWN semantics)
  4. dwn-transport-spec/spec/spec.md (wire and transport semantics)
  5. dwn-spec-delivery/spec/spec.md (delivery-aligned DWN variant)

Issue tracker:

Decisions Locked

The following are decided and must be implemented as-is:

  1. ProgressToken is canonical: { streamId, epoch, position, messageCid }.
  2. Field name stays cursor on wire/descriptors, but value is ProgressToken object.
  3. Position ordering is numeric (BigInt-safe), never lexicographic.
  4. Gap signaling is 410 + ProgressGap metadata.
  5. rpc.ack is flow-control only, not durable sync progression.
  6. pendingTokens max is 100; overflow forces repairing.
  7. streamId is stable across restart/scale/failover and excludes process identifiers.
  8. epoch is equality-checked and SHOULD be UUID v4.
  9. MessagesSync diff must be spec-formalized before shipping new engine.
  10. Progress tokens are source-local to a remote/provider stream and must never be reused across different remotes.
  11. Scoped subset sync is blocked until closure sub-RFC/gate completion.
  12. Scoped subset closure semantics are defined in proposals/rfc-scoped-sync-closure.md.

Go/No-Go Checklist

Before Phase 1 coding starts:

  • Token shape propagated in dwn-sdk-js/agent/dwn-clients/dwn-server.
  • Progress gap contract implemented end-to-end.
  • EventLog replay bounds available for gap metadata.
  • Ack vs checkpoint persistence split implemented and tested.
  • MessagesSync diff behavior and spec text aligned.

Before Phase 3 coding starts:

  • Scope filter extensions (protocolPathPrefix, contextIdPrefix) merged.
  • proposals/rfc-scoped-sync-closure.md approved and implemented (squash, purgeOldest, cross-protocol $ref, grants).
  • Closure failure behavior and codes implemented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment