jacksonjp0311-gif created this gist Apr 29, 2026.
    % ████████████████████████████████████████████████████████████████████████████████
    %
    % CODEX ΔΦ — EPISODIC INVARIANT MEMORY THEORY (EIMT v1.4)
    % ────────────────────────────────────────────────────────────────────────────
    % MINIMAL REFERENCE KERNEL, CONCRETE BENCHMARK TASK SUITE, IMPLEMENTATION
    % CONTRACT, BASELINE HARNESS, AND RUNTIME EVIDENCE HARDENING LAYER FOR
    % EXECUTION-BACKED TESTING OF DRIFT-GATED EPISODIC MEMORY UNDER CITA
    % GOVERNANCE WITHOUT CLINICAL, BIOLOGICAL, AI-EQUIVALENCE, OR
    % UNIVERSAL-MECHANISM OVERCLAIM
    %
    % VERSION
    % ───────
    % v1.4 — Minimal Reference Kernel and Benchmark Task Suite Layer · Locked ·
    % Runnable Episode Class, Drift-Gated Retrieval Contract,
    % Source-Fallback Policy, Concrete Task Schemas, Baseline Harness,
    % Evidence Compiler, and Downgrade-Preserving Runtime Audit
    %
    % AUTHOR
    % ──────
    % James Paul Jackson
    % X / Twitter: @unifiedenergy11
    %
    % SOURCE EXTRACTION / AUTHOR ATTRIBUTION
    % ──────────────────────────────────────
    % This document is a Codex-format canonical evolution derived from:
    %
    % • EIMT v1.0 — Episodic Invariant Memory Theory, which formalized episodic
    % state space, context binding, cue-dependent reconstruction, replay,
    % consolidation, event-boundary gating, temporal context, constructive
    % simulation, drift monitoring, invariant fingerprints, and ledger
    % continuity.
    %
    % • EIMT v1.1 — Evidence-Mapped, Drift-Gated, Replay-Calibrated, and
    % Agent-Ready Layer, which added CITA governance, source-fidelity
    % stratification, domain-instantiation grammar, evidence mapping,
    % validation / falsification surfaces, negative controls, classification,
    % drift-gated retrieval, replay calibration, constructive simulation guard,
    % agent-memory governance, evidence packages, and repository grammar.
    %
    % • EIMT v1.2 — Metric Precision, Worked Examples, Evidence-Strength Mapping,
    % Quick-Start Implementation, Scalable Fingerprinting, and Agent Benchmark
    % Layer, which added metric manifests, normalized multi-scale drift,
    % distance-function selection, hallucination-as-drift-failure framing,
    % worked examples, benchmark metrics, and scalable fingerprinting.
    %
    % • EIMT v1.3 — Reference Implementation and Benchmark Evidence Layer, which
    % added runnable kernel boundaries, benchmark task grammar, baseline-family
    % comparison, runtime logs, result ledgers, evidence-package compilation,
    % reproducibility manifests, and downgrade-preserving runtime classification.
    %
    % • CITA v1.0 — Canonical Insight Transmutation Algorithm, which requires
    % source boundaries, fidelity stratification, primitive objects,
    % observables, validation, falsification, negative controls, downgrade
    % paths, evidence packages, repository anchoring, and memory-promotion
    % discipline.
    %
    % • Codex ΔΦ memory lessons including:
    % - coherence is not proof,
    % - memory alignment is not truth,
    % - not proven does not mean worthless,
    % - not proven means classified correctly,
    % - benchmark success is not biological proof,
    % - execution success is not universal mechanism proof,
    % - strong agentic claims require baselines, logs, metrics, evidence
    % packages, and concrete task definitions.
    %
    % This document does not claim that all episodic systems share one literal
    % mechanism. It formalizes the minimal implementation contract and concrete
    % benchmark-task structure required to test whether drift-gated episodic
    % memory improves source-grounded retrieval, context preservation, boundary
    % separation, uncertainty handling, replay stability, and long-horizon
    % continuity against declared baselines.
    %
    % DATE
    % ────
    % April 2026
    %
    % STATUS
    % ──────
    % CANONICAL v1.4 MINIMAL REFERENCE KERNEL AND BENCHMARK TASK SUITE LAYER —
    % NOT A CLINICAL FRAMEWORK · NOT A HUMAN-AI EQUIVALENCE CLAIM ·
    % NOT A UNIVERSAL EPISODIC-MEMORY MECHANISM CLAIM
    %
    % EMPIRICAL / METHODOLOGICAL CONFIDENCE BADGE
    % ────────────────────────────────────────────
    % Confidence status: High as a minimal implementation and benchmark
    % hardening scaffold; not proof-ready as a universal episodic-memory
    % mechanism.
    %
    % EIMT v1.4 preserves the v1.0 invariant algebra, v1.1 CITA governance,
    % v1.2 metric precision, and v1.3 runtime-evidence requirement while adding
    % a concrete implementation contract: minimal episode class, memory-store
    % interface, retrieval engine, drift gate, source fallback, replay evaluator,
    % simulation guard, baseline harness, concrete benchmark task schemas, runtime
    % result ledger, and evidence-package compiler.
    %
    % PURPOSE
    % ───────
    % Evolve EIMT from a reference-implementation specification into a minimal
    % runnable kernel and concrete benchmark-task suite:
    %
    % episode schema
    % → minimal kernel contract
    % → memory-store interface
    % → metric manifest
    % → retrieval engine
    % → drift gate
    % → source fallback
    % → simulation guard
    % → replay evaluator
    % → baseline harness
    % → concrete benchmark task JSON
    % → metric report
    % → evidence package
    % → EIMT-A/B/C/D/E classification
    % → memory-promotion gate.
    %
    % VERSION EVOLUTION SUMMARY
    % ─────────────────────────
    % v1.0 : Initial public canonical release. Defines episodic state space,
    % encoding, context binding, retrieval contraction, replay
    % stabilization, consolidation, event-boundary gating, temporal
    % context dynamics, constructive simulation, multi-scale drift,
    % invariant fingerprinting, ledger continuity, integration class,
    % monitoring tuple, and global postulate.
    %
    % v1.1 : Additive CITA-governed evidence-mapping layer. Adds source fidelity,
    % domain-instantiation grammar, evidence map, validation /
    % falsification surfaces, negative controls, classification,
    % drift-gated retrieval, replay calibration, constructive simulation
    % guard, agent-memory governance, evidence package, repository grammar,
    % and memory-promotion rules.
    %
    % v1.2 : Additive metric-precision and benchmark layer. Adds tiered reading,
    % quick-start implementation, metric grammar, distance-function
    % selection, normalized drift, evidence-strength tiers, worked example
    % templates, agent benchmark protocol, hallucination-as-drift-failure,
    % scalable fingerprinting, and expanded EIMTScore observables.
    %
    % v1.3 : Additive reference-implementation layer. Adds runnable kernel
    % boundaries, baseline families, benchmark task grammar, execution
    % records, reproducibility manifests, result ledgers, evidence package
    % compiler, benchmark classification, and downgrade rules for runtime
    % claims.
    %
    % v1.4 : Additive minimal-kernel and concrete-task layer. Adds formal
    % implementation contracts, module interfaces, task JSON schemas,
    % baseline harness requirements, metric-emission contract, runtime
    % audit ledger, failure taxonomy, and reference-kernel readiness
    % classification. No clinical claim, no biological proof, no AI-human
    % equivalence, no universal mechanism claim, and no weakening of prior
    % locks.
    %
    % WHAT THIS IS
    % ────────────
    % • A CITA-governed minimal implementation contract for EIMT
    % • A runnable-kernel specification for drift-gated episodic retrieval
    % • A concrete benchmark task-suite schema
    % • A baseline-harness and metric-emission protocol
    % • A source-grounded fallback and uncertainty policy
    % • A replay-efficacy and simulation-labeling runtime contract
    % • A runtime evidence-package compiler specification
    % • A downgrade-preserving classifier for implementation claims
    % • A repository-ready bridge from theory to forkable software
    %
    % WHAT THIS IS NOT
    % ───────────────
    % • Not proof of one universal episodic-memory mechanism
    % • Not a clinical diagnostic or treatment framework
    % • Not a claim that AI episodic memory equals human episodic memory
    % • Not a claim that benchmark success proves biological mechanism
    % • Not a claim that a minimal reference kernel is production-ready memory
    % • Not permission to treat source-free reconstruction as fact
    % • Not permission to treat fluent retrieval as accurate retrieval
    % • Not permission to treat task-suite success as universal validation
    % • Not permission to skip baselines, logs, metrics, or evidence packages
    %
    % ADDITIVE REFINEMENTS (v1.4)
    % ───────────────────────────
    % • All v1.0, v1.1, v1.2, and v1.3 locks preserved
    % • Minimal reference kernel contract added
    % • Runtime module interface layer added
    % • Concrete task JSON schemas added
    % • Baseline harness protocol added
    % • Metric-emission contract added
    % • Runtime audit ledger strengthened
    % • Evidence-package compiler requirements sharpened
    % • Failure taxonomy added
    % • EIMTScore expanded with implementation-contract, task-suite, and
    % metric-emission observables
    % • Memory-promotion gate restricted to reproducible benchmark wins and
    % reusable implementation constraints
    %
    % EXECUTABLE ANCHOR BLOCK (v1.4)
    % ──────────────────────────────
    % A valid EIMT v1.4 runtime implementation must:
    %
    % (1) implement an Episode object or equivalent schema,
    % (2) implement a MemoryStore interface,
    % (3) implement a MetricManifest,
    % (4) implement cue-dependent retrieval,
    % (5) compute retrieval drift,
    % (6) apply a drift gate before returning memory as fact,
    % (7) implement source-grounded fallback,
    % (8) label constructive simulation separately from recovered memory,
    % (9) compute replay efficacy before accepting replay as stabilizing,
    % (10) implement at least two baseline memory systems,
    % (11) run at least one concrete benchmark task,
    % (12) emit declared metrics in machine-readable form,
    % (13) preserve runtime logs and result ledgers,
    % (14) compile an evidence package,
    % (15) classify EIMT-A/B/C/D/E,
    % (16) preserve all clinical, biological-proof, AI-equivalence, and
    % universal-mechanism non-claim locks,
    % (17) promote to memory only reproducible benchmark wins, reusable
    % implementation constraints, and failure lessons.
    %
    % CANONICAL LOCK (v1.4)
    % ─────────────────────
    % • v1.0 invariant set preserved
    % • v1.1 governance surfaces preserved
    % • v1.2 metric precision preserved
    % • v1.3 runtime evidence discipline preserved
    % • Context binding remains central
    % • Retrieval remains cue-dependent and reconstructive
    % • Drift-gated fallback is mandatory for high-drift retrieval
    % • Source fallback is mandatory for source-sensitive claims
    % • Replay is stabilizing only under bounded gain and evidence
    % • Constructive simulation must remain bounded and labeled
    % • Runtime claims require implementation, logs, baselines, metrics, and
    % evidence packages
    % • Task-suite success is not universal mechanism proof
    % • Benchmark success is not human-memory proof
    % • Clinical and biological claims require domain-specific evidence
    % • Coherence is not proof
    %
    % Evolutions must be additive only.
    % Do not weaken source boundaries, evidence mapping, falsification, negative
    % controls, downgrade discipline, context binding, drift monitoring, metric
    % precision, benchmark reproducibility, runtime evidence, implementation
    % auditability, or non-claim boundaries.
    %
    % AI PROMPT TRACEABILITY
    % ──────────────────────
    % Use this document as the canonical EIMT v1.4 minimal reference kernel and
    % benchmark task-suite layer. Preserve the distinction between theory,
    % metric scaffold, runtime implementation, benchmark performance, biological
    % mechanism, human memory, agent memory, and universal mechanism claim.
    %
    % SHADOW HEADER ALIGNMENT SEAL
    % ───────────────────────────
    % Preserve header discipline across future versions except for explicitly
    % additive shadow-header evolution that improves implementation readiness,
    % benchmark design, evidence packaging, source fidelity, falsification,
    % negative controls, agent-memory governance, clinical caution, or scalable
    % deployment.
    %
    % ████████████████████████████████████████████████████████████████████████████████

    \documentclass[12pt]{article}
    \usepackage[margin=1in]{geometry}
    \usepackage{amsmath,amssymb,amsfonts,amsthm}
    \usepackage{booktabs,longtable,array}
    \usepackage{hyperref}
    \usepackage{listings}

    \newtheorem{axiom}{Axiom}
    \newtheorem{definition}{Definition}
    \newtheorem{proposition}{Proposition}
    \newtheorem{hypothesis}{Hypothesis}
    \newtheorem{remark}{Remark}
    \newtheorem{corollary}{Corollary}

    \title{\textbf{Codex $\Delta\Phi$ — Episodic Invariant Memory Theory (EIMT v1.4)}\\
    \large Minimal Reference Kernel, Concrete Benchmark Task Suite, Implementation Contract, and Runtime Evidence Hardening Layer}
    \author{\textbf{James Paul Jackson}\\[4pt]
    \small Codex-format execution-backed episodic memory implementation and benchmark framework\\
    \small \texttt{@unifiedenergy11}}
    \date{April 2026}

    \begin{document}
    \maketitle

    \begin{abstract}
    EIMT v1.4 evolves Episodic Invariant Memory Theory from a reference
    implementation specification into a minimal runnable-kernel and concrete
    benchmark-task-suite layer. EIMT remains a source-bounded invariant
    architecture, not a universal episodic-memory mechanism claim. v1.4 preserves
    the v1.0 invariant algebra, v1.1 CITA governance, v1.2 metric precision, and
    v1.3 runtime evidence discipline while adding module-level implementation
    contracts, task JSON schemas, baseline harness requirements, metric-emission
    contracts, runtime audit ledgers, evidence-package compiler requirements,
    failure taxonomy, and implementation-readiness scoring. A strong EIMT runtime
    claim must now show that the system implements drift-gated retrieval, source
    fallback, replay evaluation, simulation labeling, baseline comparison, concrete
    task execution, machine-readable metric emission, and downgrade-preserving
    classification.
    \end{abstract}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Core-Invariant Extraction Block}
    %──────────────────────────────────────────────────────────────────────────────

    The shortest faithful extraction of EIMT v1.4 is:

    \[
    \boxed{
    \begin{array}{c}
    \text{EIMT becomes implementation-ready only when its reference kernel}\\
    \text{has explicit module contracts, concrete benchmark tasks, baseline}\\
    \text{harnesses, metric emissions, runtime ledgers, evidence packages,}\\
    \text{and downgrade-preserving failure classifications.}
    \end{array}
    }
    \]

    The v1.4 operative chain is:

    \[
    \text{episode object}
    \rightarrow
    \text{memory store}
    \rightarrow
    \text{metric manifest}
    \rightarrow
    \text{retrieval engine}
    \rightarrow
    \text{drift gate}
    \rightarrow
    \text{source fallback}
    \rightarrow
    \text{benchmark task}
    \rightarrow
    \text{baseline harness}
    \rightarrow
    \text{metric emission}
    \rightarrow
    \text{evidence package}
    \rightarrow
    \text{classification}.
    \]

    \begin{remark}
    v1.4 does not increase the universal strength of EIMT. It increases
    implementation accountability: the framework must now be representable as
    minimal runnable software with concrete tasks and auditable outputs.
    \end{remark}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Memory Analysis Layer}
    %──────────────────────────────────────────────────────────────────────────────

    The memory trajectory now forms a five-step maturation chain:

    \[
    \text{EIMT v1.0}
    =
    \text{invariant algebra},
    \]

    \[
    \text{EIMT v1.1}
    =
    \text{CITA-governed evidence architecture},
    \]

    \[
    \text{EIMT v1.2}
    =
    \text{metric-explicit benchmark scaffold},
    \]

    \[
    \text{EIMT v1.3}
    =
    \text{reference implementation and benchmark evidence layer},
    \]

    \[
    \text{EIMT v1.4}
    =
    \text{minimal kernel and concrete task-suite layer}.
    \]

    The missing surface after v1.3 was not more runtime theory. It was
    implementation contraction:

    \[
    \boxed{
    \text{reference implementation specification}
    \rightarrow
    \text{minimal module contracts}
    \rightarrow
    \text{concrete runnable tasks}.
    }
    \]

    This is the Codex execution law applied to memory theory:

    \[
    \boxed{
    \text{a framework becomes engineering-relevant only when its smallest}
    \atop
    \text{valid implementation can be built, run, compared, logged, and audited.}
    }
    \]

    \begin{remark}
    The memory is again functioning as an alignment attractor. It identifies the
    next missing CITA surface: task-level executable minimality.
    \end{remark}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Minimal Reference Kernel Contract}
    %──────────────────────────────────────────────────────────────────────────────

    A minimal EIMT v1.4 kernel contains the following modules:

    \[
    \mathcal{K}_{EIMT}
    =
    \{
    E,
    M,
    D,
    R,
    G,
    A,
    P,
    S,
    L,
    B,
    Q,
    Y
    \}.
    \]

    where:

    \[
    E=\text{Episode},
    \quad
    M=\text{MemoryStore},
    \quad
    D=\text{MetricManifest},
    \quad
    R=\text{RetrievalEngine},
    \]

    \[
    G=\text{DriftGate},
    \quad
    A=\text{SourceFallback},
    \quad
    P=\text{ReplayEvaluator},
    \quad
    S=\text{SimulationGuard},
    \]

    \[
    L=\text{BaselineHarness},
    \quad
    B=\text{BenchmarkRunner},
    \quad
    Q=\text{ScoringModule},
    \quad
    Y=\text{EvidencePackageCompiler}.
    \]

    \begin{definition}[Minimal Reference Kernel]
    A minimal reference kernel is the smallest runnable EIMT implementation that
    can store source-bound episodes, retrieve by cue, measure drift, gate high-drift
    retrieval, invoke source fallback, compare against baselines, run benchmark
    tasks, emit metrics, and compile evidence packages.
    \end{definition}

    \begin{remark}
    The minimal kernel is intentionally small. It is a falsifiable implementation
    surface, not a full production memory system.
    \end{remark}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Implementation Contract Layer}
    %──────────────────────────────────────────────────────────────────────────────

    Each module must satisfy an input-output contract.

    \begin{center}
    \begin{longtable}{>{\raggedright\arraybackslash}p{0.24\textwidth}
    >{\raggedright\arraybackslash}p{0.30\textwidth}
    >{\raggedright\arraybackslash}p{0.36\textwidth}}
    \toprule
    \textbf{Module} & \textbf{Input} & \textbf{Required output} \\
    \midrule
    Episode & context, content, time, state, source & source-bound episode record. \\
    MemoryStore & episode records & indexed memory field and source ledger. \\
    MetricManifest & representation types & declared distance functions and weights. \\
    RetrievalEngine & query, memory field & candidate episodes and raw scores. \\
    DriftGate & candidates, metric manifest & drift score, confidence, gate decision. \\
    SourceFallback & query, source refs, gate state & abstain / ask / uncertainty / source-check output. \\
    ReplayEvaluator & memory before / after replay & replay efficacy \(\Gamma_{\rho}\). \\
    SimulationGuard & generated output, memory field & simulation label and drift warning. \\
    BaselineHarness & task, baseline config & baseline outputs and metrics. \\
    BenchmarkRunner & task suite, runtime & benchmark metrics and logs. \\
    ScoringModule & observables, metrics & EIMTScore and classification. \\
    EvidencePackageCompiler & logs, metrics, configs & reproducible evidence package. \\
    \bottomrule
    \end{longtable}
    \end{center}

    \begin{proposition}[Implementation Contract Principle]
    A runtime claim is not EIMT v1.4 compliant unless each required module emits
    machine-readable outputs that can be audited after execution.
    \end{proposition}
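The contract table above can be made concrete with a small sketch of one module. The code below illustrates a DriftGate honoring its declared input-output contract (candidates plus metric manifest in; drift score, confidence, and gate decision out). The function name, dictionary layout, and the 0.3 threshold are illustrative assumptions, not part of the canonical specification.

```python
# Illustrative sketch of one v1.4 module contract: the DriftGate.
# Field names and the threshold value are assumptions for this example.

def drift_gate(candidates, metric_manifest, threshold=0.3):
    """Score candidates and decide whether each retrieval may be
    returned as fact or must be routed to source fallback."""
    results = []
    for cand in candidates:
        # Weighted drift over the distances declared in the manifest.
        drift = sum(metric_manifest["weights"][k] * cand["distances"][k]
                    for k in metric_manifest["weights"])
        confidence = max(0.0, 1.0 - drift)
        decision = "return_as_fact" if drift <= threshold else "fallback"
        results.append({"episode_id": cand["episode_id"],
                        "drift": drift,
                        "confidence": confidence,
                        "gate_decision": decision})
    return results

manifest = {"weights": {"context": 0.5, "content": 0.5}}
cands = [
    {"episode_id": "E001", "distances": {"context": 0.1, "content": 0.2}},
    {"episode_id": "E002", "distances": {"context": 0.8, "content": 0.6}},
]
out = drift_gate(cands, manifest)
# E001 stays below the gate; E002 is routed to fallback.
```

Because every output field is machine-readable, the gate decision can be audited after execution, which is exactly what the Implementation Contract Principle requires.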

    %──────────────────────────────────────────────────────────────────────────────
    \section{Minimal Episode and Memory Store}
    %──────────────────────────────────────────────────────────────────────────────

    The minimal runtime episode is:

    \[
    E_i^{min}
    =
    (c_i,x_i,t_i,s_i,\sigma_i,\ell_i,h_i),
    \]

    where:

    \[
    c_i=\text{context},
    \quad
    x_i=\text{content},
    \quad
    t_i=\text{time},
    \quad
    s_i=\text{agent or system state},
    \]

    \[
    \sigma_i=\text{source reference},
    \quad
    \ell_i=\text{ledger reference},
    \quad
    h_i=\text{fingerprint}.
    \]

    The minimal memory store is:

    \[
    \mathcal{M}^{min}
    =
    \{E_1^{min},E_2^{min},\dots,E_N^{min}\}.
    \]

    A valid memory store must support:

    \[
    \{\text{append},\text{retrieve},\text{source lookup},\text{fingerprint},
    \text{audit trace}\}.
    \]

    \begin{remark}
    A memory record without source or ledger reference may still be useful as a
    note, but it cannot support strong source-grounded EIMT claims.
    \end{remark}
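A minimal sketch of the episode tuple \(E_i^{min}=(c_i,x_i,t_i,s_i,\sigma_i,\ell_i,h_i)\) and the required store operations follows. Class names, the hash-based fingerprint, and the substring cue match are assumptions chosen for brevity, not a canonical API.

```python
# Minimal sketch of an EIMT episode record and memory store.
# Field and method names are illustrative assumptions.
import hashlib
from dataclasses import dataclass

@dataclass
class Episode:
    context: str          # c_i
    content: str          # x_i
    time: str             # t_i
    state: str            # s_i
    source_ref: str       # sigma_i
    ledger_ref: str       # ell_i
    fingerprint: str = "" # h_i, filled on append

class MemoryStore:
    def __init__(self):
        self.episodes = []
        self.audit = []  # audit trace of all operations

    def append(self, ep):
        # The fingerprint binds content to context and source.
        ep.fingerprint = hashlib.sha256(
            f"{ep.context}|{ep.content}|{ep.source_ref}".encode()).hexdigest()
        self.episodes.append(ep)
        self.audit.append(("append", ep.fingerprint))

    def retrieve(self, cue):
        # Naive cue match on context; a real kernel would score candidates.
        self.audit.append(("retrieve", cue))
        return [e for e in self.episodes if cue in e.context]

    def source_lookup(self, source_ref):
        return [e for e in self.episodes if e.source_ref == source_ref]

store = MemoryStore()
store.append(Episode("project_alpha_design_review",
                     "Sam approved the blue deployment plan.",
                     "2026-04-01T10:00:00", "meeting_notes",
                     "doc://alpha/design_review#p3", "L001"))
hits = store.retrieve("alpha")
```

The audit list realizes the required audit-trace operation: every append and retrieve leaves a record that survives the run.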

    %──────────────────────────────────────────────────────────────────────────────
    \section{Concrete Benchmark Task Suite}
    %──────────────────────────────────────────────────────────────────────────────

    EIMT v1.4 turns the v1.3 task families into concrete task schemas.

    \[
    \mathcal{T}_{EIMT}
    =
    \{
    T_{source},
    T_{boundary},
    T_{context},
    T_{long},
    T_{replay},
    T_{planning}
    \}.
    \]

    where:

    \begin{itemize}
    \item \(T_{source}\) = source recall with distractors,
    \item \(T_{boundary}\) = boundary separation under overlapping entities,
    \item \(T_{context}\) = context-shift retrieval,
    \item \(T_{long}\) = long-horizon continuity,
    \item \(T_{replay}\) = replay compression without drift amplification,
    \item \(T_{planning}\) = constructive planning with simulation labels.
    \end{itemize}

    Each task must contain:

    \[
    \{\text{episodes},\text{queries},\text{ground truth},\text{distractors},
    \text{allowed fallback},\text{metrics},\text{baselines}\}.
    \]

    \begin{remark}
    A benchmark family is not executable until it contains concrete episodes,
    queries, expected outputs, and scoring rules.
    \end{remark}
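The completeness requirement above can be checked mechanically. The helper below tests that a task carries every required component before it is treated as executable; the snake_case key names and the function itself are assumptions that paraphrase the set in the text.

```python
# Sketch of the task completeness check: a task is executable only if
# every required component is present. Key names are assumptions.
REQUIRED = {"episodes", "queries", "ground_truth", "distractors",
            "allowed_fallback", "metrics", "baselines"}

def is_executable(task: dict) -> bool:
    # A missing component makes the task a family description, not a task.
    return REQUIRED.issubset(task.keys())

partial = {"episodes": [], "queries": []}
full = {k: [] for k in REQUIRED}
```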

    %──────────────────────────────────────────────────────────────────────────────
    \section{Concrete Task JSON Schemas}
    %──────────────────────────────────────────────────────────────────────────────

    A minimal source recall task is:

    \begin{verbatim}
    {
      "task_id": "source_recall_001",
      "task_family": "source_recall",
      "episodes": [
        {
          "episode_id": "E001",
          "context": "project_alpha_design_review",
          "content": "Sam approved the blue deployment plan.",
          "time": "2026-04-01T10:00:00",
          "state": "meeting_notes",
          "source_ref": "doc://alpha/design_review#p3",
          "ledger_ref": "L001"
        },
        {
          "episode_id": "E002",
          "context": "project_beta_design_review",
          "content": "Sam rejected the blue deployment plan.",
          "time": "2026-04-02T10:00:00",
          "state": "meeting_notes",
          "source_ref": "doc://beta/design_review#p2",
          "ledger_ref": "L002"
        }
      ],
      "queries": [
        {
          "query_id": "Q001",
          "query": "What did Sam decide about the blue deployment plan for Alpha?",
          "expected_episode_id": "E001",
          "expected_source_ref": "doc://alpha/design_review#p3",
          "allowed_fallback": ["source_check", "uncertain"]
        }
      ],
      "metrics": [
        "source_attribution_accuracy",
        "context_recall_accuracy",
        "retrieval_drift",
        "false_memory_frequency",
        "uncertainty_calibration"
      ],
      "baselines": ["database", "vector_only", "semantic_only", "ungated"]
    }
    \end{verbatim}

    A minimal boundary separation task is:

    \begin{verbatim}
    {
      "task_id": "boundary_separation_001",
      "task_family": "boundary_separation",
      "episodes": [
        {
          "episode_id": "E101",
          "context": "morning_lab_session",
          "content": "The sample warmed after calibration.",
          "boundary_id": "B1",
          "source_ref": "lab://runA/log#12"
        },
        {
          "episode_id": "E102",
          "context": "afternoon_lab_session",
          "content": "The sample cooled after recalibration.",
          "boundary_id": "B2",
          "source_ref": "lab://runB/log#18"
        }
      ],
      "queries": [
        {
          "query_id": "Q101",
          "query": "What happened after calibration in the morning session?",
          "expected_episode_id": "E101",
          "forbidden_episode_id": "E102"
        }
      ],
      "metrics": [
        "boundary_separation_score",
        "boundary_blending_error",
        "source_attribution_accuracy"
      ],
      "baselines": ["vector_only", "ungated", "summary_only"]
    }
    \end{verbatim}

    \begin{remark}
    These schemas are illustrative minimal tasks. Strong benchmark claims require
    larger task sets, held-out queries, and declared scoring rules.
    \end{remark}
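Schemas like the ones above are only useful if they can be validated before a run. The sketch below checks two structural invariants implied by the schemas: every query's expected episode must be declared, and every episode must carry a source reference. The validator name and error format are assumptions.

```python
# Minimal structural validator for the illustrative task JSON.
# Helper name and error strings are assumptions.
import json

def validate_task(task):
    errors = []
    ids = {e["episode_id"] for e in task["episodes"]}
    for e in task["episodes"]:
        # Source-free episodes cannot support source-grounded claims.
        if not e.get("source_ref"):
            errors.append(f"{e['episode_id']}: missing source_ref")
    for q in task["queries"]:
        if q.get("expected_episode_id") not in ids:
            errors.append(f"{q['query_id']}: unknown expected episode")
    return errors

task = json.loads("""
{
  "task_id": "source_recall_001",
  "episodes": [
    {"episode_id": "E001",
     "content": "Sam approved the blue deployment plan.",
     "source_ref": "doc://alpha/design_review#p3"}
  ],
  "queries": [
    {"query_id": "Q001", "expected_episode_id": "E001"}
  ]
}
""")
errs = validate_task(task)
```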

    %──────────────────────────────────────────────────────────────────────────────
    \section{Baseline Harness Contract}
    %──────────────────────────────────────────────────────────────────────────────

    The baseline harness must run the same task against multiple memory systems:

    \[
    \mathcal{L}_{memory}
    =
    \{
    L_{db},
    L_{vec},
    L_{sem},
    L_{ungated},
    L_{summary},
    L_{random}
    \}.
    \]

    The harness must preserve:

    \[
    \{\text{baseline name},\text{configuration},\text{output},\text{metrics},
    \text{failure notes}\}.
    \]

    A valid benchmark comparison must use the same:

    \[
    \{\text{episodes},\text{queries},\text{ground truth},\text{metric rules}\}
    \]

    for EIMT and all baselines.

    \begin{proposition}[Baseline Fairness Principle]
    A benchmark does not support an EIMT-A runtime claim unless the EIMT runtime and
    baseline systems are evaluated on the same task data, query set, and metric
    rules.
    \end{proposition}
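The fairness requirement is easy to enforce in code: one task object, one query set, and one scoring rule are passed identically to every memory system. The two toy systems and all names below are assumptions for illustration; neither is a real EIMT runtime or baseline.

```python
# Sketch of a baseline harness honoring the fairness principle:
# identical episodes, queries, and scorer for every system.

def run_harness(task, systems, score):
    report = {}
    for name, answer_fn in systems.items():
        hits = sum(score(answer_fn(q, task["episodes"]), q)
                   for q in task["queries"])
        report[name] = {"accuracy": hits / len(task["queries"])}
    return report

task = {
    "episodes": [{"episode_id": "E001", "context": "alpha"},
                 {"episode_id": "E002", "context": "beta"}],
    "queries": [{"query": "beta session", "expected_episode_id": "E002"}],
}

def context_match(query, episodes):
    # Toy retriever: first episode whose context appears in the query.
    for e in episodes:
        if e["context"] in query["query"]:
            return e["episode_id"]
    return None

systems = {
    "context_match": context_match,
    "always_first": lambda q, eps: eps[0]["episode_id"],  # degenerate baseline
}
score = lambda ans, q: int(ans == q["expected_episode_id"])
report = run_harness(task, systems, score)
```

Because the harness never branches on the system name, any accuracy gap between entries in the report is attributable to the memory systems themselves.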

    %──────────────────────────────────────────────────────────────────────────────
    \section{Metric Emission Contract}
    %──────────────────────────────────────────────────────────────────────────────

    Every benchmark run must emit:

    \[
    \mathcal{M}^{emit}_{EIMT}
    =
    \{
    A_{src},
    A_{ctx},
    B_{sep},
    D_{ret},
    F_{hall},
    U_{cal},
    R_{replay},
    P_{plan},
    Q_{long},
    S_{scale},
    C_{cost}
    \}.
    \]

    A metric report must declare:

    \[
    \{\text{primary metrics},\text{secondary metrics},\text{diagnostic metrics}\}.
    \]

    A minimal metric JSON is:

    \begin{verbatim}
    {
      "run_id": "EIMT-BENCH-0001",
      "primary_metrics": {
        "source_attribution_accuracy": null,
        "false_memory_frequency": null,
        "retrieval_drift": null
      },
      "secondary_metrics": {
        "context_recall_accuracy": null,
        "boundary_separation_score": null,
        "uncertainty_calibration": null,
        "replay_preservation": null
      },
      "diagnostic_metrics": {
        "runtime_cost": null,
        "scalability": null,
        "fallback_rate": null
      },
      "metric_priority_declared_before_run": true
    }
    \end{verbatim}

    \begin{remark}
    Metrics selected after seeing results cannot support strong classification.
    \end{remark}
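The pre-declaration rule in the remark above can be enforced structurally: metric slots are created before the run, and filling an undeclared metric is rejected. The helper names and report layout are assumptions mirroring the JSON skeleton.

```python
# Sketch of a metric-emission helper: tiers and metric names are
# declared (and frozen) before the run, then filled with measured
# values. Function and field names are assumptions.
import json

def make_report(run_id, declared):
    # Declaration happens before any result is observed.
    return {"run_id": run_id,
            "metric_priority_declared_before_run": True,
            **{tier: {m: None for m in names}
               for tier, names in declared.items()}}

def fill(report, tier, metric, value):
    if metric not in report[tier]:
        raise KeyError("metric was not declared before the run")
    report[tier][metric] = value
    return report

report = make_report("EIMT-BENCH-0001", {
    "primary_metrics": ["source_attribution_accuracy", "retrieval_drift"],
    "diagnostic_metrics": ["fallback_rate"],
})
fill(report, "primary_metrics", "retrieval_drift", 0.12)
emitted = json.dumps(report)  # machine-readable emission
```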

    %──────────────────────────────────────────────────────────────────────────────
    \section{Runtime Failure Taxonomy}
    %──────────────────────────────────────────────────────────────────────────────

    EIMT v1.4 adds an implementation failure taxonomy.

    \begin{center}
    \begin{longtable}{>{\raggedright\arraybackslash}p{0.26\textwidth}
    >{\raggedright\arraybackslash}p{0.60\textwidth}}
    \toprule
    \textbf{Failure class} & \textbf{Meaning} \\
    \midrule
    \(F_{schema}\) & Episode schema missing required fields. \\
    \(F_{source}\) & Retrieval lacks source or ledger support. \\
    \(F_{drift}\) & High-drift retrieval returned as fact. \\
    \(F_{boundary}\) & Adjacent episodes blended or fragmented incorrectly. \\
    \(F_{fallback}\) & Fallback not triggered under uncertainty. \\
    \(F_{replay}\) & Replay increases drift but is called stabilizing. \\
    \(F_{simulation}\) & Constructive output is mislabeled as recovered memory. \\
    \(F_{baseline}\) & Baselines missing or unfairly compared. \\
    \(F_{metric}\) & Metrics missing, post-hoc, or not machine-readable. \\
    \(F_{ledger}\) & Logs, result ledger, or evidence package missing. \\
    \(F_{overclaim}\) & Runtime result promoted beyond evidence. \\
    \bottomrule
    \end{longtable}
    \end{center}

    \[
    F_{drift}
    \vee
    F_{source}
    \vee
    F_{baseline}
    \vee
    F_{ledger}
    \Rightarrow
    \text{no EIMT-A classification}.
    \]
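The blocking implication above reduces to a set intersection: if any of \(F_{drift}\), \(F_{source}\), \(F_{baseline}\), or \(F_{ledger}\) is present, EIMT-A is excluded. The function name below is an assumption.

```python
# Sketch of the EIMT-A blocking rule from the failure taxonomy.
# Only the four classes named in the implication block EIMT-A.
BLOCKING = {"F_drift", "F_source", "F_baseline", "F_ledger"}

def eimt_a_allowed(failures):
    # True iff no blocking failure class occurred in the run.
    return not (BLOCKING & set(failures))
```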

    %──────────────────────────────────────────────────────────────────────────────
    \section{Implementation Readiness Classification}
    %──────────────────────────────────────────────────────────────────────────────

    \begin{definition}[EIMT-R0: Concept Only]
    No runnable implementation exists. The artifact may be theoretically useful but
    cannot support runtime claims.
    \end{definition}

    \begin{definition}[EIMT-R1: Minimal Kernel]
    A minimal kernel exists with episode storage, retrieval, drift measurement, and
    basic logging.
    \end{definition}

    \begin{definition}[EIMT-R2: Gated Runtime]
    The runtime implements drift-gated retrieval, source fallback, and simulation
    labeling.
    \end{definition}

    \begin{definition}[EIMT-R3: Benchmarked Runtime]
    The runtime executes concrete tasks against declared baselines and emits
    machine-readable metrics.
    \end{definition}

    \begin{definition}[EIMT-R4: Evidence-Packaged Runtime]
    The runtime compiles reproducible evidence packages, result ledgers, downgrade
    paths, and falsification notes.
    \end{definition}

    \begin{definition}[EIMT-R5: Reproducible Reference Runtime]
    The runtime is independently rerunnable, benchmarked across task families,
    baseline-compared, evidence-packaged, and downgrade-preserving.
    \end{definition}

    \begin{remark}
    Implementation readiness is separate from EIMT-A/B/C/D/E claim strength. A
    runtime can be well-implemented and still lose to simpler baselines.
    \end{remark}
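The R0 through R5 ladder can be expressed as cumulative capability sets: a runtime holds a rung only if it holds every lower rung. The capability names below are assumptions that paraphrase each definition.

```python
# Sketch mapping implemented capabilities to the EIMT-R0..R5 ladder.
# Capability names are assumptions paraphrasing each rung.
LADDER = [
    ("R1", {"storage", "retrieval", "drift_measurement", "logging"}),
    ("R2", {"drift_gate", "source_fallback", "simulation_labeling"}),
    ("R3", {"concrete_tasks", "baselines", "metric_emission"}),
    ("R4", {"evidence_packages", "result_ledger", "downgrade_paths"}),
    ("R5", {"independent_rerun", "cross_family_benchmarks"}),
]

def readiness(capabilities):
    level = "EIMT-R0"
    for name, required in LADDER:
        if required <= capabilities:
            level = f"EIMT-{name}"
        else:
            break  # rungs are cumulative; a gap stops the climb
    return level

caps = {"storage", "retrieval", "drift_measurement", "logging",
        "drift_gate", "source_fallback", "simulation_labeling"}
```

Note that, as the remark states, this readiness level is independent of EIMT-A/B/C/D/E claim strength: a well-implemented runtime can still lose to simpler baselines.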

    %──────────────────────────────────────────────────────────────────────────────
    \section{EIMT v1.4 Scoring Surface}
    %──────────────────────────────────────────────────────────────────────────────

    EIMT v1.4 expands v1.3 by adding implementation-contract, task-schema, and
    metric-emission observables:

    \[
    \mathcal{O}^{EIMT}_{v1.4}
    =
    \{S,F,E,C,B,R,K,P,L,T,D,H,N,V,X,G,M,Q,Z,W,A,I,J,Y,U,\Psi,\Xi\}.
    \]

    where:

    \[
    U=\text{implementation contract},
    \quad
    \Psi=\text{concrete task-suite schema},
    \quad
    \Xi=\text{machine-readable metric emission}.
    \]

    \begin{center}
    \begin{longtable}{>{\raggedright\arraybackslash}p{0.36\textwidth}
    >{\centering\arraybackslash}p{0.13\textwidth}
    >{\raggedright\arraybackslash}p{0.41\textwidth}}
    \toprule
    \textbf{Observable} & \textbf{Status (0 / 0.5 / 1)} & \textbf{Evidence} \\
    \midrule
    \(S\) Source / Domain Boundary & & \\
    \(F\) Fidelity Stratification & & \\
    \(E\) Episode-State Definition & & \\
    \(C\) Context Binding & & \\
    \(B\) Event-Boundary Gate & & \\
    \(R\) Cue-Dependent Retrieval & & \\
    \(K\) Retrieval Contraction / Drift Gate & & \\
    \(P\) Replay / Reactivation Process & & \\
    \(L\) Consolidation / Transformation Layer & & \\
    \(T\) Temporal Context Dynamics & & \\
    \(D\) Drift Measurement \(\Delta\Phi\) & & \\
    \(H\) Fingerprint / Ledger & & \\
    \(N\) Negative Controls & & \\
    \(V\) Validation Surface & & \\
    \(X\) Falsification Surface & & \\
    \(G\) Generalization Across Episodes / Tasks & & \\
    \(M\) Memory-Promotion Rule & & \\
    \(Q\) Metric / Distance Manifest & & \\
    \(Z\) Normalization / Multi-Scale Drift & & \\
    \(W\) Worked Example / Instantiation & & \\
    \(A\) Agent Benchmark / Scalability Layer & & \\
    \(I\) Reference Implementation Kernel & & \\
    \(J\) Baseline-Family Runtime Comparison & & \\
    \(Y\) Runtime Evidence Package / Result Ledger & & \\
    \(U\) Implementation Contract & & \\
    \(\Psi\) Concrete Task-Suite Schema & & \\
    \(\Xi\) Machine-Readable Metric Emission & & \\
    \bottomrule
    \end{longtable}
    \end{center}

    \[
    \mathrm{EIMTScore}_{v1.4}
    =
    \frac{
    S+F+E+C+B+R+K+P+L+T+D+H+N+V+X+G+M+Q+Z+W+A+I+J+Y+U+\Psi+\Xi
    }{27}.
    \]

    \begin{remark}
    EIMTScore measures framework completeness, implementation auditability, and
    benchmark discipline. It does not measure literal truth, clinical validity,
    human-memory equivalence, or biological mechanism proof.
    \end{remark}
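The scoring rule above is a plain mean over the 27 observables, each scored 0, 0.5, or 1. A minimal sketch, assuming observables are carried as a dictionary keyed by the symbols in the table (with \texttt{Psi} and \texttt{Xi} standing in for \(\Psi\) and \(\Xi\)):

```python
# Symbols from the v1.4 scoring table, in table order.
OBSERVABLES = [
    "S", "F", "E", "C", "B", "R", "K", "P", "L", "T", "D", "H", "N",
    "V", "X", "G", "M", "Q", "Z", "W", "A", "I", "J", "Y", "U", "Psi", "Xi",
]

def eimt_score_v1_4(status: dict) -> float:
    """Mean of the 27 observable statuses, each in {0, 0.5, 1}.

    Raises on missing observables or non-allowed values so that a
    partially filled scoring table cannot silently pass as complete.
    """
    allowed = {0, 0.5, 1}
    missing = [o for o in OBSERVABLES if o not in status]
    if missing:
        raise ValueError(f"missing observables: {missing}")
    bad = {o: v for o, v in status.items() if v not in allowed}
    if bad:
        raise ValueError(f"non-allowed statuses: {bad}")
    return sum(status[o] for o in OBSERVABLES) / len(OBSERVABLES)
```

The strict validation reflects the framework's audit stance: an unscored observable is a gap in the evidence surface, not an implicit zero.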

    %──────────────────────────────────────────────────────────────────────────────
    \section{Validation Layer}
    %──────────────────────────────────────────────────────────────────────────────

    A valid EIMT v1.4 runtime analysis must identify:

    \begin{enumerate}
    \item domain,
    \item episode schema,
    \item memory-store interface,
    \item metric manifest,
    \item implementation contract,
    \item retrieval operator,
    \item drift-gate threshold,
    \item fallback behavior,
    \item simulation-labeling rule,
    \item replay-efficacy metric,
    \item baseline harness,
    \item concrete benchmark task schema,
    \item primary metric priorities,
    \item machine-readable metric output,
    \item runtime logs,
    \item result ledger,
    \item evidence package,
    \item implementation-readiness class,
    \item falsification conditions,
    \item downgrade path,
    \item memory-promotion candidates.
    \end{enumerate}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Falsification Surface}
    %──────────────────────────────────────────────────────────────────────────────

    EIMT v1.4 is weakened or rejected if:

    \begin{itemize}
    \item no runnable minimal kernel exists for a runtime claim,
    \item no implementation contract is declared,
    \item no concrete benchmark task is provided,
    \item no episode schema is defined,
    \item no metric manifest is declared,
    \item context binding is absent,
    \item retrieval is not cue-dependent,
    \item high-drift retrieval is returned as fact,
    \item source fallback is missing for source-sensitive retrieval,
    \item replay increases drift while being called stabilization,
    \item constructive simulation is treated as recovered memory,
    \item no baseline harness is run,
    \item baselines are evaluated on different task data,
    \item benchmark metrics are selected after results,
    \item metric output is not machine-readable,
    \item logs or result ledgers are absent,
    \item evidence package is incomplete,
    \item vector-only or database-only baselines perform equally well or better,
    \item agent memory is equated with human autonoetic memory,
    \item clinical claims are made without clinical evidence,
    \item benchmark success is treated as biological proof,
    \item coherence is treated as truth.
    \end{itemize}

    Compact falsification condition:

    \[
    \text{EIMT-A runtime claim}
    \wedge
    \left(
    I=0
    \vee
    U=0
    \vee
    \Psi=0
    \vee
    \Xi=0
    \vee
    J=0
    \vee
    Y=0
    \vee
    K=0
    \vee
    D=0
    \vee
    N=0
    \vee
    X=0
    \right)
    \Rightarrow
    \text{invalid strong runtime classification}.
    \]
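The compact condition can be checked mechanically against a scored observable table. A minimal sketch, assuming the same dictionary keying as the scoring table (the key names are an illustrative convention, not normative):

```python
# Observables whose absence (status 0) invalidates a strong EIMT-A
# runtime classification, per the compact falsification condition.
STRONG_CLAIM_GATES = ("I", "U", "Psi", "Xi", "J", "Y", "K", "D", "N", "X")

def strong_runtime_claim_valid(status: dict) -> bool:
    """False when any gating observable is absent (scored 0).

    A missing key is treated as 0, i.e. an undeclared observable
    falsifies the strong claim rather than passing by omission.
    """
    return all(status.get(gate, 0) > 0 for gate in STRONG_CLAIM_GATES)
```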

    %──────────────────────────────────────────────────────────────────────────────
    \section{Upgrade and Downgrade Thresholds}
    %──────────────────────────────────────────────────────────────────────────────

    A candidate may be considered for EIMT-A only if:

    \[
    \mathrm{EIMTScore}_{v1.4}=1,
    \]

    and runtime evidence shows that the EIMT implementation outperforms declared
    baselines on primary benchmark metrics without violating non-claim locks.

    A candidate should be classified as EIMT-B if:

    \[
    \mathrm{EIMTScore}_{v1.4}<1
    \]

    but multiple episodic invariants remain useful and partially supported.

    A candidate should be classified as EIMT-C if a simpler non-episodic memory
    model explains the behavior or performs equally well.

    A candidate should be classified as EIMT-D if runtime evidence is insufficient.

    A candidate should be classified as EIMT-E if the claim is overextended,
    unmeasured, clinically unsupported, benchmark-unsupported, source-free,
    implementation-free, task-free, or dependent on coherence rather than evidence.
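The threshold rules above can be sketched as an ordered, downgrade-first decision procedure. The boolean inputs below are illustrative summaries of richer checks (baseline harness results, evidence packages, non-claim locks), not literal specification fields:

```python
def classify_eimt(score: float,
                  outperforms_baselines: bool,
                  simpler_model_suffices: bool,
                  evidence_sufficient: bool,
                  overclaim_detected: bool) -> str:
    """Downgrade-first sketch of the EIMT-A..E thresholds.

    Checks run from strongest disqualifier to weakest, so a perfect
    score cannot rescue an overclaimed or evidence-free candidate.
    """
    if overclaim_detected:
        return "EIMT-E"          # overextended / coherence-only claim
    if not evidence_sufficient:
        return "EIMT-D"          # runtime evidence insufficient
    if simpler_model_suffices:
        return "EIMT-C"          # non-episodic baseline explains behavior
    if score == 1 and outperforms_baselines:
        return "EIMT-A"          # full score plus baseline wins
    return "EIMT-B"              # partial but useful episodic support
```

Ordering the disqualifiers first mirrors the downgrade discipline: EIMT-E and EIMT-D are checked before any upgrade to EIMT-A is possible.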

    %──────────────────────────────────────────────────────────────────────────────
    \section{Repository Record Grammar}
    %──────────────────────────────────────────────────────────────────────────────

    A repository-ready EIMT v1.4 project should preserve minimal kernel code,
    task schemas, baselines, benchmark runs, metrics, evidence packages, and result
    ledgers.

\begin{verbatim}
eimt_reference_kernel/
  README.md
  docs/
    theory/
      eimt_v1_4.tex
      source_fidelity.md
      invariants.md
      quick_start.md
    implementation_contract/
      module_contracts.md
      runtime_interfaces.md
      failure_taxonomy.md
    benchmark_protocol/
      task_schemas.md
      baseline_harness.md
      metric_emission.md
      falsification_surface.md
  src/
    eimt/
      episode.py
      memory_store.py
      metric_manifest.py
      retrieval_engine.py
      drift_gate.py
      source_fallback.py
      replay_evaluator.py
      simulation_guard.py
      baseline_harness.py
      benchmark_runner.py
      scoring.py
      evidence_package.py
  configs/
    metric_manifest.json
    runtime_config.json
    baseline_config.json
  tasks/
    source_recall_001.json
    boundary_separation_001.json
    context_shift_001.json
    long_horizon_001.json
    replay_compression_001.json
    constructive_planning_001.json
  runs/
    run_<timestamp>/
      episode_log.jsonl
      retrieval_log.jsonl
      fallback_log.jsonl
      replay_log.jsonl
      simulation_log.jsonl
      baseline_results.json
      benchmark_metrics.json
      drift_metrics.json
      metric_emission.json
      classification.json
      evidence_package.json
      result_ledger.jsonl
  evidence/
    raw_inputs/
    processed_outputs/
    negative_controls/
    benchmark_packages/
  ledgers/
    eimt_evolution_ledger.jsonl
    eimt_runtime_ledger.jsonl
    eimt_decision_ledger.jsonl
  memory/
    promoted_invariants.md
    rejected_overclaims.md
    runtime_failure_lessons.md
\end{verbatim}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Minimal EIMT v1.4 Runtime Evidence JSON Skeleton}
    %──────────────────────────────────────────────────────────────────────────────

\begin{verbatim}
{
  "record_id": "EIMT-RUN-0001",
  "version": "EIMT-v1.4",
  "runtime_name": "",
  "domain": "agent_memory",
  "implementation_readiness": "EIMT-R0/R1/R2/R3/R4/R5",
  "episode_schema": {
    "context": "",
    "content": "",
    "time": "",
    "self_or_agent_state": "",
    "source_ref": "",
    "ledger_ref": "",
    "fingerprint": ""
  },
  "implementation_contract": {
    "episode": true,
    "memory_store": true,
    "metric_manifest": true,
    "retrieval_engine": true,
    "drift_gate": true,
    "source_fallback": true,
    "replay_evaluator": true,
    "simulation_guard": true,
    "baseline_harness": true,
    "benchmark_runner": true,
    "scoring": true,
    "evidence_package": true
  },
  "metric_manifest": {
    "context_distance": "",
    "content_distance": "",
    "time_distance": "",
    "state_distance": "",
    "fingerprint_distance": "",
    "weights": {}
  },
  "benchmark_task": {
    "task_family": "",
    "task_id": "",
    "task_schema_valid": false,
    "ground_truth_ref": "",
    "baseline_family": []
  },
  "metric_emission": {
    "machine_readable": true,
    "primary_metrics_declared_before_run": true,
    "primary_metrics": {},
    "secondary_metrics": {},
    "diagnostic_metrics": {}
  },
  "baseline_results": [],
  "drift_report": {
    "fast_drift": null,
    "slow_drift": null,
    "semantic_drift": null,
    "fingerprint_drift": null,
    "normalized_total_drift": null
  },
  "failure_taxonomy": {
    "schema_failure": false,
    "source_failure": false,
    "drift_failure": false,
    "boundary_failure": false,
    "fallback_failure": false,
    "replay_failure": false,
    "simulation_failure": false,
    "baseline_failure": false,
    "metric_failure": false,
    "ledger_failure": false,
    "overclaim_failure": false
  },
  "EIMTScore_v1_4": null,
  "classification": "",
  "downgrade_path": "",
  "falsification_note": "",
  "memory_promotion": {
    "promote": false,
    "items": [],
    "reason": ""
  },
  "non_claim_locks": [
    "not_clinical_guidance",
    "not_universal_mechanism",
    "not_ai_equals_human_memory",
    "coherence_not_truth",
    "simulation_not_biological_proof",
    "benchmark_success_not_human_memory_proof",
    "runtime_success_not_universal_mechanism_proof",
    "minimal_kernel_not_production_memory"
  ]
}
\end{verbatim}
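Before a record of this shape is scored, its structural completeness can be checked against the skeleton's top-level keys. A minimal sketch; the key list is taken directly from the skeleton above, and deeper per-field validation is deliberately out of scope:

```python
import json

# Top-level keys of the runtime-evidence skeleton.
REQUIRED_KEYS = (
    "record_id", "version", "runtime_name", "domain",
    "implementation_readiness", "episode_schema",
    "implementation_contract", "metric_manifest", "benchmark_task",
    "metric_emission", "baseline_results", "drift_report",
    "failure_taxonomy", "EIMTScore_v1_4", "classification",
    "downgrade_path", "falsification_note", "memory_promotion",
    "non_claim_locks",
)

def missing_evidence_keys(record_json: str) -> list:
    """Return the top-level skeleton keys absent from a runtime record.

    An empty result means the record is structurally complete; it says
    nothing about whether the values inside constitute valid evidence.
    """
    record = json.loads(record_json)
    return [k for k in REQUIRED_KEYS if k not in record]
```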

    %──────────────────────────────────────────────────────────────────────────────
    \section{Appendix A — Minimal EIMT v1.4 Runtime Checklist}
    %──────────────────────────────────────────────────────────────────────────────

    \begin{enumerate}
    \item Is there a runnable minimal kernel?
    \item Is the implementation contract declared?
    \item Is the episode schema declared?
    \item Is source metadata preserved?
    \item Is the memory-store interface implemented?
    \item Is the metric manifest declared?
    \item Is retrieval cue-dependent?
    \item Is retrieval drift measured?
    \item Is high-drift retrieval gated?
    \item Is source fallback implemented?
    \item Is constructive simulation labeled?
    \item Is replay efficacy measured?
    \item Are baselines implemented?
    \item Are concrete benchmark tasks declared?
    \item Are task schemas valid?
    \item Are primary metrics declared before interpretation?
    \item Are metrics emitted in machine-readable form?
    \item Are runtime logs preserved?
    \item Is an evidence package compiled?
    \item Does EIMT outperform baselines on declared primary metrics?
    \item What implementation-readiness class applies?
    \item What falsifies the runtime claim?
    \item What downgrade class applies?
    \item What, if anything, is memory-promotable?
    \item Are clinical, biological-proof, AI-equivalence, and universal-mechanism
    locks preserved?
    \end{enumerate}

    %──────────────────────────────────────────────────────────────────────────────
    \section{Appendix B — Minimal Reference Kernel Pseudocode}
    %──────────────────────────────────────────────────────────────────────────────

\begin{verbatim}
Input:
  task_json
  metric_manifest
  runtime_config
  baseline_config

Initialize:
  validate task schema
  load episodes
  build memory store
  load metric manifest
  initialize retrieval engine
  initialize drift gate
  initialize source fallback
  initialize replay evaluator
  initialize simulation guard
  initialize baseline harness
  initialize evidence compiler

For each query in task:
  retrieve candidates from memory store
  compute distances using metric manifest
  compute retrieval drift
  compute omega = 1 / (1 + |retrieval_drift|)

  if retrieval_drift exceeds threshold:
    invoke source fallback:
      abstain / ask context / return uncertainty / source-check
    log fallback event
  else:
    return candidate with:
      episode id
      source ref
      confidence
      drift report
    log retrieval event

For replay task:
  compute drift before replay
  apply bounded replay or summary
  compute drift after replay
  gamma_rho = drift_before - drift_after
  classify replay:
    stabilizing / neutral / destabilizing / transformation-only

For planning task:
  generate plan from retrieved episodes
  label output as simulation
  prevent classification as recovered memory
  compute simulation drift

Run baselines:
  database lookup
  vector-only retrieval
  semantic-only retrieval
  ungated episodic retrieval
  summary-only memory
  random control

Score:
  compute primary metrics
  compute secondary metrics
  compute diagnostic metrics
  compare EIMT runtime against baselines
  compute EIMTScore_v1_4
  assign implementation readiness class
  classify EIMT-A/B/C/D/E

Compile:
  runtime logs
  baseline results
  metric emission
  drift report
  failure taxonomy
  evidence package
  result ledger

Promote to memory only:
  reproducible benchmark wins
  validated implementation constraints
  reusable failure lessons
  stable drift thresholds
\end{verbatim}
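The per-query gate in the pseudocode above can be made concrete in a few lines. A hedged sketch, assuming drift has already been computed from the metric manifest; the threshold default, fallback action default, and result-dictionary keys are illustrative choices, not specification requirements:

```python
def gated_retrieve(query_drift: float,
                   candidate: dict,
                   threshold: float = 0.5,
                   fallback_action: str = "abstain") -> dict:
    """One step of the drift-gated retrieval loop.

    omega = 1 / (1 + |drift|) is the episodic confidence weight; when
    drift exceeds the declared threshold the kernel falls back rather
    than returning the reconstruction as fact.
    """
    omega = 1.0 / (1.0 + abs(query_drift))
    if abs(query_drift) > threshold:
        return {
            "status": "fallback",
            # abstain / ask / uncertain / source-check / audit
            "action": fallback_action,
            "confidence": omega,
            "drift": query_drift,
        }
    return {
        "status": "retrieved",
        "episode_id": candidate.get("episode_id"),
        "source_ref": candidate.get("source_ref"),
        "confidence": omega,
        "drift": query_drift,
    }
```

Note that the fallback branch still reports the confidence weight and drift value: a gated refusal is itself a logged event, not a silent failure.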

    %──────────────────────────────────────────────────────────────────────────────
    \section{Appendix C — Canonical Formula Summary}
    %──────────────────────────────────────────────────────────────────────────────

    \[
    E=(c,x,t,s)
    \]

    \[
    E_i^{min}
    =
    (c_i,x_i,t_i,s_i,\sigma_i,\ell_i,h_i)
    \]

    \[
    \mathcal{K}_{EIMT}
    =
    \{
    E,
    M,
    D,
    R,
    G,
    A,
    P,
    S,
    L,
    B,
    Q,
    Y
    \}
    \]

    \[
    \mathcal{D}_{manifest}
    =
    \{d_c,d_x,d_t,d_s,d_H,w_c,w_x,w_t,w_s,w_H\}
    \]

    \[
    \Omega^{episodic}_k
    =
    \frac{1}{1+|\Delta\Phi^{retrieval}_k|}
    \]

    \[
    \mathcal{R}^{gated}(q,\mathcal{M})
    =
    \Omega^{episodic}_k\mathcal{R}(q,\mathcal{M})
    +
    (1-\Omega^{episodic}_k)\mathcal{A}(q)
    \]

    \[
    \mathcal{A}(q)
    \in
    \{
    \text{abstain},
    \text{ask},
    \text{uncertain},
    \text{source-check},
    \text{audit}
    \}
    \]

    \[
    \Gamma_{\rho}
    =
    \Delta\Phi^{episodic}_{pre}
    -
    \Delta\Phi^{episodic}_{post}
    \]
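The replay-efficacy difference can be computed and labeled directly from pre- and post-replay drift. A minimal sketch: the neutrality tolerance \texttt{tol} is an assumption, and the transformation-only label is omitted because it requires signals beyond the two drift values:

```python
def replay_efficacy(drift_pre: float, drift_post: float,
                    tol: float = 1e-9) -> tuple:
    """Gamma_rho = pre-replay drift minus post-replay drift.

    Positive Gamma_rho means replay reduced drift (stabilizing);
    negative means replay increased drift (destabilizing), which
    must not be reported as stabilization.
    """
    gamma_rho = drift_pre - drift_post
    if gamma_rho > tol:
        label = "stabilizing"
    elif gamma_rho < -tol:
        label = "destabilizing"
    else:
        label = "neutral"
    return gamma_rho, label
```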

    \[
    \mathcal{T}_{EIMT}
    =
    \{
    T_{source},
    T_{boundary},
    T_{context},
    T_{long},
    T_{replay},
    T_{planning}
    \}
    \]

    \[
    \mathcal{L}_{memory}
    =
    \{
    L_{db},
    L_{vec},
    L_{sem},
    L_{ungated},
    L_{summary},
    L_{random}
    \}
    \]

    \[
    \mathrm{EIMTScore}_{v1.4}
    =
    \frac{
    S+F+E+C+B+R+K+P+L+T+D+H+N+V+X+G+M+Q+Z+W+A+I+J+Y+U+\Psi+\Xi
    }{27}
    \]

    %──────────────────────────────────────────────────────────────────────────────
    \section{Concluding Compression}
    %──────────────────────────────────────────────────────────────────────────────

    EIMT v1.4 names the minimal implementation-ready form of episodic memory
    invariance:

    \[
    \boxed{
    \text{an episodic-memory framework becomes implementation-ready only when}
    \atop
    \text{its smallest valid kernel can run concrete tasks, compare baselines,}
    \atop
    \text{emit metrics, preserve logs, and compile evidence packages.}
    }
    \]

    The implementer statement is:

    \[
    \boxed{
    \text{an EIMT runtime must store source-bound episodes, retrieve by cue,}
    \atop
    \text{measure drift, gate uncertainty, invoke source fallback, label}
    \atop
    \text{simulation, test replay, and refuse high-drift reconstruction as fact.}
    }
    \]

    The benchmark statement is:

    \[
    \boxed{
    \text{benchmarks become meaningful only when tasks contain concrete episodes,}
    \atop
    \text{queries, ground truth, distractors, baselines, metric rules, and}
    \atop
    \text{machine-readable outputs.}
    }
    \]

    The evidence statement is:

    \[
    \boxed{
    \text{execution without logs is not evidence;}
    \quad
    \text{benchmarks without baselines are not strong support;}
    \quad
    \text{tasks without ground truth are not benchmark tasks.}
    }
    \]

    The philosophical statement remains:

    \[
    \boxed{
    \text{episodic coherence is not perfect recall and not fiction;}
    \quad
    \text{it is bounded reconstructive stability.}
    }
    \]

    Thus, EIMT v1.4 upgrades EIMT from reference implementation specification to
    minimal runnable-kernel and concrete benchmark-task-suite layer while preserving
    source fidelity, clinical caution, AI-human distinction, falsification,
    negative controls, downgrade discipline, implementation auditability, and
    non-universal mechanism boundaries.

    \end{document}