This document specifies a production-oriented implementation plan for a memory orchestration layer that sits in front of Mnemon and improves how agents retrieve, refine, and use graph memory. Mnemon should remain the persistent memory backend, while the new component acts as a query planner, retrieval orchestrator, re-ranker, and context packager for downstream agents.(Mnemon GitHub)(MCP architecture spec)
The recommended implementation target is an MCP server with a thin CLI for local debugging and batch experiments. MCP is designed around isolated, composable servers that expose focused capabilities to clients, which maps well to a memory orchestration service.(MCP architecture spec)(MCP memory server example)
Build a standalone service that helps agents use Mnemon more effectively, without modifying Mnemon core in the first implementation. The system should improve recall quality, reduce irrelevant retrievals, support iterative memory expansion, and return compact, task-shaped context blocks suitable for agent consumption.(Mnemon GitHub)(Pragmatic Leader: Architecting Memory with MCP Servers)
Graph memory systems are good at persistent storage and graph-structured recall, but agents still need help deciding what to ask for, how broadly to expand, which memories to trust, and how to compress retrieved results into actionable context. A separate orchestrator can provide these capabilities while keeping the storage and retrieval substrate stable and reusable.(Mnemon GitHub)(Pragmatic Leader: Architecting Memory with MCP Servers)(Mnemis paper)
- Keep Mnemon as the source of truth for memory persistence and base retrieval.(Mnemon GitHub)
- Keep orchestration logic outside Mnemon until value is proven by benchmarks and production usage.(Mnemon GitHub)
- Expose capabilities as MCP tools first, because this makes the system easy to plug into agent runtimes that support MCP.(MCP architecture spec)(MCP memory server example)
- Preserve explainability by returning retrieval traces, scores, and reasons for inclusion when possible.(Pragmatic Leader: Architecting Memory with MCP Servers)
- Prefer deterministic retrieval pipelines before experimenting with learned or highly adaptive routing.
Primary users are developers building agent systems that need persistent memory, graph recall, and better memory utilization. Secondary users are researchers experimenting with retrieval orchestration, graph expansion, and memory-aware agent loops.(Mnemon GitHub)(Pragmatic Leader: Architecting Memory with MCP Servers)
- MCP server exposing memory orchestration tools.
- Thin CLI that calls the same internal orchestration library.
- Mnemon adapter module for querying and writing memory.
- Query rewrite and decomposition.
- Multi-stage retrieval pipeline.
- Optional graph neighborhood expansion.
- Re-ranking and deduplication.
- Context packaging for agents.
- Basic observability and evaluation harness.
- Replacing Mnemon storage.
- Forking Mnemon in v1.
- Training a custom neural model in v1.
- Building a full autonomous planner.
- UI beyond basic CLI diagnostics.
The component should be implemented as a wrapper in front of Mnemon rather than as a direct patch to Mnemon core. That allows independent iteration, benchmarking, rollback, and support for alternate memory backends later if desired.(Mnemon GitHub)(MCP architecture spec)
Agent / MCP Client
|
v
Memory Orchestrator MCP Server
- Query planner
- Retrieval pipeline
- Graph expander
- Re-ranker
- Context packager
- Trace/logger
|
v
Mnemon Adapter
|
v
Mnemon backend / storage
The orchestration server should expose stable tool interfaces while hiding internal retrieval complexity. Internally it should be built as a modular pipeline so each stage can be independently tested and replaced.(Mnemon GitHub)(MCP architecture spec)
The MCP architecture defines hosts, clients, and focused servers, which makes it a natural fit for a memory service that agents can call as a tool. Building this as MCP first makes the service reusable across multiple agent environments rather than coupling it to one custom runtime.(MCP architecture spec)
The agent asks for memory relevant to a current task. The orchestrator rewrites the query, retrieves candidates from Mnemon, re-ranks them, and returns a compact context package.
The agent asks about a person, project, or concept. The orchestrator retrieves seed memories, expands through graph neighbors, filters noisy branches, and returns a compressed set of linked evidence.
The agent asks what happened recently for an entity or topic. The orchestrator retrieves related memories over a time window, groups them, deduplicates them, and emits a chronological summary block.
The agent wants to write new memory. The orchestrator checks for near-duplicates, links the new memory to existing nodes if appropriate, and forwards the write to Mnemon.
The system must expose an MCP server with tools for retrieval, memory write, memory explain, and diagnostics. The server must be stateless at the protocol level, with request-scoped execution and optional shared caches behind the scenes.(MCP architecture spec)(MCP memory server example)
Required tools:
- memory_search
- memory_expand
- memory_context
- memory_write
- memory_explain
- memory_health
The system must include a CLI that exercises the same internal modules as the MCP server. The CLI should support debugging pipelines locally, replaying queries from logs, and running evaluation batches.
Example commands:
orchestrator search --query "what do we know about vendor X?"
orchestrator context --task "draft reply to customer" --query "recent issues with vendor X"
orchestrator explain --query "project atlas blockers"
orchestrator eval --dataset eval/queries.jsonl
The system must isolate all Mnemon interactions behind an adapter interface. This adapter should support candidate retrieval, graph neighbor expansion, entity lookup, memory write, and metadata fetch so the rest of the orchestrator is not tightly coupled to Mnemon internals.(Mnemon GitHub)
Suggested interface:
from typing import Protocol

class MemoryBackend(Protocol):
    def search(self, query, top_k, filters=None): ...
    def neighbors(self, node_ids, edge_types=None, hops=1, limit=50): ...
    def get_memories(self, memory_ids): ...
    def write(self, memory_record): ...
    def link(self, source_id, target_id, relation): ...
    def health(self): ...

The orchestrator must transform the raw agent request into retrieval-ready subqueries. This includes normalization, entity extraction, intent detection, optional time-window inference, and decomposition into parallel retrieval steps.
Inputs:
- user task
- optional conversation context
- optional filters
Outputs:
- canonical query
- subqueries
- retrieval strategy
- hop budget
- confidence estimate
The retrieval pipeline must support at least three stages:
- Initial candidate retrieval from Mnemon.
- Optional graph expansion around promising candidates.
- Re-ranking and pruning into a final result set.
The retrieval strategy should be configurable per request. Initial implementation should prefer simple heuristics and weighted scoring over experimental learned routing.(Pragmatic Leader: Architecting Memory with MCP Servers)(Mnemis paper)
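The three stages can be composed as one function over any `MemoryBackend`-shaped object. This is a sketch under assumed dict-shaped candidates with `memory_id` and `score` keys; the final sort is a stand-in for the weighted reranker.

```python
def run_pipeline(backend, plan, top_k=10, expand=False):
    """Three-stage retrieval: candidates -> optional expansion -> rerank/prune.

    `backend` is any object exposing MemoryBackend-style search/neighbors
    methods; candidate shape ({"memory_id", "score"}) is an assumption.
    """
    # Stage 1: initial candidate retrieval, one backend call per subquery.
    candidates = []
    for sq in plan.subqueries or [plan.canonical_query]:
        candidates.extend(backend.search(sq, top_k=top_k))

    # Stage 2: optional graph expansion around the strongest seeds.
    if expand:
        seed_ids = [c["memory_id"] for c in candidates[:3]]
        candidates.extend(backend.neighbors(seed_ids, hops=1))

    # Stage 3: dedupe by id, rerank by score, prune to top_k.
    unique = {c["memory_id"]: c for c in candidates}
    ranked = sorted(unique.values(), key=lambda c: c.get("score", 0.0), reverse=True)
    return ranked[:top_k]
```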
The orchestrator must score candidates using a transparent weighted formula. The formula should combine relevance, recency, graph proximity, memory type, and duplication penalties.
Suggested v1 scoring model:
score =
w_relevance * semantic_score +
w_recency * recency_score +
w_graph * graph_proximity_score +
w_type * type_match_score -
w_dup * duplication_penalty -
w_noise * noise_penalty
Weights must be configurable. The system must return per-item score breakdowns in explain mode.
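A direct translation of the v1 formula, returning both the total and the per-term breakdown required by explain mode. The weight values are illustrative defaults, not tuned recommendations.

```python
# Illustrative default weights; real values come from configuration.
DEFAULT_WEIGHTS = {
    "relevance": 0.45, "recency": 0.20, "graph": 0.15,
    "type": 0.10, "dup": 0.25, "noise": 0.15,
}

def score_candidate(signals, weights=DEFAULT_WEIGHTS):
    """Apply the v1 weighted formula; return (score, per-term breakdown)."""
    breakdown = {
        "relevance": weights["relevance"] * signals.get("semantic_score", 0.0),
        "recency":   weights["recency"]   * signals.get("recency_score", 0.0),
        "graph":     weights["graph"]     * signals.get("graph_proximity_score", 0.0),
        "type":      weights["type"]      * signals.get("type_match_score", 0.0),
        "dup":      -weights["dup"]       * signals.get("duplication_penalty", 0.0),
        "noise":    -weights["noise"]     * signals.get("noise_penalty", 0.0),
    }
    return sum(breakdown.values()), breakdown
```

Returning the breakdown alongside the total means explain mode gets per-item score attribution for free.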
The orchestrator must convert the final memory set into a compact context object optimized for agent consumption. This object should include concise summaries, source IDs, entity links, and optional chronological or topical grouping.
Suggested output schema:
{
"query": "...",
"strategy": "direct|expanded|episodic",
"summary": "short synthesized memory context",
"items": [
{
"memory_id": "...",
"summary": "...",
"score": 0.91,
"reasons": ["entity match", "recent", "same project cluster"],
"linked_entities": ["..."],
"timestamp": "..."
}
],
"trace": {
"subqueries": ["..."],
"expanded_from": ["..."],
"dropped": [{"id": "...", "reason": "duplicate"}]
}
}

The system must support an explain mode that shows query rewrites, retrieved candidates, expansion decisions, ranking scores, and drop reasons. This is required for debugging and for tuning retrieval quality.(Pragmatic Leader: Architecting Memory with MCP Servers)
The system must support writing new memory through the orchestrator. Before writing, it should perform duplicate detection and optional relation linking against nearby memories to avoid uncontrolled memory bloat.
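The duplicate gate in the write path can be sketched with a plain string-similarity check. Production use would more likely compare embeddings via Mnemon; `difflib.SequenceMatcher` here just illustrates where the gate sits, and the threshold is an assumed default.

```python
from difflib import SequenceMatcher

def find_near_duplicates(new_text, existing, threshold=0.85):
    """Return (memory_id, similarity) pairs for stored memories whose text is
    nearly identical to the incoming record.

    `existing` is assumed to be a list of {"memory_id", "text"} dicts fetched
    from a candidate search against the backend.
    """
    matches = []
    for mem in existing:
        ratio = SequenceMatcher(None, new_text.lower(), mem["text"].lower()).ratio()
        if ratio >= threshold:
            matches.append((mem["memory_id"], round(ratio, 3)))
    return matches
```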
The system must support YAML or TOML configuration for:
- backend endpoint
- top-k limits
- hop budgets
- scoring weights
- recency decay
- allowed edge types
- deduplication thresholds
- trace verbosity
- cache TTLs
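A minimal YAML sketch covering these keys. All key names and values are illustrative assumptions, not a fixed schema:

```yaml
backend:
  endpoint: "http://localhost:8000"   # Mnemon adapter target (placeholder)
retrieval:
  top_k: 10
  hop_budget: 2
  allowed_edge_types: ["mentions", "part_of", "follows"]
scoring:
  weights: {relevance: 0.45, recency: 0.20, graph: 0.15, type: 0.10, dup: 0.25, noise: 0.15}
  recency_decay_halflife_days: 30
dedup:
  similarity_threshold: 0.85
trace:
  verbosity: "summary"   # summary | full
cache:
  ttl_seconds: 300
```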
- P50 memory search under 400 ms for direct recall in local deployments.
- P95 under 1.5 s for expanded multi-hop recall.
- Support configurable timeouts and partial-result fallback.
- Graceful degradation if graph expansion fails.
- Retry with bounded backoff for backend transient failures.
- Return partial trace and partial results rather than hard-failing whenever possible.
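The bounded-backoff requirement can be sketched as a small retry wrapper. Catching only `ConnectionError` is illustrative; a real implementation would classify which backend errors are transient, and callers would fall back to partial results plus a trace annotation once retries are exhausted.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.2, max_delay=2.0, sleep=time.sleep):
    """Retry a backend call with bounded exponential backoff.

    Re-raises the last error once attempts are exhausted. `sleep` is
    injectable so tests can run without real delays.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)  # bounded, not unbounded, growth
```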
- Structured logs per request.
- Metrics: latency, recall depth, candidate counts, dedup ratio, expansion rate, cache hit rate.
- Query replay support from stored traces.
- No uncontrolled shell execution.
- Validate and sanitize all incoming tool parameters.
- Support access control if deployed as a shared service.
memory_search
Purpose: retrieve relevant memory candidates.
Input:
{
"query": "string",
"task": "optional string",
"filters": {"entity_ids": [], "time_range": {}, "memory_types": []},
"top_k": 10,
"expand": false
}

Output:
{
"items": [...],
"summary": "...",
"trace_id": "..."
}

memory_context
Purpose: produce a compact agent-ready context package.
Input:
{
"query": "string",
"task": "string",
"response_budget": {"max_items": 8, "max_chars": 3000},
"filters": {}
}

Output:
{
"summary": "...",
"items": [...],
"context_block": "...",
"trace_id": "..."
}

memory_expand
Purpose: expand around memory or entity seeds using graph neighbors.
memory_write
Purpose: store memory with deduplication and optional linking.
memory_explain
Purpose: return debug trace for a prior request or fresh query execution.
memory_health
Purpose: verify backend connectivity and basic retrieval readiness.
Recommended package structure:
memory_orchestrator/
app/
mcp_server.py
cli.py
core/
planner.py
retriever.py
expander.py
reranker.py
deduper.py
packager.py
explainer.py
backends/
mnemon_adapter.py
base.py
models/
schemas.py
config/
settings.py
eval/
runner.py
datasets/
tests/
Recommended stack:
- Python 3.11+
- FastMCP or another MCP server framework compatible with the target runtime
- Pydantic for schemas
- Typer or Click for CLI
- httpx for backend calls
- pytest for tests
- structlog or standard JSON logging for traces
Implement a deterministic orchestration layer first. Do not implement the complex-valued neural routing concept in v1. The first milestone should prove that a structured wrapper already improves Mnemon-backed retrieval.
Deliverables:
- MCP server
- CLI
- Mnemon adapter
- query planner
- retrieval pipeline
- heuristic reranker
- context packager
- explain mode
- tests and example config
Implement offline evaluation to compare:
- raw Mnemon retrieval
- orchestrated retrieval without expansion
- orchestrated retrieval with expansion
Metrics:
- precision@k
- recall@k
- nDCG@k
- duplicate rate
- average context length
- user-rated usefulness if a labeled dataset exists
Only after deterministic improvements are measured should the system add an experimental adaptive controller inspired by the proposed “wrapper magic.” This controller may adjust query expansion, reranking, or retrieval routing, but it must remain optional and behind a feature flag.
The controller should not replace Mnemon and should not be required for baseline operation.
Suggested scoring inputs:
- semantic similarity from Mnemon or embedding score
- entity overlap with extracted query entities
- relation overlap with requested task type
- recency decay for time-sensitive queries
- graph distance penalty for far expansions
- source reliability or confidence metadata if available
- duplicate and near-duplicate suppression
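The recency-decay input can be made concrete with a half-life formulation, matching the `recency decay` configuration item. The 30-day half-life is an assumed tunable, not a recommendation.

```python
def recency_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential recency decay: 1.0 for a fresh memory, 0.5 at one half-life.

    Feeds the recency term of the weighted scoring formula; half_life_days
    would come from configuration for time-sensitive query types.
    """
    return 0.5 ** (age_days / half_life_days)
```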
The agent-ready context block should:
- be concise
- include only high-value items
- preserve identifiers for traceability
- group by entity, time, or task when useful
- avoid dumping raw memory text when a synthesized summary is enough
Example format:
Memory context for task: draft vendor update
Relevant entities:
- Vendor X
- Project Atlas
- Infra team
Key recalled facts:
1. Vendor X missed the March delivery milestone and cited firmware instability.
2. Atlas depends on the delayed module integration.
3. Infra team escalated procurement risk in two recent episodes.
Supporting memory IDs:
- mem_1021
- mem_1044
- mem_1092
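A packager rendering this format can be a straightforward string builder. The function name and argument shapes are illustrative; in practice the entities, facts, and IDs would come from the reranked item set.

```python
def build_context_block(task, entities, facts, memory_ids):
    """Render the agent-facing context block in the plain-text format above."""
    lines = [f"Memory context for task: {task}", "", "Relevant entities:"]
    lines += [f"- {e}" for e in entities]
    lines += ["", "Key recalled facts:"]
    lines += [f"{i}. {fact}" for i, fact in enumerate(facts, 1)]
    lines += ["", "Supporting memory IDs:"]
    lines += [f"- {mid}" for mid in memory_ids]
    return "\n".join(lines)
```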
The orchestrator must avoid leaking Mnemon-specific implementation details into MCP responses except for stable identifiers and optional backend metadata. This keeps the wrapper portable and reduces coupling to Mnemon internals.(Mnemon GitHub)
The implementation must handle:
- backend unavailable
- empty retrieval set
- excessively broad graph expansion
- duplicate explosion
- stale or contradictory memories
- oversized context results
Fallback behavior should prefer smaller, safer outputs and explicit trace annotations.
The v1 implementation is acceptable when all of the following are true:
- An agent can call the MCP server to retrieve memory context backed by Mnemon.(Mnemon GitHub)(MCP architecture spec)
- A CLI can run the same retrieval path locally.
- Query planning, expansion, reranking, and packaging are modular and tested.
- Explain mode shows why memories were selected or dropped.
- Benchmark runs show equal or better precision/utility than direct Mnemon retrieval on a representative eval set.
- The system can be disabled and agents can still fall back to raw Mnemon retrieval.
1. Basic MCP server + Mnemon adapter + direct retrieval + packaging.
2. Query decomposition + explain mode + deduplication + CLI.
3. Graph expansion + reranking + evaluation harness.
4. Feature-flagged adaptive controller experiments.
5. Decide whether any proven retrieval features should be upstreamed into Mnemon core.
- Wrapper in front of Mnemon rather than a core patch: faster iteration, lower coupling, easier benchmarking, safer rollback, and better alignment with MCP’s composable server model.(Mnemon GitHub)(MCP architecture spec)
- MCP server plus thin CLI: MCP is the runtime interface for agents, while the CLI is the developer interface for debugging and evaluation.(MCP architecture spec)(MCP memory server example)
- Deterministic pipeline first: retrieval quality should be improved with transparent orchestration before adding experimental adaptive logic.
Implement a Python project called mnemon-memory-orchestrator that exposes an MCP server and a CLI. Use Mnemon as the backend through an adapter layer. Build a deterministic retrieval orchestrator with query planning, optional graph expansion, heuristic reranking, deduplication, context packaging, explain mode, configuration, tests, and an evaluation harness. Keep the code modular so that an experimental adaptive controller can be added later behind a feature flag. Do not modify Mnemon core in v1. Provide a runnable local setup, example config, and integration tests for the main MCP tools.(Mnemon GitHub)(MCP architecture spec)