Multi-Hop Retrieval and Query Planning for Complex Agentic Search

Complex questions rarely fail because the model cannot write. They fail because the system treats retrieval like a single search call when the task actually requires staged evidence gathering. If a user asks, "Which vendors with open security exceptions also appear in renewal contracts expiring next quarter, and what precedent do we have for approving them?" one vector lookup is the wrong unit of work.

Multi-hop retrieval turns that request into an evidence workflow. The system decomposes the question, plans subqueries, selects the right retrieval tools for each step, decides when enough evidence has been gathered, and prevents the workflow from ballooning into an expensive search tree.

1. Start with decomposition, not embedding

The first mistake in agentic search is to embed the whole user query and hope the index understands the hidden structure. Complex questions usually contain several jobs mixed together:

  • identify entities
  • resolve relationships between those entities
  • fetch supporting facts from different sources
  • reconcile conflicting evidence
  • produce an answer with bounded confidence

The system should first classify the request into an execution shape. In practice, most multi-hop questions fall into one or more of these patterns:

  • entity expansion: find all items related to a seed entity
  • constraint joining: intersect results from several filters or systems
  • causal or temporal tracing: follow a sequence across time or state changes
  • comparative synthesis: retrieve evidence for several alternatives before ranking
  • coverage completion: keep searching until required evidence slots are filled

The planner should convert the user request into explicit evidence requirements before retrieval starts. If you skip that step, the system pays for repeated broad searches and still misses the dependency chain.
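As a sketch, the decomposition step can emit explicit evidence requirements before any retrieval runs. The `EvidenceSlot` shape and the hard-coded slots below are illustrative assumptions, not a prescribed schema; in practice an LLM or classifier would produce them.

```python
from dataclasses import dataclass

@dataclass
class EvidenceSlot:
    name: str           # what the answer needs, e.g. "expiring_contracts"
    pattern: str        # which execution shape above it falls under
    source_class: str   # class of system expected to satisfy it
    required: bool = True

def plan_evidence(question: str) -> list[EvidenceSlot]:
    # Hard-coded here for the vendor-renewal question from the
    # introduction; a real planner derives these from the request.
    return [
        EvidenceSlot("expiring_contracts", "constraint_joining", "structured"),
        EvidenceSlot("open_exceptions", "constraint_joining", "structured"),
        EvidenceSlot("approval_precedents", "comparative_synthesis", "hybrid"),
    ]

slots = plan_evidence("Which vendors with open security exceptions ...")
```

Retrieval only starts once every required slot has an owner in the plan; anything still unowned is a reason to ask a clarifying question instead of searching.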

2. Plan subqueries as a graph, not a list

A good multi-hop plan is usually not a static sequence. Some steps depend on the output of earlier steps, while others can run in parallel. Think of the plan as a small execution graph with typed nodes.

For example:

```json
{
  "goal": "assess whether expiring vendor renewals with open security exceptions were previously approved",
  "steps": [
    {
      "id": "contracts",
      "type": "structured_lookup",
      "source": "contract_system",
      "query": {
        "status": "active",
        "renewal_window": "next_quarter"
      }
    },
    {
      "id": "security",
      "type": "structured_lookup",
      "source": "vendor_risk_system",
      "depends_on": ["contracts"],
      "join_key": "vendor_id",
      "query": {
        "exception_state": "open"
      }
    },
    {
      "id": "precedents",
      "type": "hybrid_search",
      "source": "approval_archive",
      "depends_on": ["security"],
      "query_template": "vendor renewal approval exceptions for {{vendor_name}}"
    },
    {
      "id": "synthesis",
      "type": "reason_over_evidence",
      "depends_on": ["security", "precedents"]
    }
  ]
}
```

Retrieval orchestration should not repeatedly search every source with the full question. Each node should know:

  • what evidence it is trying to produce
  • which source class is appropriate
  • what inputs it needs from previous steps
  • when it should stop or retry

That structure is what separates query planning from prompt chaining.
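The `depends_on` edges in a plan like the one above make execution order mechanical. A minimal sketch using the standard library's `graphlib`, where `run_step` is a placeholder for real tool dispatch:

```python
from graphlib import TopologicalSorter

# node -> list of nodes it depends on, mirroring the "depends_on" fields
plan = {
    "contracts": [],
    "security": ["contracts"],
    "precedents": ["security"],
    "synthesis": ["security", "precedents"],
}

def run_step(step_id, inputs):
    # a real implementation would dispatch on the step's "type" field
    return {"id": step_id, "inputs": sorted(inputs)}

results = {}
for step_id in TopologicalSorter(plan).static_order():
    # each node receives only the outputs it declared a dependency on
    inputs = {dep: results[dep] for dep in plan[step_id]}
    results[step_id] = run_step(step_id, inputs)
```

Independent nodes surface naturally from the same structure: `TopologicalSorter` can also hand back ready batches for parallel execution instead of a flat order.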

3. Match the tool to the evidence class

Not every subquery belongs in semantic search. Multi-hop systems waste money when they use the same retrieval tool for every step. Tool selection should follow the evidence type:

  • use structured lookups for IDs, states, dates, ownership, and workflow fields
  • use keyword or filtered search for exact clauses, product names, policy sections, or regulated terminology
  • use vector search for concept expansion, analogous cases, and fuzzy evidence discovery
  • use hybrid retrieval when both precise filters and semantic similarity matter
  • use web or external search tools only when the answer genuinely depends on outside information

A common failure mode is vector-searching for facts that already exist as clean fields in a system of record. Another is doing expensive live lookups when a static corpus is sufficient. Choose the cheapest tool that can reliably satisfy the evidence requirement.
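A routing table is often enough for this decision. A minimal sketch, where both the evidence-class labels and the tool names are placeholders for whatever clients the system actually wraps:

```python
# ordered roughly cheapest-first, matching the guidance above
ROUTES = [
    ("identifier", "structured_lookup"),
    ("state_or_date", "structured_lookup"),
    ("exact_clause", "keyword_search"),
    ("concept", "vector_search"),
    ("filtered_concept", "hybrid_search"),
    ("external_fact", "web_search"),
]

def select_tool(evidence_class: str) -> str:
    for cls, tool in ROUTES:
        if cls == evidence_class:
            return tool
    # an unroutable class is a planning bug, not a retrieval problem
    raise ValueError(f"no tool registered for {evidence_class!r}")
```

Keeping the table explicit also makes the failure mode auditable: if a trace shows `vector_search` answering a `state_or_date` slot, the router, not the model, is what needs fixing.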

4. Orchestrate retrieval in stages

Multi-hop retrieval works best when each stage narrows the search space for the next stage. A practical orchestration loop looks like this:

  1. extract entities, constraints, and missing variables from the request
  2. create a bounded plan with estimated tool cost per step
  3. run low-cost, high-precision steps first
  4. use those results to parameterize broader semantic retrieval only where needed
  5. normalize all evidence into a shared internal format
  6. decide whether the answer is complete, incomplete, conflicting, or unsafe

The order matters. If you start with broad semantic search, the system often retrieves plausible but weak context before it has resolved the core entities. If you start with exact filters and state lookups, later semantic searches can be scoped to a smaller and more relevant slice of content.

Normalization is also non-optional. Each retrieval step should emit stable evidence objects with source, timestamp, confidence, and join keys. Without that layer, later agents or synthesis steps end up reasoning over raw snippets that are hard to compare and easy to duplicate.
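The normalization layer can be as small as one dataclass carrying the fields named above. This shape is an illustrative assumption, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    source: str        # system of record the artifact came from
    join_key: str      # e.g. vendor_id, used to merge across steps
    payload: str       # the snippet, row, or clause itself
    timestamp: str     # ISO-8601 string from the source system
    confidence: float  # retrieval-side score, not an answer probability

def merge_by_key(items: list[Evidence]) -> dict[str, list[Evidence]]:
    # group evidence from different steps on the shared join key
    merged: dict[str, list[Evidence]] = {}
    for item in items:
        merged.setdefault(item.join_key, []).append(item)
    return merged
```

Because the objects are frozen and keyed, deduplication and cross-source comparison become dictionary operations instead of prompt-level reasoning.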

5. Define stop conditions early

Most cost blowups happen because the system does not know when to stop. It keeps reformulating, broadening, or branching because there is no explicit completion contract.

Each planned step should have at least one stop condition:

  • slot completion: enough evidence has been gathered to fill all required answer fields
  • confidence threshold: the marginal value of another retrieval pass is below a defined threshold
  • result cap: a maximum number of retrieved artifacts or branches has been reached
  • budget cap: token, latency, or tool-spend budget is exhausted
  • safety boundary: the workflow reached a state that requires clarification or human review

You also need stop conditions at the workflow level. For example, "Do not execute more than two reformulation rounds," or "Do not expand more than five entities from any one branch." Without those boundaries, one ambiguous question can fan out into dozens of subqueries.
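The five step-level conditions can be checked as one ordered function. A sketch, with illustrative state fields and a made-up marginal-gain threshold:

```python
def should_stop(state):
    # checked in priority order; returns the triggered condition, or None
    if not state["missing_slots"]:
        return "slot_completion"
    if state["marginal_gain"] < 0.05:       # illustrative threshold
        return "confidence_threshold"
    if state["retrieved"] >= state["result_cap"]:
        return "result_cap"
    if state["spent_tokens"] >= state["token_budget"]:
        return "budget_cap"
    if state["needs_human_review"]:
        return "safety_boundary"
    return None  # keep retrieving
```

Returning the condition name rather than a bare boolean matters: "stopped because slots are full" and "stopped because the budget ran out" should produce very different answers downstream.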

6. Control branching and reformulation

Query reformulation is useful, but it is one of the fastest ways to burn cost. The planner should treat reformulations as a scarce resource, not a default reflex.

Good rules:

  • reformulate only when the current retrieval result is clearly under-specified
  • keep reformulations typed: broaden, narrow, disambiguate, or translate terminology
  • prefer parameterized reformulation from known evidence over free-form brainstorming
  • deduplicate near-identical subqueries before execution
  • cache retrieval results at the subquery level so later branches can reuse them

The same logic applies to branching. If one node yields twenty entities, do not blindly launch twenty semantic searches. Rank candidate branches first by relevance, downstream actionability, or novelty.
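Deduplication and ranked branch selection compose naturally. A sketch where `normalize` is a deliberately crude stand-in for real near-duplicate detection:

```python
def normalize(query: str) -> str:
    # crude canonical form: lowercase, order-insensitive token bag
    return " ".join(sorted(query.lower().split()))

def select_branches(candidates, score, max_branches=5):
    # rank first, then dedupe, then cap the fan-out
    seen, kept = set(), []
    for q in sorted(candidates, key=score, reverse=True):
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            kept.append(q)
        if len(kept) == max_branches:
            break
    return kept
```

With a twenty-entity node, this turns "launch twenty searches" into "launch the top five distinct ones", and the cache from section 7's cost accounting handles any branch that resurfaces later.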

7. Measure cost as part of planning, not after the incident

Multi-hop search needs explicit cost accounting, and that means more than model tokens: database calls, live API latency, reranker cost, and cross-system fan-out all count.

At planning time, assign each step:

  • expected latency
  • expected token usage
  • estimated branch factor
  • source priority
  • fallback path if the step fails or exceeds budget

That makes it possible to prefer lower-cost plans when several valid strategies exist. In production, a slightly less exhaustive plan with strong source prioritization often beats a theoretically optimal search that times out or exhausts budget.
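Comparing candidate plans by estimated cost can then be a one-liner. A sketch in which the per-tool constants are invented for illustration; in production they would come from telemetry:

```python
# illustrative relative costs per tool call, not real measurements
STEP_COST = {"structured_lookup": 1, "keyword_search": 2,
             "vector_search": 5, "hybrid_search": 8, "web_search": 20}

def plan_cost(steps):
    total = 0
    for step in steps:
        # expected cost scales with how many branches the step fans into
        total += STEP_COST[step["type"]] * step.get("branch_factor", 1)
    return total

# a narrow-then-search plan vs. a broad semantic-first plan
plan_a = [{"type": "structured_lookup"},
          {"type": "hybrid_search", "branch_factor": 3}]
plan_b = [{"type": "vector_search", "branch_factor": 10}]
cheaper = min([plan_a, plan_b], key=plan_cost)
```

The same estimate feeds the budget-cap stop condition: a step whose projected cost alone exceeds the remaining budget should trigger its fallback path instead of running.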

8. Synthesize only after evidence quality is known

The answer step should not be the place where the system discovers that evidence is weak. Before synthesis, run a quick verification pass:

  • do required evidence slots exist?
  • do sources conflict?
  • is the freshest source recent enough for this task?
  • are the supporting artifacts diverse enough, or did all evidence come from one weak branch?
  • is the answer grounded enough to produce a conclusion instead of a partial response?

If the answer is incomplete, the system should say so explicitly or ask a clarifying question. Fluent synthesis over incomplete evidence is what makes complex agentic search look smart in demos and unreliable in production.
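The checklist above can run as a small verification pass over normalized evidence before synthesis. Field names and the source-diversity threshold here are assumptions for the sketch:

```python
def verify(evidence, required_slots, min_sources=2):
    filled = {e["slot"] for e in evidence}
    sources = {e["source"] for e in evidence}
    # a slot with more than one distinct value is a conflict to surface
    by_slot = {}
    for e in evidence:
        by_slot.setdefault(e["slot"], set()).add(e["value"])
    conflicts = sorted(s for s, vals in by_slot.items() if len(vals) > 1)
    return {
        "missing_slots": sorted(required_slots - filled),
        "conflicts": conflicts,
        "diverse_sources": len(sources) >= min_sources,
    }
```

If the report shows missing slots or conflicts, the honest outputs are a partial answer that names the gap, or a clarifying question, never a fluent synthesis over the hole.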

The practical design principle is simple: retrieval is a budgeted workflow, not a limitless reasoning playground. Decompose the question into evidence needs. Build a small query graph. Pick the cheapest reliable tool for each node. Narrow the search space stage by stage. Enforce stop conditions before the workflow wanders. Treat branching, reformulation, and synthesis as controlled operations with cost and confidence attached.

That is what makes multi-hop retrieval useful instead of merely impressive in traces.
