## The Problem Nobody Talks About
Ask your RAG system: "What award did the director of Inception win?"
This requires two hops:

1. Find the director of Inception.
2. Find an award that director won.
Your retrieval engine does hop 1 fine. But hop 2? The embedding of the original query is nowhere near "Academy Award" in vector space. The answer sits at rank 665. Your top-20 retrieval window never sees it.
We tested this systematically on HotpotQA fullwiki — 5.2M Wikipedia articles, 500 multi-hop questions.
Every traditional method scored 0% Hit@20. BM25. Dense retrieval. Rerankers. All of them.
In 1958, Daniel Koshland proposed the induced-fit model of enzyme binding. Unlike the rigid "lock and key" model, enzymes change their shape to fit the substrate.
We applied the same principle to retrieval.
At each hop, IFR mutates the query embedding based on what it just found. The query literally reshapes itself to reach the next piece of evidence.
Query → [hop 1: find Film X] → mutate → [hop 2: find director] → mutate → [hop 3: find award] → found
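The hop loop in that diagram can be sketched as follows. This is a minimal illustration, not the actual IFR implementation: `retrieve`, `embed_doc`, and the linear mutation rule are all assumptions standing in for whatever vector store, encoder, and update the real system uses.

```python
import numpy as np

def ifr_search(query_vec, retrieve, embed_doc, hops=3, alpha=0.5, k=5):
    """Sketch of induced-fit retrieval (hypothetical interfaces).

    retrieve(vec, k) -> top-k document ids for a query vector.
    embed_doc(doc_id) -> embedding of that document.
    alpha controls how far the query moves toward new evidence each hop.
    """
    evidence = []
    for _ in range(hops):
        docs = retrieve(query_vec, k)
        evidence.extend(d for d in docs if d not in evidence)
        # Induced fit: pull the query embedding toward the top hit so the
        # next hop can reach documents the original query never could.
        query_vec = (1 - alpha) * query_vec + alpha * embed_doc(docs[0])
    return query_vec, evidence
```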
This sounds elegant on paper. In practice, v1 was a disaster.
67% of failures came from catastrophic drift — the query mutated so aggressively that by hop 3, it had lost >80% of its original meaning. It was finding documents, but completely wrong ones.
We tested eight drift-correction approaches. Most made things worse. The winner was embarrassingly simple:
```python
# Blend 50% of the original query at every hop
query_vector = 0.5 * mutated + 0.5 * original

# Hard reset if drift exceeds the threshold
if cosine_sim(query_vector, original) < 0.5:
    query_vector = original
```
Two lines of code. nDCG went from 0.197 to 0.317 (+61%).
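For completeness, here is a self-contained version of that fix, assuming NumPy embedding vectors; `cosine_sim` is not defined in the snippet above, so it is spelled out here as plain cosine similarity.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def anchor_blend(mutated: np.ndarray, original: np.ndarray,
                 alpha: float = 0.5, threshold: float = 0.5) -> np.ndarray:
    """Blend the mutated query with the original anchor; hard-reset on drift."""
    query_vector = alpha * mutated + (1 - alpha) * original
    if cosine_sim(query_vector, original) < threshold:
        query_vector = original  # catastrophic drift: snap back to the anchor
    return query_vector
```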
Tested on HotpotQA fullwiki: 5.2M Wikipedia articles, 500 questions, 3 random seeds, single RTX 3060.
| Method | R@5 | R@10 | MRR |
|---|---|---|---|
| RAG-rerank baseline | 0.337 | 0.337 | 0.548 |
| IFR-hybrid+CE | 0.366 | 0.366 | 0.554 |
| Delta | +2.9 pts (p=0.0002) | +2.9 pts | +0.6 pts |
R@5 = R@10 because IFR surfaces all retrievable targets within the top 5 — ranks 6–10 add no new hits at this difficulty level.
Scaling is effectively constant-time: a 100x increase in corpus size produced only a 1.1x increase in latency, and beam traversal takes ~10ms on the full 5.2M-article corpus.
Raw beam search R@5 = 0.309. With cross-encoder reranking: 0.366 (+5.7 points).
The insight: drift noise scores high against the mutated query but low against the original. So the cross-encoder naturally filters it. Trying to eliminate drift at the beam level gives diminishing returns. The multi-layer pipeline is the actual solution.
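That filtering effect can be made explicit. In this hypothetical sketch, `ce_score` stands in for a real cross-encoder; the key point is that candidates are scored against the original question, not the mutated query.

```python
def filter_drift(question, candidates, ce_score, top_k=5):
    """Rerank beam candidates against the ORIGINAL question.

    Drift noise scores well against the mutated query but poorly here,
    so it falls out of the top-k without any beam-level intervention.
    """
    ranked = sorted(candidates, key=lambda doc: ce_score(question, doc),
                    reverse=True)
    return ranked[:top_k]
```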
Question for the community:
We fixed drift with a static 50% anchor blend — but this feels like a brute-force solution. Has anyone worked on adaptive blending that adjusts the anchor weight based on query complexity or hop distance? Curious what approaches you've tried.
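To seed the discussion, one untested shape for such a scheme: schedule the anchor weight by hop distance, trusting the mutated query less as drift compounds. All constants here are arbitrary placeholders, not tuned values.

```python
def adaptive_alpha(hop: int, base: float = 0.5, decay: float = 0.1,
                   floor: float = 0.2) -> float:
    """Arbitrary schedule: give the mutated query less weight on later hops."""
    return max(floor, base - decay * hop)

def blend(mutated, original, hop):
    """Anchor blend with a hop-dependent weight instead of a static 50%."""
    a = adaptive_alpha(hop)
    return a * mutated + (1 - a) * original
```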