7 RAG Retrieval Strategies, Benchmarked
Everyone building with LLMs hits the same wall eventually: the model knows nothing about your documents. Fine-tuning is expensive and brittle. The moment your data changes, you are retraining. Retrieval-Augmented Generation (RAG) sidesteps this entirely. Instead of baking knowledge into model weights, you retrieve relevant context at query time and let the LLM reason over it.
That idea is straightforward. Making it work well is not.
I built DocMind AI to explore that gap. It started as a simple document Q&A chatbot: ingest PDFs, text files, CSVs, Markdown, and web URLs, then answer questions with source citations. But the interesting question was never "can I build a RAG pipeline?" It was "which retrieval strategy actually works best, and how do I prove it?"
That question led me to implement seven different retrieval methods, build a full evaluation framework around RAGAS, and track everything in Weights & Biases. This post covers what I built, what the benchmarks revealed, and what surprised me along the way.
Source code: github.com/SuchinW/docmind-ai. MIT licensed, includes all retrieval strategies, evaluation pipeline, and a Streamlit UI.
How the Pipeline Works
The architecture follows the standard RAG pattern, but each stage is its own module with configuration driven by a single config.yaml:
Document Loading → Chunking → Embedding → Vector Store → Retrieval → Generation
Document Loading handles PDF, TXT, CSV, Markdown, and web URLs. Each loader attaches metadata (source file, page number) so the final answer can cite where it came from.
Chunking uses RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap. The recursive splitter respects natural text boundaries (paragraph breaks, then sentences, then words) rather than cutting at arbitrary positions. The overlap ensures information at chunk boundaries is not lost.
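The core of that scheme is easy to see in plain Python. This sketch shows only the fixed-size-with-overlap mechanics; the real `RecursiveCharacterTextSplitter` additionally backs off along paragraph, sentence, and word boundaries rather than cutting mid-word:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunks; the tail of each chunk repeats at the head of the next."""
    step = chunk_size - overlap  # advance 800 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks

# a 2500-character text yields three chunks: [0:1000], [800:1800], [1600:2500]
chunks = chunk_text("x" * 2500)
```

The 200 characters shared between neighboring chunks are what keep a sentence straddling a boundary retrievable from at least one chunk.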
Embedding defaults to OpenAI's text-embedding-3-small, with HuggingFace models (all-MiniLM-L6-v2) as a local alternative. One constraint that bit me early: the embedding model at index time must match the one at query time. Different models produce incompatible vector spaces; switch models and your retrieval breaks silently.
Vector Store uses FAISS for approximate nearest-neighbor search, with persistence to disk so you do not re-embed the same documents across sessions.
Retrieval is where the project gets interesting: seven strategies, each with distinct trade-offs. More on this below.
Generation uses LangChain Expression Language (LCEL) with a prompt that instructs the model to answer only from provided context and cite sources explicitly. This prompt design turned out to matter more than which retrieval method I chose, but I will get to that in the results.
The Seven Retrieval Strategies
The retriever is the bottleneck in any RAG system. A better retriever means better context, which means better answers. I implemented seven methods not because a production system needs all of them, but because I wanted to understand the trade-offs empirically rather than relying on conventional wisdom.
1. Similarity Search (Baseline)
Pure cosine similarity in FAISS. Embed the query, return the k closest chunks. This is the simplest possible approach, and it works surprisingly well when the user's vocabulary matches the document's vocabulary. It breaks down on paraphrase. If the user says "revenue" and the document says "earnings," similarity search can miss it entirely.
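The scoring behind this baseline is nothing more than cosine similarity over embedding vectors. FAISS does it at scale with approximate indexes; a toy exact version in plain Python looks like this:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec: list[float],
                      chunk_vecs: list[list[float]],
                      k: int = 2) -> list[int]:
    """Return indices of the k chunks closest to the query by cosine similarity."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

If the embeddings never place "revenue" near "earnings", no amount of top-k tuning fixes the miss; that failure mode motivates the multi-query strategy below.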
2. Maximal Marginal Relevance (MMR)
MMR adds a diversity penalty: after selecting the most relevant chunk, it penalizes subsequent chunks that are too similar to those already selected. This prevents returning five near-identical passages when the user needs breadth. The lambda_mult parameter controls the trade-off (1.0 = pure relevance, 0.0 = pure diversity). I settled on 0.7 after experimentation.
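The greedy selection loop is short enough to sketch directly. Note the demo below uses a deliberately low `lambda_mult` of 0.3 so the diversity effect is visible on a three-chunk toy example; on real data I settled on 0.7 as stated above:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(query_vec, chunk_vecs, k=2, lambda_mult=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    candidates = list(range(len(chunk_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            redundancy = max((cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# chunks 0 and 1 are duplicates; chunk 2 is distinct but less relevant
chunks = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
diverse = mmr_select([1.0, 0.0], chunks, k=2, lambda_mult=0.3)  # skips the duplicate
```

With `lambda_mult=1.0` the redundancy term vanishes and MMR collapses back to plain similarity search, returning both duplicates.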
3. Hybrid Retrieval (BM25 + Vector + RRF)
This combines keyword search (BM25) with semantic search (vector similarity) using Reciprocal Rank Fusion. BM25 excels at exact term matching but has zero semantic understanding. Vector search understands meaning but can miss exact keywords. RRF merges both ranked lists:
```python
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = top_k
vector_retriever = vector_store.as_retriever(
    search_kwargs={"k": top_k}
)
return EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # semantic search weighted higher
)
```
The 40/60 weight split gives semantic search the lead while keeping a keyword safety net. This handles the widest range of query types.
4. Cross-Encoder Reranking
A two-stage approach: over-fetch 3x candidates with fast vector search, then re-score each with a cross-encoder that evaluates the full (query, document) pair jointly. Cross-encoders are far more accurate than bi-encoders because they attend to fine-grained interactions between query and document tokens, but they are too slow to run on the full corpus.
```python
base_retriever = vector_store.as_retriever(
    search_kwargs={"k": top_k * 3}  # over-fetch
)
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=top_k)
return ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
```
5. Multi-Query
The LLM generates alternative phrasings of the user's question, each phrasing retrieves its own chunks, and results are combined. This directly attacks vocabulary mismatch. If the user says "revenue" but the document says "earnings," one of the rephrased queries will likely bridge the gap. The cost is additional LLM calls.
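The combination logic is worth seeing in isolation. In the real system the rephrasings come from an LLM and each query hits the vector store; here a toy keyword retriever over three hypothetical documents stands in, which also demonstrates the revenue/earnings bridge directly:

```python
corpus = {
    "d1": "quarterly earnings rose sharply",
    "d2": "the company reported record revenue",
    "d3": "headcount grew last year",
}

def keyword_retrieve(query: str) -> list[str]:
    """Stand-in retriever: return IDs of documents sharing any word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items() if words & set(text.split())]

def multi_query_retrieve(queries: list[str], retrieve) -> list[str]:
    """Union the hits of every rephrased query, preserving first-seen order."""
    seen, combined = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                combined.append(doc_id)
    return combined

# "revenue growth" alone misses d1; the "earnings" rephrasing bridges the gap
hits = multi_query_retrieve(["revenue growth", "quarterly earnings figures"],
                            keyword_retrieve)
```

Deduplication matters here: without it, a chunk matched by several rephrasings would crowd out everything else in the combined context.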
6. Contextual Compression
After retrieving chunks, the LLM reads each one and extracts only the sentences relevant to the question. This reduces noise in the final context but adds an LLM call per chunk, which adds up quickly with large top_k values.
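The shape of the operation is a per-chunk filter. This sketch substitutes a crude keyword overlap test for the real extraction step, which is an LLM call per chunk; only the filtering structure carries over:

```python
def compress_chunk(chunk: str, question: str) -> str:
    """Keep only sentences that overlap the question's vocabulary.
    Keyword stand-in for the real per-chunk LLM extraction."""
    q_words = set(question.lower().rstrip("?").split())
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    kept = [s for s in sentences if q_words & set(s.lower().split())]
    return ". ".join(kept) + "." if kept else ""

chunk = ("FAISS builds an index over embedding vectors. "
         "The cafeteria menu changed on Tuesday. "
         "Queries are answered with approximate nearest-neighbor search.")
compressed = compress_chunk(chunk, "How does FAISS answer queries?")
# the off-topic cafeteria sentence is dropped from the context
```

The LLM version makes much better keep/drop decisions than keyword overlap, but the cost model is the same: one extraction call per retrieved chunk, so latency grows linearly with top_k.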
7. Parent Document Retrieval
This solves a fundamental tension in chunking: small chunks are better for precise retrieval (less noise to match against), but large chunks are better for answer generation (more surrounding context). The solution is to index small chunks for retrieval but return their larger parent chunks as context:
```python
# Small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=50
)
# Large parent chunks for rich context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=200
)
```
Each child chunk is tagged with its parent's ID, so after finding the precise match, the retriever looks up the broader surrounding context.
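The lookup step reduces to an ID indirection plus deduplication, since several matched children often share one parent. The toy data below is hypothetical; in the project this mapping lives in the retriever's document store:

```python
# Child chunks are indexed for search; each carries its parent's ID
children = {
    "c1": {"text": "the loss curve flattened", "parent_id": "p1"},
    "c2": {"text": "the learning rate decayed", "parent_id": "p1"},
    "c3": {"text": "the dataset was deduplicated", "parent_id": "p2"},
}
# Parent chunks are stored separately and returned as generation context
parents = {
    "p1": "Training section: the loss curve flattened once the learning rate decayed ...",
    "p2": "Data section: the dataset was deduplicated before tokenization ...",
}

def lookup_parents(matched_child_ids: list[str]) -> list[str]:
    """Swap precise child matches for their broader parents, deduplicating parents."""
    seen, context = set(), []
    for cid in matched_child_ids:
        pid = children[cid]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context
```

Here matches on c1, c2, and c3 collapse into just two parent chunks, which is why parent document retrieval can return rich context without blowing up the prompt size.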
Conversational Memory
A RAG chatbot that only handles single-turn queries misses a major use case. Users naturally ask follow-ups: "What about the second approach?" or "Can you elaborate on that?" These questions are meaningless without conversation history.
Passing the raw follow-up to the retriever fails because the retriever cannot resolve "that" or "the second one." The fix is a contextualization chain that rewrites follow-up questions into standalone queries:
```python
_CONTEXTUALIZE_PROMPT = ChatPromptTemplate.from_messages([
    (
        "system",
        "Given the chat history and the latest user question, "
        "reformulate the question to be a standalone question that "
        "doesn't require the chat history to understand. "
        "Do NOT answer the question, only reformulate it.",
    ),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
```
The "do NOT answer" instruction is critical. Without it, the LLM attempts to answer the question instead of reformulating it. This chain only fires when there is chat history; the first question passes through unchanged.
Evaluation with RAGAS
Building a RAG system without evaluation is guesswork. Traditional metrics like BLEU and ROUGE measure text overlap, which is useless when the same answer can be phrased many valid ways. RAGAS takes a different approach: it uses an LLM as a judge to evaluate answer quality along four dimensions:
| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? Catches hallucination. |
| Answer Relevancy | Does the answer address the question asked? Catches off-topic responses. |
| Context Precision | Are the retrieved chunks relevant to the question? Measures retrieval quality. |
| Context Recall | Do the chunks contain enough information to answer fully? Also measures retrieval. |
The evaluation pipeline generates test questions from the documents, runs each through every retrieval method, computes all four metrics, and logs everything to Weights & Biases for comparison.
```bash
# Evaluate all 7 methods with 50 test questions
python -m eval.evaluate --docs data/sample_docs/ --num-questions 50

# Compare specific methods with local embeddings
python -m eval.evaluate --docs data/sample_docs/ \
    --methods similarity hybrid rerank \
    --embedding-model all-MiniLM-L6-v2
```
Benchmark Results
I ran the full evaluation across all seven strategies: 50 LLM-generated test questions, three sample documents (covering RAG, Transformers, and vector databases), with RAGAS scoring every response. Here is what the data shows.
RAGAS Evaluation: 7 Retrieval Methods Compared
50 test questions · 3 documents · OpenAI text-embedding-3-small
| Method | Faithfulness | Answer Relevancy | Context Precision | Context Recall | Overall |
|---|---|---|---|---|---|
| Similarity | 0.9680 | 0.9279 | 0.8944 | 0.8706 | 0.9152 |
| MMR | 0.9875 | 0.9173 | 0.8528 | 0.9278 | 0.9213 |
| Hybrid | 0.9681 | 0.9219 | 0.8777 | 0.8667 | 0.9086 |
| Rerank | 0.9806 | 0.9262 | 0.8986 | 0.8595 | 0.9162 |
| Multi-Query | 0.9858 | 0.9226 | 0.8739 | 0.8778 | 0.9150 |
| Contextual Compression | 0.9649 | 0.9182 | 0.8625 | 0.9406 | 0.9215 |
| Parent Document | 0.9798 | 0.9249 | 0.8472 | 0.9500 | 0.9255 |
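The Overall column is the plain (unweighted) average of the four metrics. A quick check on the winning Parent Document row:

```python
# Parent Document row from the table above
faithfulness, relevancy, precision, recall = 0.9798, 0.9249, 0.8472, 0.9500

# unweighted mean of the four RAGAS metrics, reported as 0.9255 in the table
overall = (faithfulness + relevancy + precision + recall) / 4
```

Equal weighting is a choice, not a law; a recall-sensitive application could legitimately reweight this average and get a different winner.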
Reading the Numbers
Parent Document wins overall (0.9255). The dual-chunking strategy (small 400-character chunks as search keys, large 2000-character parent chunks as context) achieves the best balance across all four metrics. It also leads context recall by a wide margin (0.9500) because the larger parent chunks naturally contain more of the information needed for complete answers.
MMR wins faithfulness (0.9875). By promoting diversity in the retrieved set, MMR ensures the LLM sees a broader cross-section of relevant information, reducing the chance of over-relying on a single passage and hallucinating beyond what is stated. Multi-Query is close behind (0.9858); query rephrasing surfaces complementary context from different angles.
Rerank wins context precision (0.8986). The cross-encoder's joint evaluation of (query, document) pairs selects the most precisely relevant chunks. Similarity search is close (0.8944), suggesting pure cosine similarity is already a strong precision signal.
Similarity wins answer relevancy (0.9279). This was the surprise. The simplest method produced the most on-topic answers. Without diversity penalties or reranking, the context is maximally focused on the specific question, which helps the LLM stay on point.
All methods achieve high faithfulness (0.96+). The gap between best (MMR, 0.9875) and worst (contextual compression, 0.9649) is just 2.3 percentage points. This tells me that prompt design (instructing the LLM to answer only from context and cite sources) does most of the grounding work. Retrieval strategy fine-tunes the result but does not determine whether the LLM hallucinates.
Context recall is the real differentiator. The largest performance gap across all metrics is in context recall, ranging from 0.8595 (rerank) to 0.9500 (parent document). Methods that return larger or more diverse context (parent document, contextual compression, MMR) capture more information. Rerank optimizes for precision at the expense of recall. The cross-encoder aggressively filters candidates, sometimes discarding passages with supporting details.
The radar chart below makes the trade-offs visual. Each method has a distinct shape revealing its strengths and weaknesses:
[Figure: radar chart, "Method Profiles: Top 5 Retrieval Strategies". Each axis is a RAGAS metric; the wider the shape, the better the method.]
The Verdict
[Figure: bar chart, "Overall RAGAS Score by Retrieval Method", averaging faithfulness, answer relevancy, context precision, and context recall. All methods score within 1.9% of each other: retrieval strategy fine-tunes quality, while prompt design provides the foundation.]
No single method dominates everywhere. The right choice depends on your use case:
- Best overall: Parent Document (0.9255). Highest recall, near-top faithfulness, no additional API calls. The dual-chunking strategy is the most consistently effective.
- Highest factual accuracy: MMR (0.9875 faithfulness). Diversity prevents over-reliance on a single source. A simple, low-cost improvement over baseline.
- Best retrieval precision: Rerank (0.8986 context precision). The cross-encoder excels at filtering noise, at the cost of latency and the lowest recall.
- Best simplicity-to-performance: Similarity. The baseline scores highest on answer relevancy and second on precision. For well-structured documents, advanced methods may not justify their complexity.
- Safest default: Hybrid. It did not win any individual metric, but it avoids the failure modes of pure keyword or pure semantic search. Its value grows with larger, more diverse document collections.
For most applications, start with hybrid retrieval as the safe default. Then evaluate whether parent document (for recall-sensitive cases) or rerank (for precision-sensitive cases) provides meaningful improvement on your specific data.
What I Learned
The baseline is stronger than you think. Similarity search (pure cosine similarity, no tricks) scored highest on answer relevancy and second on precision. Before reaching for complex retrieval pipelines, establish a baseline. Complexity should be justified by measurable improvement, not assumed.
Chunking decisions compound downstream. Chunks too large dilute relevant signal with noise. Chunks too small lose critical context. The context recall results confirm this: Parent Document retrieval's dual-chunking achieves the highest recall precisely because it sidesteps this trade-off.
Reranking trades recall for precision. Cross-encoder reranking delivered the best precision (0.8986) but the worst recall (0.8595). It aggressively filters candidates, which helps when documents are large and noisy, but hurts when every passage matters.
Context recall is the hidden bottleneck. Faithfulness is uniformly high (0.96+) across all methods; the prompt engineering works. The real differentiator is whether the retriever surfaces enough information to answer completely. Optimizing for precision alone can paradoxically hurt answer quality by starving the generator of context.
Evaluation changes everything. Without RAGAS metrics, I would have relied on manual spot-checking, which masks systematic failures. The benchmarks revealed patterns I would never have caught: that similarity search is often good enough, that MMR's diversity penalty significantly reduces hallucination, and that the gap between methods is smaller than expected when prompt engineering is done right.
Source attribution keeps the LLM honest. Forcing the model to cite sources is not just a UX feature; it is a grounding mechanism. When the LLM must point to specific chunks, it hallucinates less. The uniformly high faithfulness scores across all methods suggest this constraint is doing most of the heavy lifting.
Tech Stack
| Component | Technology |
|---|---|
| Framework | LangChain 0.3+ (LCEL) |
| Vector Store | FAISS |
| Embeddings | OpenAI text-embedding-3-small / HuggingFace |
| LLM Providers | OpenAI, Anthropic (Claude), Google (Gemini) |
| Keyword Search | BM25 (rank-bm25) |
| Reranking | HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Evaluation | RAGAS + Weights & Biases |
| Web UI | Streamlit |
| CLI | argparse with single-query and interactive modes |
What is Next
Several directions I want to explore: fine-tuning embedding models on domain-specific data, agentic RAG with tool use for multi-step reasoning over documents, and graph-based retrieval where document relationships (not just content similarity) inform what gets retrieved. The evaluation framework makes it straightforward to benchmark any new approach against the existing baselines.
The full source code, evaluation pipeline, and Streamlit UI are available at github.com/SuchinW/docmind-ai. If you are building RAG systems, the evaluation infrastructure alone is worth pulling. It is the only way to make informed decisions about retrieval strategies instead of guessing.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.