7 RAG Retrieval Strategies, Benchmarked
Everyone building with LLMs hits the same wall eventually: the model knows nothing about your documents. Fine-tuning is expensive and brittle. The moment your data changes, you are retraining. Retrieval-Augmented Generation (RAG) sidesteps this entirely. Instead of baking knowledge into model weights, you retrieve relevant context at query time and let the LLM reason over it.
That idea is straightforward. Making it work well is not.
I built DocMind AI to explore that gap. It started as a simple document Q&A chatbot: ingest PDFs, text files, CSVs, Markdown, and web URLs, then answer questions with source citations. But the interesting question was never "can I build a RAG pipeline?" It was "which retrieval strategy actually works best, and how do I prove it?"
That question led me to implement seven different retrieval methods, build a full evaluation framework around RAGAS, and track everything in Weights & Biases. This post covers what I built, what the benchmarks revealed, and what surprised me along the way.
Source code: github.com/SuchinW/docmind-ai. MIT licensed, includes all retrieval strategies, evaluation pipeline, and a Streamlit UI.
How the Pipeline Works
The architecture follows the standard RAG pattern, but each stage is its own module with configuration driven by a single config.yaml:
Document Loading → Chunking → Embedding → Vector Store → Retrieval → Generation
Document Loading handles PDF, TXT, CSV, Markdown, and web URLs. Each loader attaches metadata (source file, page number) so the final answer can cite where it came from.
Chunking uses RecursiveCharacterTextSplitter with 1000-character chunks and 200-character overlap. The recursive splitter respects natural text boundaries (paragraph breaks, then sentences, then words) rather than cutting at arbitrary positions. The overlap ensures information at chunk boundaries is not lost.
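The core of that scheme is easy to see in plain Python. This sketch shows only the fixed-size-with-overlap mechanics; the real `RecursiveCharacterTextSplitter` additionally backs off along paragraph, sentence, and word boundaries rather than cutting mid-word:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunks; the tail of each chunk repeats at the head of the next."""
    step = chunk_size - overlap  # advance 800 characters per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks

# a 2500-character text yields three chunks: [0:1000], [800:1800], [1600:2500]
chunks = chunk_text("x" * 2500)
```

The 200 characters shared between neighboring chunks are what keep a sentence straddling a boundary retrievable from at least one chunk.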
Embedding defaults to OpenAI's text-embedding-3-small, with HuggingFace models (all-MiniLM-L6-v2) as a local alternative. One constraint that bit me early: the embedding model at index time must match the one at query time. Different models produce incompatible vector spaces; switch models and your retrieval breaks silently.
Vector Store uses FAISS for approximate nearest-neighbor search, with persistence to disk so you do not re-embed the same documents across sessions.
Retrieval is where the project gets interesting: seven strategies, each with distinct trade-offs. More on this below.
Generation uses LangChain Expression Language (LCEL) with a prompt that instructs the model to answer only from provided context and cite sources explicitly. This prompt design turned out to matter more than which retrieval method I chose, but I will get to that in the results.
The Seven Retrieval Strategies
The retriever is the bottleneck in any RAG system. A better retriever means better context, which means better answers. I implemented seven methods not because a production system needs all of them, but because I wanted to understand the trade-offs empirically rather than relying on conventional wisdom.
1. Similarity Search (Baseline)
Pure cosine similarity in FAISS. Embed the query, return the k closest chunks. This is the simplest possible approach, and it works surprisingly well when the user's vocabulary matches the document's vocabulary. It breaks down on paraphrase. If the user says "revenue" and the document says "earnings," similarity search can miss it entirely.
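The scoring behind this baseline is nothing more than cosine similarity over embedding vectors. FAISS does it at scale with approximate indexes; a toy exact version in plain Python looks like this:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec: list[float],
                      chunk_vecs: list[list[float]],
                      k: int = 2) -> list[int]:
    """Return indices of the k chunks closest to the query by cosine similarity."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

If the embeddings never place "revenue" near "earnings", no amount of top-k tuning fixes the miss; that failure mode motivates the multi-query strategy below.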
2. Maximal Marginal Relevance (MMR)
MMR adds a diversity penalty: after selecting the most relevant chunk, it penalizes subsequent chunks that are too similar to those already selected. This prevents returning five near-identical passages when the user needs breadth. The lambda_mult parameter controls the trade-off (1.0 = pure relevance, 0.0 = pure diversity). I settled on 0.7 after experimentation.
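The greedy selection loop is short enough to sketch directly. Note the demo below uses a deliberately low `lambda_mult` of 0.3 so the diversity effect is visible on a three-chunk toy example; on real data I settled on 0.7 as stated above:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(query_vec, chunk_vecs, k=2, lambda_mult=0.7):
    """Greedy MMR: trade relevance to the query against redundancy with picks so far."""
    candidates = list(range(len(chunk_vecs)))
    selected: list[int] = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            redundancy = max((cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# chunks 0 and 1 are duplicates; chunk 2 is distinct but less relevant
chunks = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
diverse = mmr_select([1.0, 0.0], chunks, k=2, lambda_mult=0.3)  # skips the duplicate
```

With `lambda_mult=1.0` the redundancy term vanishes and MMR collapses back to plain similarity search, returning both duplicates.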
3. Hybrid Retrieval (BM25 + Vector + RRF)
This combines keyword search (BM25) with semantic search (vector similarity) using Reciprocal Rank Fusion. BM25 excels at exact term matching but has zero semantic understanding. Vector search understands meaning but can miss exact keywords. RRF merges both ranked lists:
```python
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = top_k
vector_retriever = vector_store.as_retriever(
    search_kwargs={"k": top_k}
)
return EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # semantic search weighted higher
)
```
The 40/60 weight split gives semantic search the lead while keeping a keyword safety net. This handles the widest range of query types.
4. Cross-Encoder Reranking
A two-stage approach: over-fetch 3x candidates with fast vector search, then re-score each with a cross-encoder that evaluates the full (query, document) pair jointly. Cross-encoders are far more accurate than bi-encoders because they attend to fine-grained interactions between query and document tokens, but they are too slow to run on the full corpus.
```python
base_retriever = vector_store.as_retriever(
    search_kwargs={"k": top_k * 3}  # over-fetch
)
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=top_k)
return ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)
```
5. Multi-Query
The LLM generates alternative phrasings of the user's question, each phrasing retrieves its own chunks, and results are combined. This directly attacks vocabulary mismatch. If the user says "revenue" but the document says "earnings," one of the rephrased queries will likely bridge the gap. The cost is additional LLM calls.
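The combination logic is worth seeing in isolation. In the real system the rephrasings come from an LLM and each query hits the vector store; here a toy keyword retriever over three hypothetical documents stands in, which also demonstrates the revenue/earnings bridge directly:

```python
corpus = {
    "d1": "quarterly earnings rose sharply",
    "d2": "the company reported record revenue",
    "d3": "headcount grew last year",
}

def keyword_retrieve(query: str) -> list[str]:
    """Stand-in retriever: return IDs of documents sharing any word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items() if words & set(text.split())]

def multi_query_retrieve(queries: list[str], retrieve) -> list[str]:
    """Union the hits of every rephrased query, preserving first-seen order."""
    seen, combined = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                combined.append(doc_id)
    return combined

# "revenue growth" alone misses d1; the "earnings" rephrasing bridges the gap
hits = multi_query_retrieve(["revenue growth", "quarterly earnings figures"],
                            keyword_retrieve)
```

Deduplication matters here: without it, a chunk matched by several rephrasings would crowd out everything else in the combined context.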
6. Contextual Compression
After retrieving chunks, the LLM reads each one and extracts only the sentences relevant to the question. This reduces noise in the final context but adds an LLM call per chunk, which adds up quickly with large top_k values.
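The shape of the operation is a per-chunk filter. This sketch substitutes a crude keyword overlap test for the real extraction step, which is an LLM call per chunk; only the filtering structure carries over:

```python
def compress_chunk(chunk: str, question: str) -> str:
    """Keep only sentences that overlap the question's vocabulary.
    Keyword stand-in for the real per-chunk LLM extraction."""
    q_words = set(question.lower().rstrip("?").split())
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    kept = [s for s in sentences if q_words & set(s.lower().split())]
    return ". ".join(kept) + "." if kept else ""

chunk = ("FAISS builds an index over embedding vectors. "
         "The cafeteria menu changed on Tuesday. "
         "Queries are answered with approximate nearest-neighbor search.")
compressed = compress_chunk(chunk, "How does FAISS answer queries?")
# the off-topic cafeteria sentence is dropped from the context
```

The LLM version makes much better keep/drop decisions than keyword overlap, but the cost model is the same: one extraction call per retrieved chunk, so latency grows linearly with top_k.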
7. Parent Document Retrieval
This solves a fundamental tension in chunking: small chunks are better for precise retrieval (less noise to match against), but large chunks are better for answer generation (more surrounding context). The solution is to index small chunks for retrieval but return their larger parent chunks as context:
```python
# Small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=50
)
# Large parent chunks for rich context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=200
)
```
Each child chunk is tagged with its parent's ID, so after finding the precise match, the retriever looks up the broader surrounding context.
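The lookup step reduces to an ID indirection plus deduplication, since several matched children often share one parent. The toy data below is hypothetical; in the project this mapping lives in the retriever's document store:

```python
# Child chunks are indexed for search; each carries its parent's ID
children = {
    "c1": {"text": "the loss curve flattened", "parent_id": "p1"},
    "c2": {"text": "the learning rate decayed", "parent_id": "p1"},
    "c3": {"text": "the dataset was deduplicated", "parent_id": "p2"},
}
# Parent chunks are stored separately and returned as generation context
parents = {
    "p1": "Training section: the loss curve flattened once the learning rate decayed ...",
    "p2": "Data section: the dataset was deduplicated before tokenization ...",
}

def lookup_parents(matched_child_ids: list[str]) -> list[str]:
    """Swap precise child matches for their broader parents, deduplicating parents."""
    seen, context = set(), []
    for cid in matched_child_ids:
        pid = children[cid]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context
```

Here matches on c1, c2, and c3 collapse into just two parent chunks, which is why parent document retrieval can return rich context without blowing up the prompt size.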
Conversational Memory
A RAG chatbot that only handles single-turn queries misses a major use case. Users naturally ask follow-ups: "What about the second approach?" or "Can you elaborate on that?" These questions are meaningless without conversation history.
Passing the raw follow-up to the retriever fails because the retriever cannot resolve "that" or "the second one." The fix is a contextualization chain that rewrites follow-up questions into standalone queries:
```python
_CONTEXTUALIZE_PROMPT = ChatPromptTemplate.from_messages([
    (
        "system",
        "Given the chat history and the latest user question, "
        "reformulate the question to be a standalone question that "
        "doesn't require the chat history to understand. "
        "Do NOT answer the question, only reformulate it.",
    ),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
```
The "do NOT answer" instruction is critical. Without it, the LLM attempts to answer the question instead of reformulating it. This chain only fires when there is chat history; the first question passes through unchanged.
Evaluation with RAGAS
Building a RAG system without evaluation is guesswork. Traditional metrics like BLEU and ROUGE measure text overlap, which is useless when the same answer can be phrased many valid ways. RAGAS takes a different approach: it uses an LLM as a judge to evaluate answer quality along four dimensions:
| Metric | What It Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? Catches hallucination. |
| Answer Relevancy | Does the answer address the question asked? Catches off-topic responses. |
| Context Precision | Are the retrieved chunks relevant to the question? Measures retrieval quality. |
| Context Recall | Do the chunks contain enough information to answer fully? Also measures retrieval. |
The evaluation pipeline generates test questions from the documents, runs each through every retrieval method, computes all four metrics, and logs everything to Weights & Biases for comparison.
```bash
# Evaluate all 7 methods with 50 test questions
python -m eval.evaluate --docs data/sample_docs/ --num-questions 50

# Compare specific methods with local embeddings
python -m eval.evaluate --docs data/sample_docs/ \
    --methods similarity hybrid rerank \
    --embedding-model all-MiniLM-L6-v2
```
Benchmark Results
I ran the full evaluation across all seven strategies: 50 LLM-generated test questions, three sample documents (covering RAG, Transformers, and vector databases), with RAGAS scoring every response. Here is what the data shows.
RAGAS Evaluation: 7 Retrieval Methods Compared
50 test questions · 3 documents · OpenAI text-embedding-3-small
| Method | Faithfulness | Answer Relevancy | Context Precision | Context Recall | Overall |
|---|---|---|---|---|---|
| Similarity | 0.9680 | 0.9279 | 0.8944 | 0.8706 | 0.9152 |
| MMR | 0.9875 | 0.9173 | 0.8528 | 0.9278 | 0.9213 |
| Hybrid | 0.9681 | 0.9219 | 0.8777 | 0.8667 | 0.9086 |
| Rerank | 0.9806 | 0.9262 | 0.8986 | 0.8595 | 0.9162 |
| Multi-Query | 0.9858 | 0.9226 | 0.8739 | 0.8778 | 0.9150 |
| Contextual Compression | 0.9649 | 0.9182 | 0.8625 | 0.9406 | 0.9215 |
| Parent Document | 0.9798 | 0.9249 | 0.8472 | 0.9500 | 0.9255 |
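The Overall column is the plain (unweighted) average of the four metrics. A quick check on the winning Parent Document row:

```python
# Parent Document row from the table above
faithfulness, relevancy, precision, recall = 0.9798, 0.9249, 0.8472, 0.9500

# unweighted mean of the four RAGAS metrics, reported as 0.9255 in the table
overall = (faithfulness + relevancy + precision + recall) / 4
```

Equal weighting is a choice, not a law; a recall-sensitive application could legitimately reweight this average and get a different winner.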
Reading the Numbers
Parent Document wins overall (0.9255). The dual-chunking strategy (small 400-character chunks as search keys, large 2000-character parent chunks as context) achieves the best balance across all four metrics. It also leads context recall by a wide margin (0.9500) because the larger parent chunks naturally contain more of the information needed for complete answers.
MMR wins faithfulness (0.9875). By promoting diversity in the retrieved set, MMR ensures the LLM sees a broader cross-section of relevant information, reducing the chance of over-relying on a single passage and hallucinating beyond what is stated. Multi-Query is close behind (0.9858); query rephrasing surfaces complementary context from different angles.
Rerank wins context precision (0.8986). The cross-encoder's joint evaluation of (query, document) pairs selects the most precisely relevant chunks. Similarity search is close (0.8944), suggesting pure cosine similarity is already a strong precision signal.
Similarity wins answer relevancy (0.9279). This was the surprise. The simplest method produced the most on-topic answers. Without diversity penalties or reranking, the context is maximally focused on the specific question, which helps the LLM stay on point.
All methods achieve high faithfulness (0.96+). The gap between best (MMR, 0.9875) and worst (contextual compression, 0.9649) is just 2.3 percentage points. This tells me that prompt design (instructing the LLM to answer only from context and cite sources) does most of the grounding work. Retrieval strategy fine-tunes the result but does not determine whether the LLM hallucinates.
Context recall is the real differentiator. The largest performance gap across all metrics is in context recall, ranging from 0.8595 (rerank) to 0.9500 (parent document). Methods that return larger or more diverse context (parent document, contextual compression, MMR) capture more information. Rerank optimizes for precision at the expense of recall. The cross-encoder aggressively filters candidates, sometimes discarding passages with supporting details.
The radar chart below makes the trade-offs visual. Each method has a distinct shape revealing its strengths and weaknesses:
[Figure: radar chart, "Method Profiles: Top 5 Retrieval Strategies". Each axis is a RAGAS metric; the wider the shape, the better the method.]
The Verdict
[Figure: bar chart, "Overall RAGAS Score by Retrieval Method", averaging faithfulness, answer relevancy, context precision, and context recall. All methods score within 1.9% of each other: retrieval strategy fine-tunes quality, while prompt design provides the foundation.]
No single method dominates everywhere. The right choice depends on your use case:
- Best overall: Parent Document (0.9255). Highest recall, near-top faithfulness, no additional API calls. The dual-chunking strategy is the most consistently effective.
- Highest factual accuracy: MMR (0.9875 faithfulness). Diversity prevents over-reliance on a single source. A simple, low-cost improvement over baseline.
- Best retrieval precision: Rerank (0.8986 context precision). The cross-encoder excels at filtering noise, at the cost of latency and the lowest recall.
- Best simplicity-to-performance: Similarity. The baseline scores highest on answer relevancy and second on precision. For well-structured documents, advanced methods may not justify their complexity.
- Safest default: Hybrid. It did not win any individual metric, but it avoids the failure modes of pure keyword or pure semantic search. Its value grows with larger, more diverse document collections.
For most applications, start with hybrid retrieval as the safe default. Then evaluate whether parent document (for recall-sensitive cases) or rerank (for precision-sensitive cases) provides meaningful improvement on your specific data.
What I Learned
The baseline is stronger than you think. Similarity search (pure cosine similarity, no tricks) scored highest on answer relevancy and second on precision. Before reaching for complex retrieval pipelines, establish a baseline. Complexity should be justified by measurable improvement, not assumed.
Chunking decisions compound downstream. Chunks too large dilute relevant signal with noise. Chunks too small lose critical context. The context recall results confirm this: Parent Document retrieval's dual-chunking achieves the highest recall precisely because it sidesteps this trade-off.
Reranking trades recall for precision. Cross-encoder reranking delivered the best precision (0.8986) but the worst recall (0.8595). It aggressively filters candidates, which helps when documents are large and noisy, but hurts when every passage matters.
Context recall is the hidden bottleneck. Faithfulness is uniformly high (0.96+) across all methods; the prompt engineering works. The real differentiator is whether the retriever surfaces enough information to answer completely. Optimizing for precision alone can paradoxically hurt answer quality by starving the generator of context.
Evaluation changes everything. Without RAGAS metrics, I would have relied on manual spot-checking, which masks systematic failures. The benchmarks revealed patterns I would never have caught: that similarity search is often good enough, that MMR's diversity penalty significantly reduces hallucination, and that the gap between methods is smaller than expected when prompt engineering is done right.
Source attribution keeps the LLM honest. Forcing the model to cite sources is not just a UX feature; it is a grounding mechanism. When the LLM must point to specific chunks, it hallucinates less. The uniformly high faithfulness scores across all methods suggest this constraint is doing most of the heavy lifting.
Tech Stack
| Component | Technology |
|---|---|
| Framework | LangChain 0.3+ (LCEL) |
| Vector Store | FAISS |
| Embeddings | OpenAI text-embedding-3-small / HuggingFace |
| LLM Providers | OpenAI, Anthropic (Claude), Google (Gemini) |
| Keyword Search | BM25 (rank-bm25) |
| Reranking | HuggingFace cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Evaluation | RAGAS + Weights & Biases |
| Web UI | Streamlit |
| CLI | argparse with single-query and interactive modes |
What is Next
Several directions I want to explore: fine-tuning embedding models on domain-specific data, agentic RAG with tool use for multi-step reasoning over documents, and graph-based retrieval where document relationships (not just content similarity) inform what gets retrieved. The evaluation framework makes it straightforward to benchmark any new approach against the existing baselines.
The full source code, evaluation pipeline, and Streamlit UI are available at github.com/SuchinW/docmind-ai. If you are building RAG systems, the evaluation infrastructure alone is worth pulling. It is the only way to make informed decisions about retrieval strategies instead of guessing.
Written by Suchinthaka Wanninayaka
AI/ML Researcher exploring semantic communications, diffusion models, and language model systems. Writing about deep learning from theory to production.