Advanced Retrievers in LlamaIndex: Beyond Basic Vector Search

If you've been building RAG (Retrieval-Augmented Generation) applications, you've probably started with vector search — embed your documents, embed your query, retrieve the top-k similar chunks. It works. But it's only the beginning.
LlamaIndex ships with a rich set of advanced retrievers that go far beyond cosine similarity. In this article, we'll walk through six core retrieval strategies, when to use each one, and how they come together in a production pipeline.
What Are Advanced Retrievers?
Advanced retrievers are retrieval components that combine multiple techniques to find more relevant information than simple vector search can provide. They bring together:
Semantic Understanding — embeddings that grasp meaning, not just keywords
Keyword Matching — precise term-based search for exact specifications
Hierarchical Context — relationships between parent and child document segments
Multi-Query Processing — generating and merging results from query variations
Score Fusion — intelligently blending results from multiple retrieval methods
The payoff is real: more accurate retrieval, better context preservation, and fewer hallucinations from your LLM, because the context it sees is more likely to be relevant.
Index Types: The Foundation
Before retrievers make sense, you need to understand the three main index types in LlamaIndex. Each is designed for a different retrieval scenario.
VectorStoreIndex stores a vector embedding for each document chunk. It's the go-to for semantic retrieval in most LLM and RAG pipelines — best when meaning matters more than exact phrasing.
DocumentSummaryIndex generates summaries of documents at index time, then uses those summaries to filter candidates before retrieving full content. It shines when your document collection is large and diverse.
KeywordTableIndex extracts keywords from documents and maps them to specific chunks. Use it when you need exact term matching — think rule-based search or hybrid pipelines.
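To make the keyword-table idea concrete, here is a minimal pure-Python sketch of a keyword-to-chunk mapping. It is a toy illustration, not LlamaIndex's implementation: the function names `build_keyword_table` and `lookup`, the tiny stopword list, and the regex tokenizer are all assumptions made for this example.

```python
import re
from collections import defaultdict

TOKEN_PATTERN = r"[a-z0-9\-]+"

def build_keyword_table(chunks, stopwords=frozenset({"the", "a", "is", "of", "and", "to", "in", "on"})):
    """Map each lowercase keyword to the set of chunk ids that contain it."""
    table = defaultdict(set)
    for chunk_id, text in chunks.items():
        for word in re.findall(TOKEN_PATTERN, text.lower()):
            if word not in stopwords:
                table[word].add(chunk_id)
    return table

def lookup(table, query):
    """Return chunk ids ranked by how many query keywords they match."""
    hits = defaultdict(int)
    for word in re.findall(TOKEN_PATTERN, query.lower()):
        for chunk_id in table.get(word, ()):
            hits[chunk_id] += 1
    # Most matches first; ties broken alphabetically for a stable order
    return sorted(hits, key=lambda cid: (-hits[cid], cid))

chunks = {
    "c1": "BM25 is a ranking function",
    "c2": "Embeddings capture the meaning of text",
    "c3": "BM25 improves on TF-IDF ranking",
}
table = build_keyword_table(chunks)
print(lookup(table, "BM25 ranking"))
```

A query only reaches chunks that share at least one literal keyword with it, which is exactly the exact-term behavior described above and also why this index type has no notion of synonyms.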
The Six Core Retrievers
1. Vector Index Retriever — The Foundation
This is where most RAG systems start. Documents are split into nodes, each gets embedded, and at query time the closest nodes by cosine similarity are returned.
from llama_index.core.retrievers import VectorIndexRetriever
retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=3)
nodes = retriever.retrieve("What is machine learning?")
Best for: General-purpose semantic search, natural language queries, finding conceptually related content.
Limitation: It can miss exact keyword matches. If a user asks for "TF-IDF" and your documents say "Term Frequency-Inverse Document Frequency," the semantic distance may be small — but not always small enough.
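The ranking step behind vector retrieval can be sketched in a few lines of plain Python. This is only an illustration of cosine-similarity top-k: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names, not LlamaIndex APIs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, node_vecs, k=3):
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings": real ones come from an embedding model and have hundreds of dimensions
node_vecs = {
    "ml_intro": [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "deep_learning": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, node_vecs, k=2)
for node_id, score in results:
    print(node_id, round(score, 3))
```

Production vector stores typically replace the exhaustive loop with an approximate nearest-neighbor index, but the scoring idea is the same.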
2. BM25 Retriever — Keyword Search Done Right
BM25 is the default ranking function in Elasticsearch and Apache Lucene. It improves on TF-IDF with two key innovations:
Term frequency saturation — TF-IDF scores keep climbing linearly as a term repeats. BM25 applies a saturation function, so 100 occurrences scores only marginally higher than 10. This prevents overly repetitive documents from dominating.
Document length normalization — longer documents have an unfair advantage in TF-IDF. BM25 normalizes scores based on document length relative to the collection average.
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=3,
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)
Best for: Technical documentation, legal text, product specs — any domain where exact terminology matters.
Limitation: No semantic understanding. It won't handle synonyms or paraphrases.
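The saturation and length-normalization behavior described in this section can be checked with a small standalone scorer. The sketch below implements the common Okapi BM25 formula in plain Python; the parameter values `k1=1.5` and `b=0.75` are conventional defaults assumed here, and the helper names are made up for the example rather than taken from any library.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents get a larger denominator
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

def doc_with(count, length=200):
    """Build a fixed-length toy document containing `count` copies of 'bm25'."""
    return ["bm25"] * count + ["filler"] * (length - count)

# Same length, different term counts: isolates the saturation effect
s1, s10, s100, s0 = bm25_scores(["bm25"], [doc_with(1), doc_with(10), doc_with(100), doc_with(0)])
print(round(s1, 3), round(s10, 3), round(s100, 3), s0)

# Same term count, different lengths: isolates length normalization
short_score, long_score = bm25_scores(["bm25"], [doc_with(1, length=10), doc_with(1, length=100)])
print(round(short_score, 3), round(long_score, 3))
```

Going from 1 to 10 occurrences raises the score noticeably, while going from 10 to 100 adds much less, and the shorter document outscores the longer one for the same single match.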
3. Document Summary Index Retriever — Smart Pre-filtering
Instead of searching through every chunk in a large collection, this retriever first uses document-level summaries to identify which documents are relevant, then retrieves the detailed content from only those documents.
There are two modes:
LLM-based (DocumentSummaryIndexLLMRetriever): the LLM reads summaries and decides which documents to include. More accurate, higher latency.
Embedding-based (DocumentSummaryIndexEmbeddingRetriever): uses vector similarity on summaries. Faster, good for large collections.
Best for: Large, heterogeneous document sets where you don't want to compare every chunk against every query.
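The two-stage idea (rank documents by summary, then search only inside the winners) can be sketched without any LLM or embedding model. In the toy version below, a simple word-overlap score stands in for whatever the real retriever would use; `overlap_score` and `summary_then_chunks` are hypothetical names, and the documents are invented for illustration.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def summary_then_chunks(query, docs, top_docs=1, top_chunks=2):
    """Stage 1 ranks documents by summary; stage 2 ranks chunks from the winners only."""
    ranked_docs = sorted(docs, key=lambda d: overlap_score(query, d["summary"]), reverse=True)
    candidates = [
        (doc["id"], chunk)
        for doc in ranked_docs[:top_docs]
        for chunk in doc["chunks"]
    ]
    candidates.sort(key=lambda pair: overlap_score(query, pair[1]), reverse=True)
    return candidates[:top_chunks]

docs = [
    {"id": "hr_policy", "summary": "vacation and leave policy for employees",
     "chunks": ["employees accrue vacation monthly", "sick leave requires notice"]},
    {"id": "api_guide", "summary": "rest api authentication and rate limits",
     "chunks": ["api tokens expire after one hour", "rate limits reset every minute"]},
]
results = summary_then_chunks("api rate limits", docs)
for doc_id, chunk in results:
    print(doc_id, "->", chunk)
```

Chunks from `hr_policy` are never scored at all, which is where the savings on large collections come from.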
4. Auto Merging Retriever — Hierarchical Context Preservation
This retriever solves a subtle but important problem: small chunks can retrieve relevant sentences while losing the surrounding context that gives them meaning.
Auto Merging works with a hierarchical node structure — small leaf nodes nested inside larger parent nodes. If enough leaf nodes from the same parent are retrieved, the retriever automatically "merges up" and returns the larger parent chunk instead. You get more context without having to increase your top-k indiscriminately.
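from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
# Build the parent/child hierarchy; `documents` is your list of loaded Document objects
node_parser = HierarchicalNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)  # index only the leaves; keep all nodes in the docstore
# base_retriever searches the leaf nodes; storage_context holds the full hierarchy
auto_retriever = AutoMergingRetriever(base_retriever, storage_context)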
Best for: Long documents where individual sentences don't stand alone — research papers, legal contracts, technical manuals.
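The "merge up" decision itself is simple enough to sketch in plain Python. The version below handles a single parent level and uses an assumed threshold of 0.5 (merge when more than half of a parent's children were retrieved); `merge_up` and its arguments are hypothetical names, and the real retriever applies the same idea over a multi-level hierarchy.

```python
from collections import defaultdict

def merge_up(retrieved_leaf_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace groups of sibling leaves with their parent when enough siblings were retrieved."""
    hits_by_parent = defaultdict(list)
    for leaf_id in retrieved_leaf_ids:
        hits_by_parent[parent_of[leaf_id]].append(leaf_id)

    merged = []
    for parent_id, leaves in hits_by_parent.items():
        ratio = len(leaves) / len(children_of[parent_id])
        if ratio > ratio_threshold:
            merged.append(parent_id)   # enough siblings retrieved: return the parent chunk
        else:
            merged.extend(leaves)      # too few: keep the individual leaves
    return merged

children_of = {
    "section_1": ["s1_a", "s1_b", "s1_c"],
    "section_2": ["s2_a", "s2_b", "s2_c", "s2_d"],
}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

result = merge_up(["s1_a", "s1_b", "s2_c"], parent_of, children_of)
print(result)
```

Two of the three leaves under `section_1` were retrieved, so the whole section is returned, while the single hit under `section_2` stays as a leaf.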
5. Recursive Retriever — Multi-Level Reference Following
This retriever can follow references between nodes. You define a graph of retrievers — some nodes act as indexes pointing to other retrievers — and the Recursive Retriever traverses that graph at query time, following links until it finds the most specific relevant content.
Think of it like following citations: you ask a question, retrieve a high-level summary, and the retriever automatically drills into the referenced sections.
Best for: Structured knowledge bases, research papers with citations, documentation with cross-references.
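The citation-following behavior can be illustrated with a toy graph traversal. This sketch simply follows every reference from a starting node until it reaches nodes with no further references; the real retriever runs a retrieval step at each hop rather than following all links. The node layout (`text` and `refs` keys) and the function name are assumptions made for this example.

```python
def recursive_retrieve(node_id, nodes, max_depth=5, _seen=None):
    """Follow 'refs' links from a node until reaching content nodes without references."""
    seen = _seen if _seen is not None else set()
    if node_id in seen or max_depth < 0:
        return []                      # guard against cycles and runaway depth
    seen.add(node_id)
    node = nodes[node_id]
    if not node.get("refs"):
        return [node["text"]]          # leaf node: this is the actual content
    results = []
    for ref_id in node["refs"]:
        results.extend(recursive_retrieve(ref_id, nodes, max_depth - 1, seen))
    return results

nodes = {
    "summary": {"text": "Overview of the billing docs", "refs": ["invoices", "refunds"]},
    "invoices": {"text": "Invoices are issued monthly.", "refs": []},
    "refunds": {"text": "Refund overview", "refs": ["refund_policy"]},
    "refund_policy": {"text": "Refunds are processed within 5 days.", "refs": []},
}
content = recursive_retrieve("summary", nodes)
print(content)
```

Starting from the high-level summary, the traversal ends at the two most specific passages, skipping the intermediate overview nodes.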
6. QueryFusion Retriever — Multi-Query Enhancement
This is the power move. Instead of running one query, QueryFusion generates multiple variations of your original question using the LLM, runs all of them, and fuses the results.
It supports three fusion modes:
RRF (Reciprocal Rank Fusion) — combines rank positions across query variations. A document that appears in position 2 across three different queries scores very high. This is the most robust mode and the safest default for production because it's rank-based and scale-invariant.
Relative Score Fusion: rescales the raw retrieval scores within each result set to [0, 1] using min-max normalization, then sums them per document. Documents that score consistently high across variations win. Good when you trust the scores from your retriever.
Distribution-Based Score Fusion: a variant of relative score fusion that is less sensitive to outliers. Instead of the observed minimum and maximum, it rescales each result set against a range of the mean plus or minus three standard deviations of its scores. This handles cases where different query variations produce very different score distributions.
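Reciprocal Rank Fusion is short enough to write out by hand. The sketch below uses the standard formulation, where each appearance of a document contributes 1 / (k + rank); the constant `k=60` is the commonly used value and is assumed here, and the function name and document ids are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of ids: each appearance adds 1 / (k + rank), with rank starting at 1."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# One ranked list per query variation
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_d", "doc_b", "doc_a"],
    ["doc_e", "doc_b", "doc_f"],
]
fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused[:3]:
    print(doc_id, round(score, 4))
```

Here `doc_b` is never ranked first, yet it wins overall because it sits in position 2 in all three lists. Only ranks are used, so it does not matter whether the underlying scores are cosine similarities or BM25 values.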
from llama_index.core.retrievers import QueryFusionRetriever
fusion_retriever = QueryFusionRetriever(
    [base_retriever],
    similarity_top_k=5,
    num_queries=3,  # 3 queries in total: the original plus 2 LLM-generated variations
    mode="reciprocal_rerank",
    use_async=False,
)
When to pick each mode:
Production stability → RRF
Score interpretability → Relative Score
Statistical robustness with noisy scoring → Distribution-Based
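To see how the two score-based modes differ in practice, here is a small sketch of the normalization step each one relies on, as I understand the LlamaIndex implementation: min-max scaling for relative score fusion, and scaling against mean plus or minus three standard deviations for the distribution-based variant. Treat the details as assumptions and check the library source before depending on them; the function names and sample scores are invented for this example.

```python
import statistics

def min_max_normalize(scores):
    """Relative-score style: rescale so the lowest score is 0 and the highest is 1."""
    low, high = min(scores), max(scores)
    if high == low:
        return [1.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]

def distribution_normalize(scores):
    """Distribution-based style: rescale against mean +/- 3 standard deviations."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [1.0 for _ in scores]
    low, high = mean - 3 * std, mean + 3 * std
    return [(s - low) / (high - low) for s in scores]

bm25_like = [12.0, 7.5, 3.0]       # unbounded keyword scores
cosine_like = [0.82, 0.80, 0.78]   # tightly clustered similarity scores

relative = min_max_normalize(bm25_like)
print(relative)
print([round(x, 3) for x in min_max_normalize(cosine_like)])
print([round(x, 3) for x in distribution_normalize(cosine_like)])
```

Min-max scaling always stretches a result set to the full 0 to 1 range, so the lowest hit in every list contributes nothing. The distribution-based version keeps results closer to the middle of the range and never zeroes out the lowest hit in a short list, which is why it behaves better when score distributions differ a lot between lists.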
Picking the Right Retriever for Your Use Case
| Use Case | Recommended Approach |
|---|---|
| General Q&A | Vector + BM25 via QueryFusion (hybrid) |
| Technical Documentation | BM25 primary, Vector as secondary |
| Long Documents | Auto Merging Retriever |
| Research Papers | Recursive Retriever |
| Large Document Collections | Document Summary Index → Vector |
Building a Hybrid Retriever
The most practical production pattern combines vector and BM25 with a weighted score fusion. Node IDs can differ between retrievers for the same content (for example, when each retriever was built from a separately parsed copy of the documents), so this version matches results by text content:
def hybrid_retrieve(query, top_k=5, vector_weight=0.7, bm25_weight=0.3):
    """Weighted fusion of vector and BM25 results, matched by node text."""
    # Assumes vector_retriever and bm25_retriever already exist (see sections 1 and 2)
    vector_results = vector_retriever.retrieve(query)
    bm25_results = bm25_retriever.retrieve(query)
    scores = {}
    nodes = {}
    for results, weight in ((vector_results, vector_weight), (bm25_results, bm25_weight)):
        # Normalize each result set independently by its top score.
        # default=0.0 covers an empty result set; "or 1.0" avoids dividing by zero.
        max_score = max((r.score or 0.0 for r in results), default=0.0) or 1.0
        for r in results:
            key = r.text.strip()
            scores[key] = scores.get(key, 0.0) + weight * ((r.score or 0.0) / max_score)
            nodes[key] = r
    # Highest combined score first
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [(nodes[k], s) for k, s in ranked[:top_k]]
The 70/30 weighting (vector dominant) is a reasonable starting point for general queries. Shift it to 40/60 if your domain is terminology-heavy, and tune the split against your own evaluation queries.
Summary
Here's the mental model to carry forward:
Vector search is your semantic backbone — great at understanding meaning.
BM25 is your keyword anchor — great at finding exact terms.
Document Summary acts as a smart pre-filter for large collections.
Auto Merging preserves context you'd otherwise lose to chunking.
Recursive Retriever follows the structure of your knowledge.
QueryFusion turns one query into several, then fuses the best of all of them.
Most production RAG systems don't pick one — they compose. Start with a hybrid of Vector + BM25 via QueryFusion, and reach for the more specialized retrievers as your requirements get specific.
The deeper you go into LlamaIndex's retriever ecosystem, the more you realize that retrieval quality is often the biggest lever on RAG output quality. An LLM can only reason over what you put in its context — make that context count.
Thanks for reading! If you're building RAG pipelines and want to dig deeper, the LlamaIndex docs are an excellent next step — especially the retriever reference and the fusion examples.
Happy Coding!!



