Standard RAG pipelines follow a linear retrieve-then-generate pattern that works well for single-hop factual questions but breaks down for complex information needs requiring synthesis across multiple documents, reasoning over relationships, or dynamic retrieval strategies. This article examines the frontier of RAG research and practice -- agentic retrieval that makes iterative decisions, graph-structured knowledge retrieval, and self-correcting systems that detect and recover from retrieval failures.
Before examining advanced patterns, it is worth understanding precisely where simple retrieve-then-generate fails.
Single retrieval pass: A query like "Compare the environmental policies of the EU and US regarding carbon markets" requires retrieving information about EU policies and US policies separately. A single embedding-based retrieval may return documents about one but not the other.
No reasoning over retrieval quality: If retrieved documents are irrelevant, the LLM either hallucinates an answer or produces a vague non-answer. There is no mechanism to recognize retrieval failure and try again with a different strategy.
No relationship awareness: Questions like "Which companies funded by Sequoia went public in 2023?" require understanding the relationship between funding rounds and IPO events across multiple documents. Flat retrieval treats each document independently.
Fixed retrieval strategy: The same embedding similarity search is applied whether the query needs a code example, a statistical fact, or a conceptual explanation. Different information types may require different retrieval approaches. For a detailed treatment of how dense and sparse retrieval methods can be combined to address this, see Article 16: Retrieval Strategies.
Agentic RAG transforms retrieval from a single function call into an iterative reasoning process where an LLM agent decides what to retrieve, evaluates the results, and adapts its strategy.
The simplest agentic pattern: classify the query and route it to the appropriate retrieval pipeline.
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()
class QueryType(str, Enum):
FACTUAL = "factual" # Direct fact lookup
CONCEPTUAL = "conceptual" # Explanation/understanding
COMPARATIVE = "comparative" # Compare multiple entities
PROCEDURAL = "procedural" # How-to / step-by-step
ANALYTICAL = "analytical" # Requires reasoning over data
class QueryRoute(BaseModel):
query_type: QueryType
sub_queries: list[str]
data_sources: list[str]
def route_query(query: str) -> QueryRoute:
"""Use an LLM to analyze and route the query."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": """Analyze the query and determine:
1. Query type (factual, conceptual, comparative, procedural, analytical)
2. Sub-queries needed to fully answer it
3. Which data sources to search (docs, code, api_reference, changelog)

Respond as a JSON object with keys: query_type, sub_queries, data_sources."""
}, {
"role": "user",
"content": query
}],
response_format={"type": "json_object"}
)
return QueryRoute.model_validate_json(response.choices[0].message.content)
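A quick illustration of how the routed output is consumed; the specific values in the comments are hypothetical and depend on the model:

route = route_query("Compare the environmental policies of the EU and US regarding carbon markets")
# Plausible result (actual output varies):
#   route.query_type   -> QueryType.COMPARATIVE
#   route.sub_queries  -> ["EU carbon market policy", "US carbon market policy"]
#   route.data_sources -> ["docs"]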
A more powerful pattern gives the LLM agent access to retrieval as a tool, allowing it to make multiple retrieval calls with different queries and strategies:
import json
tools = [
{
"type": "function",
"function": {
"name": "vector_search",
"description": "Search documents by semantic similarity",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"collection": {"type": "string", "enum": ["docs", "code", "api"]},
"top_k": {"type": "integer", "default": 5}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "keyword_search",
"description": "Search documents by exact keyword matching (BM25)",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"filters": {
"type": "object",
"properties": {
"date_range": {"type": "string"},
"category": {"type": "string"}
}
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "lookup_entity",
"description": "Look up a specific entity (person, company, concept) in the knowledge graph",
"parameters": {
"type": "object",
"properties": {
"entity_name": {"type": "string"},
"relation_type": {"type": "string", "enum": ["all", "funded_by", "competes_with", "authored"]}
},
"required": ["entity_name"]
}
}
}
]
async def agentic_rag(query: str, max_iterations: int = 5) -> str:
"""RAG with an agent loop -- the LLM decides what to retrieve."""
messages = [
{"role": "system", "content": """You are a research assistant. Use the available tools
to find information needed to answer the user's question. Make multiple searches if needed.
When you have sufficient information, provide a comprehensive answer with citations."""},
{"role": "user", "content": query}
]
retrieved_context = []
for iteration in range(max_iterations):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
tool_choice="auto"
)
message = response.choices[0].message
if message.tool_calls:
messages.append(message)
for tool_call in message.tool_calls:
result = await execute_tool(tool_call)
retrieved_context.append(result)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
else:
# Agent has decided it has enough information
return message.content
    # Iteration limit reached -- make a final call without tools to force an answer
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    return final.choices[0].message.content
The agent can refine its retrieval strategy based on intermediate results, mirroring how a human researcher works: start with a broad search, identify what's missing, then search specifically for those gaps.
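A minimal way to invoke the loop above; the tool-call sequence sketched in the comments is hypothetical and depends on the model and the corpus:

import asyncio

async def main():
    answer = await agentic_rag("Which companies funded by Sequoia went public in 2023?")
    # A plausible trace: the agent first calls lookup_entity("Sequoia", relation_type="funded_by"),
    # then runs keyword_search for IPO filings on each portfolio company, then answers.
    print(answer)

asyncio.run(main())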
Traditional RAG treats documents as independent entities in a flat vector space. Knowledge graph RAG adds explicit relationships between entities, enabling traversal-based retrieval.
# Simple knowledge graph structure
class KnowledgeGraph:
def __init__(self):
self.entities = {} # entity_id -> {name, type, description, embedding}
self.relations = [] # (source_id, relation_type, target_id, properties)
def add_entity(self, entity_id: str, name: str, entity_type: str, description: str):
self.entities[entity_id] = {
"name": name,
"type": entity_type,
"description": description,
"embedding": embed(f"{name}: {description}")
}
def add_relation(self, source: str, relation: str, target: str, properties: dict = None):
self.relations.append((source, relation, target, properties or {}))
def get_neighbors(self, entity_id: str, relation_type: str = None, max_hops: int = 2):
"""Retrieve entities within max_hops of the given entity."""
visited = set()
frontier = {entity_id}
results = []
for hop in range(max_hops):
next_frontier = set()
for node in frontier:
if node in visited:
continue
visited.add(node)
for source, rel, target, props in self.relations:
if source == node and (relation_type is None or rel == relation_type):
if target not in visited:
next_frontier.add(target)
results.append({
"entity": self.entities[target],
"relation": rel,
"from": self.entities[source]["name"],
"hop": hop + 1
})
elif target == node and (relation_type is None or rel == relation_type):
if source not in visited:
next_frontier.add(source)
results.append({
"entity": self.entities[source],
"relation": rel,
"from": self.entities[node]["name"],
"hop": hop + 1
})
frontier = next_frontier
return results
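A small usage sketch of this structure; the entities, relation type, and stand-in embed function below are illustrative, not part of the original class:

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model call, just so the sketch runs
    return [0.0] * 8

kg = KnowledgeGraph()
kg.add_entity("acme", "Acme Corp", "company", "Developer infrastructure startup")
kg.add_entity("sequoia", "Sequoia", "investor", "Venture capital firm")
kg.add_relation("acme", "funded_by", "sequoia", {"round": "Series B"})

# Walk outward from Acme: returns Sequoia at hop 1 via the funded_by edge
for neighbor in kg.get_neighbors("acme", max_hops=2):
    print(f"{neighbor['from']} --{neighbor['relation']}--> {neighbor['entity']['name']} (hop {neighbor['hop']})")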
Microsoft's GraphRAG (Edge et al., 2024) introduced a systematic approach to building knowledge graphs from document collections for retrieval. The process involves the following steps (a condensed code sketch follows the list):
Entity and relationship extraction: Use an LLM to extract entities (people, organizations, concepts) and their relationships from each document chunk.
Graph construction: Build a graph where nodes are entities and edges are relationships, with each annotated by the source text.
Community detection: Apply the Leiden algorithm to identify clusters of closely related entities (communities).
Community summarization: Generate natural language summaries for each community, capturing the key entities, relationships, and themes at that level.
Hierarchical summarization: Build summaries at multiple levels of the community hierarchy, from specific (individual relationships) to abstract (high-level themes).
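A highly condensed sketch of that indexing pipeline. The helpers extract_entities_and_relations, summarize_community, and leiden_communities are hypothetical stand-ins for what the full GraphRAG implementation does with dedicated extraction prompts and the Leiden algorithm:

import networkx as nx

def build_graph_index(chunks: list[str], llm, leiden_communities) -> nx.Graph:
    """Condensed GraphRAG-style indexing: extract, connect, cluster, summarize."""
    graph = nx.Graph()
    for chunk in chunks:
        # 1-2. Entity/relationship extraction and graph construction
        extraction = llm.extract_entities_and_relations(chunk)  # hypothetical helper
        for entity in extraction["entities"]:
            graph.add_node(entity["name"], description=entity["description"])
        for source, relation, target in extraction["relations"]:
            graph.add_edge(source, target, relation=relation, source_text=chunk)

    # 3. Community detection over the entity graph
    communities = leiden_communities(graph)  # hypothetical helper

    # 4. One LLM-written summary per community, reused later by global search
    for community_id, members in communities.items():
        summary = llm.summarize_community([graph.nodes[m] for m in members])  # hypothetical
        for member in members:
            graph.nodes[member]["community"] = community_id
            graph.nodes[member]["community_summary"] = summary
    return graph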
# GraphRAG retrieval approaches
class GraphRAGRetriever:
def local_search(self, query: str, top_k: int = 5):
"""
Local search: Start from entities mentioned in the query,
traverse the graph to find related context.
Best for: Specific questions about particular entities.
"""
# 1. Extract entities from query
query_entities = self.extract_entities(query)
# 2. Find matching entities in the graph
matched = self.match_entities(query_entities)
# 3. Traverse graph from matched entities
context = []
for entity in matched:
neighbors = self.graph.get_neighbors(entity.id, max_hops=2)
context.extend(neighbors)
# 4. Retrieve source text chunks for context entities
source_texts = self.get_source_texts(context)
return source_texts
def global_search(self, query: str, level: int = 1):
"""
Global search: Use community summaries to answer broad questions.
Best for: Thematic questions, summarization across the corpus.
"""
# 1. Retrieve community summaries at the specified level
summaries = self.get_community_summaries(level=level)
# 2. Map: Generate partial answers from each relevant community
partial_answers = []
for summary in summaries:
if self.is_relevant(query, summary):
partial = self.generate_partial_answer(query, summary)
partial_answers.append(partial)
# 3. Reduce: Synthesize partial answers into a final answer
final_answer = self.synthesize(query, partial_answers)
return final_answer
When GraphRAG excels: Questions requiring understanding of relationships ("Who are the key collaborators of researcher X?"), global summarization queries ("What are the main themes in this document collection?"), and multi-entity questions ("How are companies A, B, and C related?").
Trade-offs: GraphRAG requires significant upfront processing (LLM calls for entity extraction and summarization), the graph must be rebuilt when documents change, and it adds complexity to the RAG pipeline. For simple factual retrieval, it's overkill.
Multi-hop retrieval addresses questions that cannot be answered by any single document but require connecting information across multiple sources.
Consider: "What is the GDP per capita of the country where the inventor of the World Wide Web was born?"
This requires:
1. Identifying the inventor of the World Wide Web (Tim Berners-Lee).
2. Finding the country where he was born (the United Kingdom).
3. Retrieving that country's GDP per capita.
No single retrieval step answers the question. Each step's answer informs the next step's query.
async def multi_hop_retrieve(
query: str,
retriever,
max_hops: int = 4
) -> dict:
"""Multi-hop retrieval with explicit reasoning steps."""
context = []
reasoning_chain = []
current_query = query
for hop in range(max_hops):
# Retrieve for current query
results = await retriever.search(current_query, top_k=3)
context.extend(results)
# Ask the LLM: Do we have enough to answer? If not, what's missing?
evaluation = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": """Given the original question and retrieved context, determine:
1. Can the question be fully answered with the current context?
2. If not, what specific information is still needed?
3. What search query would find the missing information?
Respond in JSON: {"answerable": bool, "missing": str, "next_query": str}"""
}, {
"role": "user",
"content": f"Question: {query}\n\nContext so far:\n" +
"\n".join([f"[{i+1}] {r['text']}" for i, r in enumerate(context)])
            }],
            response_format={"type": "json_object"}
        )
eval_result = json.loads(evaluation.choices[0].message.content)
reasoning_chain.append({
"hop": hop + 1,
"query": current_query,
"found": [r["text"][:200] for r in results],
"evaluation": eval_result
})
if eval_result["answerable"]:
break
current_query = eval_result["next_query"]
return {
"context": context,
"reasoning_chain": reasoning_chain,
"total_hops": len(reasoning_chain)
}
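Invoked on the example question; my_retriever is a placeholder for any object exposing the async search(query, top_k) method the function expects:

import asyncio

async def demo():
    result = await multi_hop_retrieve(
        "What is the GDP per capita of the country where "
        "the inventor of the World Wide Web was born?",
        retriever=my_retriever,  # placeholder retriever
    )
    for step in result["reasoning_chain"]:
        print(f"hop {step['hop']}: {step['query']!r} -> answerable: {step['evaluation']['answerable']}")

asyncio.run(demo())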
RAPTOR (Sarthi et al., 2024) introduces a novel indexing structure that builds a tree of summaries over document chunks, enabling retrieval at multiple levels of abstraction.
[Root Summary]
/ \
[Summary L2-A] [Summary L2-B]
/ | \ / | \
[Sum A1] [Sum A2] [Sum A3] [Sum B1] [Sum B2] [Sum B3]
/ | \ / | \ / | \
[c1][c2][c3][c4][c5] [c8][c9][c10]
RAPTOR supports two retrieval approaches:
Tree traversal: Start at the root, decide which children are relevant, recurse down to find the most relevant leaf chunks. Good for focused queries.
Collapsed tree: Flatten all nodes (leaves + summaries at all levels) into a single index. Retrieve the most similar nodes regardless of level. A query about high-level themes might match a summary node; a detailed query might match a leaf.
import numpy as np

class RAPTORIndex:
def __init__(self, embedding_model, llm):
self.embedding_model = embedding_model
self.llm = llm
self.tree_nodes = [] # All nodes at all levels
def build_tree(self, chunks: list[str], max_levels: int = 3):
"""Build the RAPTOR tree from leaf chunks."""
current_level = chunks
level = 0
# Store leaf nodes
for chunk in chunks:
self.tree_nodes.append({
"text": chunk,
"level": 0,
"embedding": self.embedding_model.encode(chunk)
})
while len(current_level) > 1 and level < max_levels:
# Cluster current level
embeddings = self.embedding_model.encode(current_level)
clusters = self.cluster(embeddings, current_level)
# Summarize each cluster
summaries = []
for cluster_texts in clusters:
summary = self.summarize(cluster_texts)
summaries.append(summary)
self.tree_nodes.append({
"text": summary,
"level": level + 1,
"embedding": self.embedding_model.encode(summary),
"children": cluster_texts
})
current_level = summaries
level += 1
def collapsed_tree_search(self, query: str, top_k: int = 5):
"""Search across all levels of the tree."""
query_embedding = self.embedding_model.encode(query)
similarities = [
(node, np.dot(query_embedding, node["embedding"]))
for node in self.tree_nodes
]
similarities.sort(key=lambda x: x[1], reverse=True)
return [node for node, _ in similarities[:top_k]]
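The tree-traversal mode is not implemented above. A minimal sketch of it (our own simplification, not the paper's exact procedure) that reuses the children lists stored by build_tree:

def tree_traversal_search(index: RAPTORIndex, query: str, top_k: int = 3, branch_factor: int = 2):
    """Start at the top summary level and descend, keeping only the most similar nodes."""
    query_embedding = index.embedding_model.encode(query)

    def similarity(node):
        return float(np.dot(query_embedding, node["embedding"]))

    # Begin with the highest-level summary nodes
    top_level = max(node["level"] for node in index.tree_nodes)
    frontier = [n for n in index.tree_nodes if n["level"] == top_level]

    while frontier and frontier[0]["level"] > 0:
        # Keep the most relevant summaries at this level, then expand to their children
        frontier.sort(key=similarity, reverse=True)
        selected = frontier[:branch_factor]
        child_texts = {text for node in selected for text in node.get("children", [])}
        frontier = [n for n in index.tree_nodes if n["text"] in child_texts]

    frontier.sort(key=similarity, reverse=True)
    return frontier[:top_k]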
Why RAPTOR works: For questions requiring high-level understanding ("What are the main contributions of this paper?"), summary nodes provide pre-computed abstractions that match better than any individual chunk. For detailed questions, leaf nodes provide specifics. The tree structure enables retrieval at the right level of abstraction.
Corrective RAG (Yan et al., 2024) introduces a self-evaluation mechanism that assesses the quality of retrieved documents and takes corrective action when retrieval fails.
class CorrectiveRAG:
def __init__(self, retriever, evaluator, web_searcher):
self.retriever = retriever
self.evaluator = evaluator
self.web_searcher = web_searcher
async def answer(self, query: str) -> str:
# Step 1: Initial retrieval
documents = await self.retriever.search(query, top_k=5)
# Step 2: Evaluate each document's relevance
evaluations = []
for doc in documents:
score = self.evaluator.evaluate_relevance(query, doc)
evaluations.append({"doc": doc, "score": score})
# Step 3: Triage based on evaluation
relevant = [e for e in evaluations if e["score"] > 0.7]
ambiguous = [e for e in evaluations if 0.3 <= e["score"] <= 0.7]
irrelevant = [e for e in evaluations if e["score"] < 0.3]
if len(relevant) >= 2:
# Sufficient relevant documents -- proceed with generation
context = [e["doc"] for e in relevant]
action = "CORRECT"
elif len(relevant) + len(ambiguous) >= 2:
# Some relevant content, supplement with web search
context = [e["doc"] for e in relevant + ambiguous]
web_results = await self.web_searcher.search(query)
context.extend(web_results[:3])
action = "AMBIGUOUS"
else:
# Retrieval failed -- fall back to web search entirely
context = await self.web_searcher.search(query)
action = "INCORRECT"
# Step 4: Knowledge refinement -- extract only relevant sentences
refined_context = self.refine_context(query, context)
# Step 5: Generate answer
answer = self.generate(query, refined_context, action)
return answer
def refine_context(self, query: str, documents: list) -> list[str]:
"""Extract only the sentences relevant to the query from each document."""
refined = []
for doc in documents:
sentences = split_sentences(doc["text"])
relevant_sentences = [
s for s in sentences
if self.evaluator.evaluate_relevance(query, {"text": s}) > 0.5
]
if relevant_sentences:
refined.append(" ".join(relevant_sentences))
return refined
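The evaluator is left abstract above. The CRAG paper fine-tunes a lightweight relevance evaluator; a simpler stand-in is an LLM grading call, sketched here with our own prompt wording and 0-1 scale:

class LLMRelevanceEvaluator:
    """Scores query-document relevance on a 0-1 scale using a small grading prompt."""

    def __init__(self, client, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def evaluate_relevance(self, query: str, doc: dict) -> float:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Rate how relevant the document is to the "
                 "query on a scale from 0 (irrelevant) to 10 (directly answers it). "
                 "Reply with a single number."},
                {"role": "user", "content": f"Query: {query}\n\nDocument: {doc['text']}"},
            ],
        )
        try:
            return float(response.choices[0].message.content.strip()) / 10.0
        except ValueError:
            return 0.0  # Unparseable grade -- treat as irrelevant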
The key insight of CRAG is the triage mechanism: rather than treating all retrieved documents equally, it explicitly evaluates and categorizes them, taking different corrective actions based on the assessment. For metrics and frameworks to measure how well these corrective mechanisms perform, see Article 18: RAG Evaluation.
Self-RAG (Asai et al., 2023) takes a more radical approach: it trains the LLM itself to decide when to retrieve, evaluate what was retrieved, and critique its own generated output.
Self-RAG introduces special reflection tokens that the model generates inline:
Query: "What causes northern lights?"
Model output:
[Retrieve] Yes
[Retrieved passage: "Aurora borealis occurs when charged particles from
the solar wind interact with Earth's magnetosphere..."]
[IsRel] Relevant
The northern lights, or aurora borealis, are caused by charged particles
from the sun interacting with Earth's magnetic field.
[IsSup] Fully Supported
These particles are guided by the magnetosphere toward the poles, where
they collide with atmospheric gases, causing them to emit light.
[IsUse] 5
Self-RAG is trained in two phases:
Critic model training: Train a model to generate the reflection tokens given (query, passage, generation) triples. This is trained on human annotations or GPT-4 labels.
Generator training: Train the language model end-to-end to generate both reflection tokens and text. The model learns to interleave retrieval decisions and quality assessments with generation.
Self-RAG shifts retrieval control from the application layer into the model itself. Instead of always retrieving (which wastes compute for queries the model already knows) or never retrieving (which causes hallucination), the model adaptively decides. Asai et al. (2023) showed that Self-RAG outperforms both standard RAG and vanilla LLMs across multiple benchmarks, with the added benefit of providing interpretable confidence signals via the reflection tokens.

Note that RAG systems with tool-use capabilities introduce prompt injection risks -- an adversary could embed instructions in retrieved documents to manipulate the agent's behavior. See Article 12: Adversarial Prompting for defense strategies against indirect prompt injection in retrieval pipelines.
In practice, these techniques are combined based on the application's requirements:
class AdvancedRAGSystem:
"""Production system combining multiple advanced RAG patterns."""
async def answer(self, query: str) -> dict:
# 1. Query analysis and routing
route = self.route_query(query)
# 2. Select retrieval strategy
if route.requires_multi_hop:
context = await self.multi_hop_retrieve(query)
elif route.requires_graph_traversal:
context = await self.graph_rag_retrieve(query)
elif route.is_global_summary:
context = await self.raptor_retrieve(query, level="summary")
else:
context = await self.hybrid_retrieve(query)
# 3. Evaluate retrieval quality (CRAG-style)
quality = self.evaluate_retrieval(query, context)
if quality.score < 0.3:
# Retrieval failed -- try alternative strategy
context = await self.fallback_retrieve(query, tried=route.strategy)
# 4. Generate with self-assessment
answer = self.generate_with_reflection(query, context)
return {
"answer": answer.text,
"sources": answer.citations,
"confidence": answer.confidence,
"retrieval_strategy": route.strategy,
"hops": context.hop_count if hasattr(context, 'hop_count') else 1
}
The arrival of models with 1M+ token context windows -- Gemini 1.5 Pro at 1M tokens, Claude with 200K, GPT-4.1 at 1M -- has prompted a legitimate question: does RAG still matter when you can simply dump entire document collections into the prompt?
Several benchmarks have tested this directly. "Lost in the Middle" (Liu et al., 2023) demonstrated that even models with long contexts suffer from a U-shaped attention curve: they attend well to information at the beginning and end of the context but degrade significantly for information positioned in the middle. Subsequent work by Anthropic and others has partially mitigated this, but the core finding persists -- a model's ability to locate and use relevant information degrades as context length grows.
The LOFT benchmark (Lee et al., 2024) evaluated long-context models on tasks specifically designed to test whether they could replace retrieval systems. The findings were nuanced: for simple fact lookup where the answer appears verbatim in the corpus, long context performed comparably to RAG. But for tasks requiring synthesis across multiple passages, or where the relevant information constitutes a small fraction of the total context, RAG with a well-tuned retriever consistently outperformed brute-force context stuffing.
Several conditions make RAG indispensable regardless of context window size:
Corpus exceeds context limits. Even a 1M token window holds roughly 750K words -- about 3,000 pages. Many enterprise knowledge bases, codebases, and document archives are orders of magnitude larger. Retrieval remains the only practical approach for corpora at this scale.
Cost and latency constraints. Sending 1M tokens per query is expensive (roughly $2-10 per query depending on the model) and slow (seconds to tens of seconds for prefill alone). RAG with a focused retrieval of 2-5K tokens is 100-500x cheaper per query and returns responses in under a second.
Information freshness. Long context requires reprocessing the full corpus on every query. RAG indexes can be updated incrementally as documents change, making it far more practical for dynamic knowledge bases.
Precision on needle-in-haystack tasks. When the relevant information is a single paragraph in a million-token corpus, retrieval models trained specifically to surface relevant passages will outperform an LLM's implicit attention-based "search" through a massive context.
The most effective production systems use both: RAG for precision and long context for breadth. The pattern works as follows: retrieve a focused set of highly relevant passages, place them prominently at the start of the prompt, and then spend the remaining context budget on broader background material such as surrounding sections, related documents, or conversation history.
This hybrid approach consistently outperforms either technique alone. The retriever ensures the most critical information is present and prominently positioned; the long context provides background knowledge that helps the model interpret and synthesize the retrieved passages.
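A minimal sketch of that assembly step. The names here -- retrieved_passages, background_docs, count_tokens -- are illustrative assumptions, not part of any particular framework:

def build_hybrid_prompt(query: str, retrieved_passages: list[str],
                        background_docs: list[str], count_tokens,
                        context_budget: int = 200_000) -> str:
    # 1. Retrieved passages go first, where models attend to them most reliably
    sections = ["## Most relevant passages"] + retrieved_passages

    # 2. Spend whatever budget remains on broader background material
    used = sum(count_tokens(s) for s in sections) + count_tokens(query)
    sections.append("## Background material")
    for doc in background_docs:
        cost = count_tokens(doc)
        if used + cost > context_budget:
            break
        sections.append(doc)
        used += cost

    # 3. The question goes last, immediately after the context
    return "\n\n".join(sections) + f"\n\n## Question\n{query}"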
Standard RAG assumes the language model is used as-is, with retrieved context simply prepended to the prompt. RAFT -- Retrieval Augmented Fine-Tuning (Zhang et al., 2024) -- challenges this assumption by fine-tuning the model specifically to work with retrieved context, including learning to ignore irrelevant retrieved documents.
The core insight is straightforward: in real RAG pipelines, not every retrieved document is relevant. Standard models, when presented with a mix of relevant and irrelevant context, often get confused or distracted by the noise. RAFT fine-tunes the model on training examples that deliberately include "distractor" documents alongside relevant ones.
The training data for RAFT is constructed as follows: each example pairs a question with a set of retrieved documents -- an "oracle" document that contains the answer plus several "distractor" documents that do not -- and a chain-of-thought answer grounded in the oracle document.
During training, a fraction of examples include the oracle document alongside distractors (teaching the model to identify and extract information from relevant documents while ignoring noise), while the remainder include only distractor documents (teaching the model to fall back on knowledge absorbed during fine-tuning when retrieval misses). Crucially, the model is trained to produce chain-of-thought reasoning that references specific passages, reinforcing the connection between retrieved evidence and generated answers.
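A sketch of how such training examples might be assembled; the oracle/distractor split and field names follow the description above, and the exact format used by the paper's released code may differ:

import random

def make_raft_example(question: str, oracle_doc: str, corpus: list[str],
                      cot_answer: str, num_distractors: int = 4,
                      oracle_prob: float = 0.8) -> dict:
    # Sample distractors from the corpus, excluding the oracle document itself
    distractors = random.sample([d for d in corpus if d != oracle_doc], num_distractors)

    # Most examples keep the oracle in context; the rest drop it so the model
    # also learns to answer from knowledge absorbed during fine-tuning
    documents = distractors + ([oracle_doc] if random.random() < oracle_prob else [])
    random.shuffle(documents)

    return {
        "question": question,
        "documents": documents,
        "answer": cot_answer,  # chain-of-thought answer citing the oracle passage
    }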
RAFT demonstrates significant improvements over both standard RAG and pure fine-tuning across domain-specific benchmarks. On the PubMed QA benchmark, RAFT improved accuracy by 5-10% over standard RAG with the same retriever. On HotpotQA, the gains were even larger for multi-hop questions where distractor documents are most harmful.
The practical takeaway: if you control the model (i.e., you can fine-tune it) and your retrieval pipeline has a known precision rate below 80%, RAFT-style training can meaningfully improve end-to-end answer quality. The approach is complementary to retriever improvements -- better retrieval reduces the number of distractors, while RAFT makes the model more robust to the distractors that remain.
For practitioners using DSPy (see Article 11: Prompt Optimization), a lighter-weight alternative to full RAFT training is prompt optimization over the RAG pipeline: DSPy can automatically tune the prompt and few-shot examples to improve the model's ability to extract relevant information from noisy retrieved context, without requiring gradient-based fine-tuning.
Advanced RAG techniques vary dramatically in their cost profiles. Understanding these trade-offs is essential for selecting the right approach for a given budget and use case.
Standard vector indexing is cheap. Embedding a million documents with a model like text-embedding-3-small costs roughly $0.02 per million tokens, meaning a corpus of 1M documents (averaging 500 tokens each) can be indexed for about $10. Incremental updates are proportional to the number of changed documents.
GraphRAG indexing is expensive. Entity extraction and community summarization require LLM calls for every chunk. For a corpus of 1M documents, Microsoft's GraphRAG implementation typically requires 5-10 LLM calls per chunk for entity extraction, relationship identification, and community summarization. At GPT-4o-mini pricing (~$0.15 per million input tokens), indexing 1M documents costs $500-2,000. With GPT-4o, costs scale to $5,000-20,000. The graph must also be substantially rebuilt when documents change, making it poorly suited for rapidly evolving corpora.
RAPTOR indexing falls in between. The clustering step is cheap (CPU-based), but summarization at each tree level requires LLM calls. For a 1M document corpus with 3 tree levels, expect costs of $200-800 depending on the summarization model.
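A back-of-the-envelope version of the arithmetic above, using the per-million-token prices assumed in this section:

# Rough indexing-cost comparison for a 1M-document corpus (500 tokens per document)
docs, tokens_per_doc = 1_000_000, 500
corpus_tokens = docs * tokens_per_doc  # 500M tokens

embedding_cost = corpus_tokens / 1e6 * 0.02                   # text-embedding-3-small: ~$10
graphrag_calls = 7                                            # within the 5-10 passes per chunk
graphrag_cost = corpus_tokens * graphrag_calls / 1e6 * 0.15   # GPT-4o-mini input: ~$525

print(f"vector index: ${embedding_cost:,.0f}, GraphRAG index: ~${graphrag_cost:,.0f}")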
| Technique | Retrieval Cost | Generation Input Tokens | Typical Total Cost/Query |
|---|---|---|---|
| Standard RAG (top-5) | ~$0.00001 (vector search) | 2-5K | $0.001-0.01 |
| Agentic RAG (3 iterations) | ~$0.00003 | 5-15K + 3 routing calls | $0.01-0.05 |
| GraphRAG local search | ~$0.0001 (graph traversal) | 3-8K | $0.005-0.02 |
| GraphRAG global search | ~$0.001 (multi-community) | 10-50K (map-reduce) | $0.05-0.50 |
| Multi-hop (3 hops) | ~$0.00003 | 5-15K + evaluation calls | $0.02-0.10 |
| Long context (full corpus) | N/A | 100K-1M | $0.50-10.00 |
Standard RAG is the right default. For the vast majority of question-answering workloads, it provides the best cost-quality ratio. Start here and only add complexity when you can demonstrate measurable quality gains on your specific use case.
Agentic RAG is cost-effective when query complexity varies widely. The agent can use a single retrieval pass for simple queries and multiple passes only when needed, amortizing the overhead across the workload.
GraphRAG is justified when relationship-heavy queries constitute a significant fraction of traffic and the corpus is relatively stable (infrequent re-indexing). The high indexing cost is amortized over query volume -- at 10,000 queries per day, the per-query amortized indexing cost becomes negligible within a week.
Long context is cost-effective only for low-volume, high-value queries where completeness matters more than cost -- legal document review, comprehensive due diligence, or research synthesis where missing a relevant passage has high consequences.
For teams evaluating advanced RAG, a practical budgeting framework: estimate the one-time indexing cost for your corpus size and update frequency, estimate per-query cost at projected traffic using the table above, and run a quality evaluation on a representative sample of real queries before committing -- the added cost is only justified if it produces a measurable quality gain on your workload.
Building advanced RAG pipelines requires orchestrating multiple components: retrievers, rerankers, LLM calls, evaluation steps, and routing logic. Several frameworks have emerged to manage this complexity.
LangGraph models RAG pipelines as stateful graphs where nodes are processing steps and edges define the control flow. Its key strength is explicit support for cycles and conditional branching, making it natural to implement patterns like CRAG (evaluate, then conditionally re-retrieve) and agentic RAG (loop until sufficient context). For LangGraph fundamentals and the full API, see the dedicated LangGraph article.
LangGraph works well for pipelines with complex control flow -- multi-hop retrieval with conditional termination, parallel retrieval from multiple sources with result merging, or human-in-the-loop approval steps. Its graph-based model makes the pipeline's structure inspectable and debuggable. The trade-off is verbosity: simple pipelines require more boilerplate than a linear chain, and the abstraction can feel heavy for straightforward retrieve-then-generate workflows.
LlamaIndex Workflows provide an event-driven orchestration model built around the concept of steps that emit and consume events. This model is particularly well-suited for RAG because it natively supports the data-flow patterns common in retrieval pipelines: ingest events trigger indexing, query events trigger retrieval, retrieval results trigger reranking, and so on.
LlamaIndex's strength is its deep integration with retrieval primitives -- it ships with built-in support for vector stores, knowledge graphs, RAPTOR-style tree indices, and query engines that abstract the retrieval-generation loop. For teams whose primary use case is RAG, LlamaIndex offers the fastest path from prototype to production. The trade-off is that its abstractions are opinionated about how retrieval should work, which can become constraining for highly custom pipelines.
DSPy (Khattab et al., 2023) takes a fundamentally different approach: rather than orchestrating pipeline steps procedurally, it treats the entire RAG pipeline as a program to be optimized. Retrieval, prompting, and generation are expressed as declarative modules, and DSPy's compiler optimizes the prompts and few-shot examples to maximize end-to-end performance on a development set.
For advanced RAG, DSPy's key contribution is that it can jointly optimize the retrieval query formulation and the generation prompt. Instead of manually tuning how queries are rewritten for multi-hop retrieval or how the generation prompt instructs the model to use context, DSPy discovers these configurations automatically. This is especially powerful for RAFT-like improvements without fine-tuning: DSPy can optimize the prompt to make the model more robust to distractor documents in retrieved context. For a deeper treatment of DSPy's optimization approach, see Article 11: Prompt Optimization.
The choice depends on what you are optimizing for: LangGraph when complex control flow (cycles, branching, human-in-the-loop) is the hard part, LlamaIndex Workflows when retrieval itself is the core of the product and built-in primitives shorten the path to production, and DSPy when you want prompts and query formulations tuned automatically against an evaluation set.
In practice, many production systems combine elements: DSPy for prompt optimization during development, LangGraph or LlamaIndex for runtime orchestration, and custom code for domain-specific logic that no framework handles well.