Context engineering has emerged as the defining discipline of applied AI engineering -- the systematic practice of designing, assembling, and managing everything that goes into a language model's context window at inference time. While prompt engineering focuses on crafting instructions and examples, context engineering encompasses the broader architectural challenge: deciding what information the model needs, where that information comes from, how it's formatted and ordered, and how to manage the finite budget of tokens available. This article examines context engineering from first principles, covering context window mechanics, information architecture, retrieval-driven context assembly, and the production patterns that separate brittle prototypes from reliable systems.
The term gained wide adoption after Andrej Karpathy's observation that "the hottest new programming language is English" evolved into a more precise framing: the real skill is not writing prompts but engineering the full context that surrounds them. Tobi Lutke (Shopify CEO) and others have described context engineering as "the art of providing all the information and tools an LLM needs to successfully accomplish a task." This reflects a maturation of the field -- from crafting clever single-shot prompts to designing information systems that dynamically assemble the right context for each interaction.
Mental Model
The mental model for context engineering is the model is a fixed function; the only thing you control is its input, so the input is the program. You cannot change the weights at request time. Everything that determines output quality โ instructions, retrieved knowledge, tools, history, examples โ is just the argument you pass. Context engineering is the discipline of designing that argument under a hard token budget: deciding what information earns a place in the window, in what order, with what structure, for this specific request.
This subsumes "prompting" and reframes it as systems design. Prompting is the static instruction slice; the rest is dynamic โ which makes context engineering the umbrella over dynamic context assembly (build it per request), context window management (fit it to the budget), and context compression (shrink it without losing signal). The recurring judgment is signal density per token: more context is not better, the right context in the right place is.
Why Context Engineering Matters
Beyond Prompt Engineering
Prompt engineering, as covered in Prompt Engineering Fundamentals, deals with how to phrase instructions, structure few-shot examples, and steer model behavior. Context engineering is the superset: it encompasses the prompt but also everything else in the context window -- retrieved documents, conversation history, tool outputs, system state, memory, and metadata.
The distinction becomes clear in production systems. A chatbot prompt might be 200 tokens, but the full context at inference time is often 4,000-32,000 tokens assembled from multiple sources:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Context Window โ
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ System Prompt (~500 tokens) โ โ
โ โ - Role, personality, constraints โ โ
โ โ - Output format specifications โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Retrieved Context (~2000 tokens) โ โ
โ โ - RAG documents, knowledge base hits โ โ
โ โ - Relevant code, documentation โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Conversation History (~3000 tokens) โ โ
โ โ - Prior messages (possibly summarized) โ โ
โ โ - Tool call results from prior turns โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Tool Definitions (~1000 tokens) โ โ
โ โ - Function schemas, descriptions โ โ
โ โ - Available actions and parameters โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Current User Message (~200 tokens) โ โ
โ โ - The actual user request โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ Total: ~6700 tokens of an 128K window โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Each component is a design decision. What goes in, what stays out, how it's formatted, and where it's positioned all affect model performance. Context engineering is the discipline of making these decisions well.
The Information Bottleneck
Even with context windows reaching 128K-2M tokens, the bottleneck is not raw capacity but effective capacity. Research consistently demonstrates that models do not attend equally to all content in the context:
Lost in the middle (Liu et al., 2023): Models perform best when relevant information is at the very beginning or very end of the context. Information in the middle receives less attention, leading to degraded performance. This finding has direct architectural implications -- position your most critical context (system instructions, key constraints) at the start, and the user's current query at the end.
Attention dilution: As context length grows, the model's attention is distributed across more tokens. Adding irrelevant content doesn't just waste tokens -- it actively degrades performance on the relevant content. A 4K context with precisely relevant information often outperforms a 32K context padded with tangentially related content.
Reasoning capacity trade-offs: Tokens spent on context are tokens not available for reasoning (in the output). For complex tasks requiring extended chain-of-thought reasoning, reserving output token budget matters as much as curating input context.
The Context Engineering Stack
A production context engineering system has distinct layers, each with its own design considerations:
Layer 1: Static Context (System Prompt)
The system prompt is the foundation -- the context that remains constant across all interactions within an application. Effective system prompts follow the principles covered in System Prompt Design, but context engineering adds the perspective of budget allocation: how many of your total tokens should be devoted to static instructions versus dynamic content?
Design heuristic: System prompts should consume no more than 10-15% of your effective context budget. If your system prompt is 3,000 tokens and your effective context is 8,000 tokens, you've already consumed 37% on static instructions, leaving limited room for retrieved context and conversation history.
# Context budget planning
CONTEXT_BUDGET = {
"system_prompt": 800, # 10% -- role, constraints, format
"retrieved_context": 3200, # 40% -- RAG results, knowledge
"conversation_history": 2400, # 30% -- recent messages + summary
"tool_definitions": 800, # 10% -- available tools/functions
"user_message": 400, # 5% -- current request
"safety_margin": 400, # 5% -- buffer for tokenization variance
}
# Total: 8000 tokens of a 128K window
# Remaining capacity reserved for model output
Layer 2: Dynamic Retrieval Context
The most impactful layer for most applications. Dynamic context is assembled at query time from external sources -- vector databases, search indices, knowledge bases, APIs, or databases. This is where context engineering intersects with RAG (see Retrieval Strategies and Advanced RAG).
Key design decisions:
What to retrieve: Not all queries need retrieval. A classification step or embedding similarity threshold can determine whether retrieved context would help or hurt. Unnecessary retrieval adds latency and potentially dilutes the context.
How much to retrieve: More isn't better. Retrieving 20 chunks when 3 would suffice wastes context budget and dilutes attention. Conversely, retrieving too little risks missing critical information. Adaptive retrieval -- starting with a few results and expanding only if confidence is low -- often outperforms fixed-k retrieval.
How to order retrieved results: Given the "lost in the middle" finding, place the most relevant results first and last. Some practitioners reverse-sort by relevance (least relevant first, most relevant last) so the highest-relevance content is closest to the user message.
def assemble_retrieval_context(
query: str,
collection,
budget_tokens: int = 3200,
max_results: int = 10,
relevance_threshold: float = 0.7,
) -> str:
"""Retrieve and assemble context within a token budget."""
results = collection.query(
query_texts=[query],
n_results=max_results,
)
# Filter by relevance threshold (Chroma returns distances, not similarities)
filtered = []
for doc, distance, metadata in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
):
similarity = 1 - distance # for cosine distance
if similarity >= relevance_threshold:
filtered.append({"text": doc, "similarity": similarity, "meta": metadata})
# Sort: most relevant first (for "primacy" attention effect)
filtered.sort(key=lambda x: x["similarity"], reverse=True)
# Pack within token budget
context_parts = []
token_count = 0
for item in filtered:
chunk_tokens = len(item["text"].split()) * 1.3 # rough token estimate
if token_count + chunk_tokens > budget_tokens:
break
source = item["meta"].get("source", "unknown")
context_parts.append(f"[Source: {source}]\n{item['text']}")
token_count += chunk_tokens
return "\n\n---\n\n".join(context_parts)
For vector database options including Chroma, Pinecone, Qdrant, and pgvector, see Article 14: Vector Databases. For chunking strategies that affect retrieval quality, see Article 15: Chunking Strategies.
Layer 3: Conversation History
Managing conversation history is a core context engineering challenge. Raw conversation histories grow without bound, and naively stuffing them into context wastes tokens on irrelevant early messages while potentially losing important context from the middle.
Sliding window: Keep the last N messages. Simple but loses earlier context entirely.
Summarization: Periodically summarize older messages, maintaining a rolling summary. See Agent Memory for detailed implementations.
Selective retention: Use a model to decide which past messages are relevant to the current query, loading only those. More expensive (requires an extra LLM call) but produces the most focused context.
Hybrid approach: Maintain a rolling summary of the full conversation plus the last K verbatim messages:
class ConversationContextManager:
def __init__(self, max_history_tokens: int = 2400):
self.messages = []
self.summary = ""
self.max_tokens = max_history_tokens
def add_turn(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
def get_context(self) -> list[dict]:
"""Build conversation context within budget."""
# Always include summary if it exists
context = []
if self.summary:
context.append({
"role": "system",
"content": f"Conversation summary:\n{self.summary}"
})
# Add recent messages in reverse until budget is reached
recent = []
token_count = len(self.summary.split()) * 1.3
for msg in reversed(self.messages):
msg_tokens = len(msg["content"].split()) * 1.3
if token_count + msg_tokens > self.max_tokens:
break
recent.insert(0, msg)
token_count += msg_tokens
return context + recent
def compress(self, summarizer):
"""Summarize older messages to free context budget."""
if len(self.messages) <= 4:
return
older = self.messages[:-4]
text = "\n".join(f"{m['role']}: {m['content']}" for m in older)
self.summary = summarizer(
f"Previous summary:\n{self.summary}\n\nNew messages:\n{text}\n\n"
"Create a concise summary preserving key facts, decisions, and context."
)
self.messages = self.messages[-4:]
Layer 4: Tool Definitions and State
For agent-based applications (see Agent Architectures and Function Calling), tool definitions consume context budget. Each tool schema -- name, description, parameters, examples -- can cost 100-500 tokens.
Selective tool loading: Don't load all 50 tools for every query. Classify the user intent first, then load only relevant tools:
TOOL_GROUPS = {
"search": ["web_search", "knowledge_base_search", "code_search"],
"data": ["sql_query", "csv_analyze", "chart_generate"],
"communication": ["send_email", "create_ticket", "post_message"],
"code": ["run_code", "read_file", "write_file", "run_tests"],
}
def select_tools(user_message: str, classifier) -> list[dict]:
"""Load only relevant tool definitions based on user intent."""
intent = classifier(user_message) # Returns tool group names
tools = []
for group in intent.groups:
tools.extend(TOOL_GROUPS.get(group, []))
return [TOOL_SCHEMAS[t] for t in tools]
Tool output truncation: Tool call results (API responses, search results, file contents) can be arbitrarily large. Always truncate or summarize tool outputs before adding them to context:
def truncate_tool_output(output: str, max_tokens: int = 1000) -> str:
"""Truncate tool output to fit within context budget."""
words = output.split()
if len(words) * 1.3 <= max_tokens:
return output
# Keep beginning and end (most informative parts)
keep_words = int(max_tokens / 1.3)
half = keep_words // 2
return " ".join(words[:half]) + "\n\n[...truncated...]\n\n" + " ".join(words[-half:])
Context Assembly Patterns
Pattern 1: Static Assembly
The simplest pattern: concatenate fixed components in a predetermined order. Suitable for applications with predictable context needs.
def build_static_context(system_prompt: str, user_message: str) -> list[dict]:
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message},
]
Pattern 2: Retrieval-Augmented Assembly
The standard RAG pattern: enrich the context with retrieved documents. The context is assembled dynamically based on the user's query.
def build_rag_context(
system_prompt: str,
user_message: str,
retriever,
) -> list[dict]:
retrieved = retriever.search(user_message, k=5)
context_block = format_retrieved_docs(retrieved)
return [
{"role": "system", "content": system_prompt},
{"role": "system", "content": f"Relevant context:\n{context_block}"},
{"role": "user", "content": user_message},
]
Pattern 3: Agentic Assembly
The context evolves across multiple reasoning steps. Each tool call produces output that becomes part of the context for the next step. This is the pattern used in agent loops (see Agent Architectures).
def agentic_context_loop(
system_prompt: str,
user_message: str,
tools: list[dict],
max_steps: int = 10,
) -> str:
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message},
]
for step in range(max_steps):
response = llm.chat(messages=messages, tools=tools)
if response.finish_reason == "stop":
return response.content
# Execute tool calls and add results to context
for tool_call in response.tool_calls:
result = execute_tool(tool_call)
messages.append({"role": "assistant", "content": None, "tool_calls": [tool_call]})
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": truncate_tool_output(str(result)),
})
return "Max steps reached."
Pattern 4: Multi-Source Assembly
Complex applications pull context from multiple sources -- databases, APIs, vector stores, user profiles, session state -- and merge them into a unified context. This requires explicit orchestration:
import asyncio
async def build_multi_source_context(
user_message: str,
user_id: str,
session: dict,
budget: dict,
) -> list[dict]:
"""Assemble context from multiple sources in parallel."""
# Fire all retrievals concurrently
user_profile_task = asyncio.create_task(get_user_profile(user_id))
rag_task = asyncio.create_task(retrieve_documents(user_message, budget["retrieved"]))
history_task = asyncio.create_task(get_conversation_history(session["id"], budget["history"]))
tools_task = asyncio.create_task(select_tools_for_intent(user_message))
user_profile, rag_docs, history, tools = await asyncio.gather(
user_profile_task, rag_task, history_task, tools_task
)
# Assemble in optimal order for attention
system_content = build_system_prompt(user_profile, session)
context_content = format_retrieved_docs(rag_docs)
messages = [
{"role": "system", "content": system_content},
{"role": "system", "content": f"Relevant context:\n{context_content}"},
]
messages.extend(history)
messages.append({"role": "user", "content": user_message})
return messages, tools
Context Quality Principles
Principle 1: Relevance Over Volume
Every token in the context should earn its place. The most common context engineering mistake is including information "just in case." Irrelevant context actively harms performance through attention dilution.
Test: For each piece of context, ask: "Would removing this change the model's output for the worse?" If not, remove it.
Principle 2: Recency and Freshness
Stale context is worse than no context. If your retrieved documents are outdated, the model may generate confidently wrong answers grounded in obsolete information. Context engineering must account for information freshness:
- Timestamp all retrieved context and include the timestamp in the context
- Prefer recent information when multiple sources conflict
- Implement TTLs on cached context
Principle 3: Source Attribution
Including source metadata in context enables the model to cite sources, assess credibility, and handle conflicting information. This connects to Hallucination Mitigation -- models with attributed sources hallucinate less.
def format_with_attribution(docs: list[dict]) -> str:
"""Format documents with clear source attribution."""
parts = []
for i, doc in enumerate(docs, 1):
source = doc.get("source", "unknown")
date = doc.get("date", "unknown date")
parts.append(f"[{i}] Source: {source} | Date: {date}\n{doc['text']}")
return "\n\n".join(parts)
Principle 4: Structured Formatting
How context is formatted affects how well the model processes it. Research and practice suggest:
- Use clear section delimiters (XML tags, markdown headers, or separator lines)
- Keep formatting consistent across context components
- Use structured formats (JSON, markdown tables) for structured data
- Avoid mixing formatting styles within the same context
# Good: clearly delimited sections
context = """<system_instructions>
You are a technical support agent for Acme Cloud Platform.
</system_instructions>
<knowledge_base>
{retrieved_docs}
</knowledge_base>
<conversation_history>
{history}
</conversation_history>
<user_query>
{user_message}
</user_query>"""
# Bad: unstructured soup of text
context = f"{system_prompt}\n{retrieved_docs}\n{history}\n{user_message}"
Principle 5: Context-Aware Instruction Placement
Instructions that refer to retrieved context should be placed after the context they reference. The model processes tokens sequentially during generation, and instructions that reference not-yet-seen content are less effective:
# Better: instruction after context
messages = [
{"role": "system", "content": "You are a helpful research assistant."},
{"role": "system", "content": f"Reference documents:\n{context}"},
{"role": "user", "content": (
"Based on the reference documents above, answer this question: "
f"{question}\n\n"
"Cite document numbers in square brackets [1], [2], etc."
)},
]
# Worse: instruction before context it references
messages = [
{"role": "system", "content": (
"You are a helpful research assistant. "
"Cite document numbers in square brackets. "
"Use ONLY the provided reference documents."
)},
{"role": "user", "content": f"{question}\n\nDocuments:\n{context}"},
]
Measuring Context Quality
Context engineering decisions should be evaluated empirically, not by intuition. Key metrics:
Context Relevance
What fraction of retrieved context is actually relevant to the query? Measured by having a judge model (or human) rate each context chunk for relevance. See LLM-as-Judge for automated evaluation approaches.
def evaluate_context_relevance(query: str, context_chunks: list[str], judge) -> float:
"""Score what fraction of provided context is relevant to the query."""
relevant_count = 0
for chunk in context_chunks:
score = judge.evaluate(
f"Is this context relevant to answering the query?\n"
f"Query: {query}\nContext: {chunk}\n"
f"Rate: relevant or irrelevant"
)
if score == "relevant":
relevant_count += 1
return relevant_count / len(context_chunks) if context_chunks else 0
Context Utilization
Does the model's output actually use the provided context? Low utilization suggests the context is irrelevant or poorly positioned. High utilization with incorrect answers suggests the context itself is wrong or misleading.
Answer Faithfulness
Does the model's answer faithfully reflect the provided context, or does it hallucinate beyond what the context supports? See RAG Evaluation for detailed metrics including faithfulness, answer relevance, and context precision.
Latency Impact
Context assembly adds latency. Retrieval, summarization, tool execution, and context formatting all take time. Measure end-to-end latency and identify which context assembly steps are bottlenecks. For optimization techniques, see Cost Optimization and Inference Optimization.
Production Context Engineering
Caching Strategies
Context assembly is often the most expensive part of an inference pipeline (in terms of latency, not cost). Caching can dramatically reduce context assembly time:
Prompt caching: Many providers (Anthropic, OpenAI, Google) offer prompt caching that reduces cost and latency for repeated static prefixes. Structure your context so that the static system prompt is the prefix, followed by dynamic content.
Anthropic's prompt caching lets you mark a prefix with cache_control: {"type": "ephemeral"}. The first call writes the cache; subsequent calls with the same prefix pay ~10% of the input-token cost and typically see 2-5x latency reduction. Here is the pattern applied to a large system prompt plus a dynamic user turn:
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = """You are a senior technical writer ...
[several thousand tokens of style guide, glossary, templates]
""" # Long static content โ good candidate for caching
def ask_with_cached_system(question: str) -> str:
"""The system prompt prefix is cached after the first call."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}, # cache this prefix
}
],
messages=[{"role": "user", "content": question}],
)
usage = response.usage
cache_hit = getattr(usage, "cache_read_input_tokens", 0)
print(f"tokens: {usage.input_tokens} in / {usage.output_tokens} out "
f"/ {cache_hit} from cache")
return response.content[0].text
# First call โ cache miss, prefix is written
ask_with_cached_system("Rewrite this paragraph in an active voice: ...")
# Second call โ cache hit, ~90% cost saving on the system prefix
ask_with_cached_system("Convert these bullet points into a numbered list: ...")
The key constraint: the cached prefix must appear as the earliest content block (system comes before user messages, and the cached block must be identical byte-for-byte on every call). Vary only the content after the cache_control marker.
Retrieval caching: Cache retrieval results for identical or similar queries. A semantic cache using embedding similarity can serve results for paraphrased queries without hitting the vector database.
Embedding caching: Cache embeddings for frequently queried strings to avoid repeated embedding computation.
Context Debugging
When a model produces wrong or unexpected output, the first thing to check is the context. Build observability into your context assembly pipeline:
class ContextAssembler:
def __init__(self, logger):
self.logger = logger
def assemble(self, query: str, **kwargs) -> dict:
context = {}
# Log each component
context["system"] = self.build_system_prompt()
self.logger.info("system_prompt", tokens=count_tokens(context["system"]))
context["retrieved"] = self.retrieve(query)
self.logger.info("retrieval", count=len(context["retrieved"]),
tokens=count_tokens(str(context["retrieved"])))
context["history"] = self.get_history()
self.logger.info("history", turns=len(context["history"]),
tokens=count_tokens(str(context["history"])))
total = sum(count_tokens(str(v)) for v in context.values())
self.logger.info("context_assembled", total_tokens=total,
budget_utilization=total / self.budget)
return context
For production observability patterns including tracing, metrics, and alerting on context quality, see Observability.
Iterative Refinement
Context engineering is inherently iterative. The workflow:
- Prototype: Start with a simple static context. Test with representative queries.
- Identify failures: When the model fails, determine whether the failure is due to missing context, irrelevant context, poor positioning, or insufficient instructions.
- Add retrieval: If failures are due to missing knowledge, add RAG. Start with a vector database like Chroma for rapid iteration (see Vector Databases).
- Tune retrieval: Adjust chunk sizes, k values, relevance thresholds, and reranking strategies (see Chunking Strategies and Retrieval Strategies).
- Manage budget: If the context window is filling up, add summarization, selective loading, or budget-aware truncation.
- Evaluate: Use automated evals (see Eval Fundamentals) to measure whether context changes improve or degrade overall quality.
Context Engineering vs. Fine-Tuning
A common question: should you engineer better context or fine-tune the model?
Context engineering is the right choice when:
- The knowledge needed changes frequently (product docs, policies, prices)
- You need grounding in specific source documents with citations
- Your application requires different behavior for different contexts
- You want to iterate quickly without retraining
- The model already "knows how" to do the task but lacks specific information
Fine-tuning (see Fine-Tuning Fundamentals) is the right choice when:
- You need to change the model's fundamental behavior or style
- The knowledge is stable and can be baked into weights
- You need to reduce context length (and thus latency/cost) by internalizing knowledge
- The task requires specialized reasoning patterns the model doesn't exhibit with context alone
In practice, the most effective systems combine both: a fine-tuned model for base behavior and style, augmented with dynamic context for specific knowledge and current information.
Runtime Internals
The "input is the program" model hides the mechanics that decide whether a context is well-engineered or just large.
Budget allocation before assembly
The first runtime step is arithmetic, not retrieval: reserve space for the system prompt and the expected output, then divide the remainder across knowledge, history, and tools by priority. Skipping this is why systems "work in dev, 400 at the boundary" โ the budget must be planned, not discovered when the request overflows. This is the entry point to context window management.
Structure and delimiting
The same tokens perform differently depending on layout. Clearly delimited sections (XML-ish tags, headers) and instruction-after-context ordering measurably beat an unstructured "soup". The runtime principle: the model attends to boundaries and recency, so structure is not cosmetic โ it is a correctness lever, and it is why naive concatenation underperforms a deliberate template.
Measuring context quality
You cannot tune what you cannot measure. The runtime instruments context: token utilization, source mix, drop rate, and โ critically โ task accuracy with vs without each context component (ablation). A component that does not move the metric is wasted budget; one that hurts it is active noise. This is eval fundamentals applied to the input rather than the model.
Context engineering vs fine-tuning
A recurring runtime decision: encode behavior in weights (fine-tune) or in the input (context). Context is dynamic, inspectable, instantly updatable, and cheap to change; fine-tuning bakes in style and base behavior but is slow and opaque. Mature systems do both โ a fine-tuned base for behavior, dynamic context for current knowledge โ the same split discussed in fine-tuning fundamentals.
Connections to Other Topics
Context engineering sits at the intersection of several disciplines covered in this series:
- Prompt design (Prompt Engineering Fundamentals, System Prompts): The static foundation of context
- Retrieval (Retrieval Strategies, Advanced RAG, Vector Databases): Dynamic context sourcing
- Memory (Agent Memory): Persistent context across sessions
- Evaluation (Eval Fundamentals, RAG Evaluation): Measuring context quality
- Agents (Agent Architectures, Function Calling): Agentic context assembly
- Cost (Cost Optimization): Context budget management
- Safety (Hallucination Mitigation, Guardrails & Filtering): Context-driven safety