Context engineering has emerged as the defining discipline of applied AI engineering -- the systematic practice of designing, assembling, and managing everything that goes into a language model's context window at inference time. While prompt engineering focuses on crafting instructions and examples, context engineering encompasses the broader architectural challenge: deciding what information the model needs, where that information comes from, how it's formatted and ordered, and how to manage the finite budget of tokens available. This article examines context engineering from first principles, covering context window mechanics, information architecture, retrieval-driven context assembly, and the production patterns that separate brittle prototypes from reliable systems.
The term gained wide adoption as Andrej Karpathy's observation that "the hottest new programming language is English" evolved into a more precise framing: the real skill is not writing prompts but engineering the full context that surrounds them. Tobi Lütke (Shopify's CEO) and others have described context engineering as "the art of providing all the information and tools an LLM needs to successfully accomplish a task." This reflects a maturation of the field -- from crafting clever single-shot prompts to designing information systems that dynamically assemble the right context for each interaction.
Prompt engineering, as covered in Prompt Engineering Fundamentals, deals with how to phrase instructions, structure few-shot examples, and steer model behavior. Context engineering is the superset: it encompasses the prompt but also everything else in the context window -- retrieved documents, conversation history, tool outputs, system state, memory, and metadata.
The distinction becomes clear in production systems. A chatbot prompt might be 200 tokens, but the full context at inference time is often 4,000-32,000 tokens assembled from multiple sources:
┌───────────────────────────────────────────────┐
│                Context Window                 │
│                                               │
│ ┌───────────────────────────────────────────┐ │
│ │ System Prompt (~500 tokens)               │ │
│ │ - Role, personality, constraints          │ │
│ │ - Output format specifications            │ │
│ └───────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────┐ │
│ │ Retrieved Context (~2000 tokens)          │ │
│ │ - RAG documents, knowledge base hits      │ │
│ │ - Relevant code, documentation            │ │
│ └───────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────┐ │
│ │ Conversation History (~3000 tokens)       │ │
│ │ - Prior messages (possibly summarized)    │ │
│ │ - Tool call results from prior turns      │ │
│ └───────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────┐ │
│ │ Tool Definitions (~1000 tokens)           │ │
│ │ - Function schemas, descriptions          │ │
│ │ - Available actions and parameters        │ │
│ └───────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────┐ │
│ │ Current User Message (~200 tokens)        │ │
│ │ - The actual user request                 │ │
│ └───────────────────────────────────────────┘ │
│                                               │
│ Total: ~6700 tokens of a 128K window          │
└───────────────────────────────────────────────┘
Each component is a design decision. What goes in, what stays out, how it's formatted, and where it's positioned all affect model performance. Context engineering is the discipline of making these decisions well.
Even with context windows reaching 128K-2M tokens, the bottleneck is not raw capacity but effective capacity. Research consistently demonstrates that models do not attend equally to all content in the context:
Lost in the middle (Liu et al., 2023): Models perform best when relevant information is at the very beginning or very end of the context. Information in the middle receives less attention, leading to degraded performance. This finding has direct architectural implications -- position your most critical context (system instructions, key constraints) at the start, and the user's current query at the end.
Attention dilution: As context length grows, the model's attention is distributed across more tokens. Adding irrelevant content doesn't just waste tokens -- it actively degrades performance on the relevant content. A 4K context with precisely relevant information often outperforms a 32K context padded with tangentially related content.
Reasoning capacity trade-offs: Tokens spent on context are tokens not available for reasoning (in the output). For complex tasks requiring extended chain-of-thought reasoning, reserving output token budget matters as much as curating input context.
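To make this trade-off concrete, the sketch below counts the input tokens in an assembled message list and reports how much of the window remains for output, refusing to proceed if the reservation falls below a floor. It assumes the tiktoken library and a hypothetical 128K-token model; the tokenizer choice and limits are illustrative, not tied to any specific provider.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption

def remaining_output_budget(
    messages: list[dict],
    window: int = 128_000,
    output_floor: int = 4_000,
) -> int:
    """Count input tokens and return what is left of the window for model output."""
    input_tokens = sum(
        len(enc.encode(m["content"])) for m in messages if m.get("content")
    )
    remaining = window - input_tokens
    if remaining < output_floor:
        raise ValueError("Context too large: trim retrieved content or history")
    return remaining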
A production context engineering system has distinct layers, each with its own design considerations:
The system prompt is the foundation -- the context that remains constant across all interactions within an application. Effective system prompts follow the principles covered in System Prompt Design, but context engineering adds the perspective of budget allocation: how many of your total tokens should be devoted to static instructions versus dynamic content?
Design heuristic: System prompts should consume no more than 10-15% of your effective context budget. If your system prompt is 3,000 tokens and your effective context is 8,000 tokens, you've already consumed 37% on static instructions, leaving limited room for retrieved context and conversation history.
# Context budget planning
CONTEXT_BUDGET = {
"system_prompt": 800, # 10% -- role, constraints, format
"retrieved_context": 3200, # 40% -- RAG results, knowledge
"conversation_history": 2400, # 30% -- recent messages + summary
"tool_definitions": 800, # 10% -- available tools/functions
"user_message": 400, # 5% -- current request
"safety_margin": 400, # 5% -- buffer for tokenization variance
}
# Total: 8000 tokens of a 128K window
# Remaining capacity reserved for model output
Dynamic retrieved context is, for most applications, the most impactful layer. It is assembled at query time from external sources -- vector databases, search indices, knowledge bases, APIs, or application databases -- and is where context engineering intersects with RAG (see Retrieval Strategies and Advanced RAG).
Key design decisions:
What to retrieve: Not all queries need retrieval. A classification step or embedding similarity threshold can determine whether retrieved context would help or hurt. Unnecessary retrieval adds latency and potentially dilutes the context.
How much to retrieve: More isn't better. Retrieving 20 chunks when 3 would suffice wastes context budget and dilutes attention. Conversely, retrieving too little risks missing critical information. Adaptive retrieval -- starting with a few results and expanding only if confidence is low -- often outperforms fixed-k retrieval.
How to order retrieved results: Given the "lost in the middle" finding, place the most relevant results first and last. Some practitioners reverse-sort by relevance (least relevant first, most relevant last) so the highest-relevance content is closest to the user message.
def assemble_retrieval_context(
query: str,
collection,
budget_tokens: int = 3200,
max_results: int = 10,
relevance_threshold: float = 0.7,
) -> str:
"""Retrieve and assemble context within a token budget."""
results = collection.query(
query_texts=[query],
n_results=max_results,
)
# Filter by relevance threshold (Chroma returns distances, not similarities)
filtered = []
for doc, distance, metadata in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0],
):
similarity = 1 - distance # for cosine distance
if similarity >= relevance_threshold:
filtered.append({"text": doc, "similarity": similarity, "meta": metadata})
# Sort: most relevant first (for "primacy" attention effect)
filtered.sort(key=lambda x: x["similarity"], reverse=True)
# Pack within token budget
context_parts = []
token_count = 0
for item in filtered:
chunk_tokens = len(item["text"].split()) * 1.3 # rough token estimate
if token_count + chunk_tokens > budget_tokens:
break
source = item["meta"].get("source", "unknown")
context_parts.append(f"[Source: {source}]\n{item['text']}")
token_count += chunk_tokens
return "\n\n---\n\n".join(context_parts)
For vector database options including Chroma, Pinecone, Qdrant, and pgvector, see Article 14: Vector Databases. For chunking strategies that affect retrieval quality, see Article 15: Chunking Strategies.
Managing conversation history is a core context engineering challenge. Raw conversation histories grow without bound, and naively stuffing them into context wastes tokens on irrelevant early messages while potentially losing important context from the middle.
Sliding window: Keep the last N messages. Simple but loses earlier context entirely.
Summarization: Periodically summarize older messages, maintaining a rolling summary. See Agent Memory for detailed implementations.
Selective retention: Use a model to decide which past messages are relevant to the current query, loading only those. More expensive (requires an extra LLM call) but produces the most focused context.
Hybrid approach: Maintain a rolling summary of the full conversation plus the last K verbatim messages:
class ConversationContextManager:
def __init__(self, max_history_tokens: int = 2400):
self.messages = []
self.summary = ""
self.max_tokens = max_history_tokens
def add_turn(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
def get_context(self) -> list[dict]:
"""Build conversation context within budget."""
# Always include summary if it exists
context = []
if self.summary:
context.append({
"role": "system",
"content": f"Conversation summary:\n{self.summary}"
})
# Add recent messages in reverse until budget is reached
recent = []
token_count = len(self.summary.split()) * 1.3
for msg in reversed(self.messages):
msg_tokens = len(msg["content"].split()) * 1.3
if token_count + msg_tokens > self.max_tokens:
break
recent.insert(0, msg)
token_count += msg_tokens
return context + recent
def compress(self, summarizer):
"""Summarize older messages to free context budget."""
if len(self.messages) <= 4:
return
older = self.messages[:-4]
text = "\n".join(f"{m['role']}: {m['content']}" for m in older)
self.summary = summarizer(
f"Previous summary:\n{self.summary}\n\nNew messages:\n{text}\n\n"
"Create a concise summary preserving key facts, decisions, and context."
)
self.messages = self.messages[-4:]
For agent-based applications (see Agent Architectures and Function Calling), tool definitions consume context budget. Each tool schema -- name, description, parameters, examples -- can cost 100-500 tokens.
Selective tool loading: Don't load all 50 tools for every query. Classify the user intent first, then load only relevant tools:
TOOL_GROUPS = {
"search": ["web_search", "knowledge_base_search", "code_search"],
"data": ["sql_query", "csv_analyze", "chart_generate"],
"communication": ["send_email", "create_ticket", "post_message"],
"code": ["run_code", "read_file", "write_file", "run_tests"],
}
def select_tools(user_message: str, classifier) -> list[dict]:
"""Load only relevant tool definitions based on user intent."""
    group_names = classifier(user_message)  # assumed to return matching group names, e.g. ["search"]
    tools = []
    for group in group_names:
tools.extend(TOOL_GROUPS.get(group, []))
return [TOOL_SCHEMAS[t] for t in tools]
Tool output truncation: Tool call results (API responses, search results, file contents) can be arbitrarily large. Always truncate or summarize tool outputs before adding them to context:
def truncate_tool_output(output: str, max_tokens: int = 1000) -> str:
"""Truncate tool output to fit within context budget."""
words = output.split()
if len(words) * 1.3 <= max_tokens:
return output
# Keep beginning and end (most informative parts)
keep_words = int(max_tokens / 1.3)
half = keep_words // 2
return " ".join(words[:half]) + "\n\n[...truncated...]\n\n" + " ".join(words[-half:])
The simplest pattern: concatenate fixed components in a predetermined order. Suitable for applications with predictable context needs.
def build_static_context(system_prompt: str, user_message: str) -> list[dict]:
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message},
]
The standard RAG pattern: enrich the context with retrieved documents. The context is assembled dynamically based on the user's query.
def build_rag_context(
system_prompt: str,
user_message: str,
retriever,
) -> list[dict]:
retrieved = retriever.search(user_message, k=5)
context_block = format_retrieved_docs(retrieved)
return [
{"role": "system", "content": system_prompt},
{"role": "system", "content": f"Relevant context:\n{context_block}"},
{"role": "user", "content": user_message},
]
The context evolves across multiple reasoning steps. Each tool call produces output that becomes part of the context for the next step. This is the pattern used in agent loops (see Agent Architectures).
def agentic_context_loop(
system_prompt: str,
user_message: str,
tools: list[dict],
max_steps: int = 10,
) -> str:
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_message},
]
for step in range(max_steps):
response = llm.chat(messages=messages, tools=tools)
if response.finish_reason == "stop":
return response.content
        # Record the assistant turn with its tool calls, then execute each call
        messages.append(
            {"role": "assistant", "content": None, "tool_calls": response.tool_calls}
        )
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": truncate_tool_output(str(result)),
            })
return "Max steps reached."
Complex applications pull context from multiple sources -- databases, APIs, vector stores, user profiles, session state -- and merge them into a unified context. This requires explicit orchestration:
import asyncio
async def build_multi_source_context(
user_message: str,
user_id: str,
session: dict,
budget: dict,
) -> tuple[list[dict], list[dict]]:
"""Assemble context from multiple sources in parallel."""
# Fire all retrievals concurrently
user_profile_task = asyncio.create_task(get_user_profile(user_id))
rag_task = asyncio.create_task(retrieve_documents(user_message, budget["retrieved"]))
history_task = asyncio.create_task(get_conversation_history(session["id"], budget["history"]))
tools_task = asyncio.create_task(select_tools_for_intent(user_message))
user_profile, rag_docs, history, tools = await asyncio.gather(
user_profile_task, rag_task, history_task, tools_task
)
# Assemble in optimal order for attention
system_content = build_system_prompt(user_profile, session)
context_content = format_retrieved_docs(rag_docs)
messages = [
{"role": "system", "content": system_content},
{"role": "system", "content": f"Relevant context:\n{context_content}"},
]
messages.extend(history)
messages.append({"role": "user", "content": user_message})
return messages, tools
Every token in the context should earn its place. The most common context engineering mistake is including information "just in case." Irrelevant context actively harms performance through attention dilution.
Test: For each piece of context, ask: "Would removing this change the model's output for the worse?" If not, remove it.
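One way to operationalize this test is an ablation check: regenerate the answer with each chunk removed and see whether quality drops. The sketch below assumes an llm client with a chat() method (as used in the agent loop above), a caller-supplied build_messages() helper, and a hypothetical judge.same_quality() comparison -- all placeholders, not a specific library.
def ablation_check(build_messages, context_chunks, query, llm, judge) -> list[str]:
    """Return chunks whose removal does not degrade the answer (candidates to drop)."""
    baseline = llm.chat(messages=build_messages(context_chunks, query)).content
    droppable = []
    for i, chunk in enumerate(context_chunks):
        reduced = context_chunks[:i] + context_chunks[i + 1:]
        answer = llm.chat(messages=build_messages(reduced, query)).content
        if judge.same_quality(baseline, answer):  # hypothetical quality comparison
            droppable.append(chunk)
    return droppable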
Stale context is worse than no context. If your retrieved documents are outdated, the model may generate confidently wrong answers grounded in obsolete information, so context engineering must account for information freshness.
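A simple guard is to filter or flag documents by age before they enter the context. The sketch below assumes each retrieved document carries an ISO-format "date" field in its metadata -- a convention of this example, not a requirement of any particular library.
from datetime import datetime, timedelta

def filter_stale_docs(docs: list[dict], max_age_days: int = 365) -> list[dict]:
    """Drop documents older than the freshness window; keep undated docs but flag them."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    fresh = []
    for doc in docs:
        date_str = doc.get("date")
        if date_str is None:
            doc["freshness_note"] = "undated"  # let the model see that the age is unknown
            fresh.append(doc)
            continue
        if datetime.fromisoformat(date_str) >= cutoff:
            fresh.append(doc)
    return fresh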
Including source metadata in context enables the model to cite sources, assess credibility, and handle conflicting information. This connects to Hallucination Mitigation -- models with attributed sources hallucinate less.
def format_with_attribution(docs: list[dict]) -> str:
"""Format documents with clear source attribution."""
parts = []
for i, doc in enumerate(docs, 1):
source = doc.get("source", "unknown")
date = doc.get("date", "unknown date")
parts.append(f"[{i}] Source: {source} | Date: {date}\n{doc['text']}")
return "\n\n".join(parts)
How context is formatted affects how well the model processes it. Research and practice suggest:
# Good: clearly delimited sections
context = """<system_instructions>
You are a technical support agent for Acme Cloud Platform.
</system_instructions>
<knowledge_base>
{retrieved_docs}
</knowledge_base>
<conversation_history>
{history}
</conversation_history>
<user_query>
{user_message}
</user_query>"""
# Bad: unstructured soup of text
context = f"{system_prompt}\n{retrieved_docs}\n{history}\n{user_message}"
Instructions that refer to retrieved context should be placed after the context they reference. The model processes tokens sequentially during generation, and instructions that reference not-yet-seen content are less effective:
# Better: instruction after context
messages = [
{"role": "system", "content": "You are a helpful research assistant."},
{"role": "system", "content": f"Reference documents:\n{context}"},
{"role": "user", "content": (
"Based on the reference documents above, answer this question: "
f"{question}\n\n"
"Cite document numbers in square brackets [1], [2], etc."
)},
]
# Worse: instruction before context it references
messages = [
{"role": "system", "content": (
"You are a helpful research assistant. "
"Cite document numbers in square brackets. "
"Use ONLY the provided reference documents."
)},
{"role": "user", "content": f"{question}\n\nDocuments:\n{context}"},
]
Context engineering decisions should be evaluated empirically, not by intuition. Key metrics:
What fraction of retrieved context is actually relevant to the query? Measured by having a judge model (or human) rate each context chunk for relevance. See LLM-as-Judge for automated evaluation approaches.
def evaluate_context_relevance(query: str, context_chunks: list[str], judge) -> float:
"""Score what fraction of provided context is relevant to the query."""
relevant_count = 0
for chunk in context_chunks:
score = judge.evaluate(
f"Is this context relevant to answering the query?\n"
f"Query: {query}\nContext: {chunk}\n"
f"Rate: relevant or irrelevant"
)
if score == "relevant":
relevant_count += 1
return relevant_count / len(context_chunks) if context_chunks else 0
Does the model's output actually use the provided context? Low utilization suggests the context is irrelevant or poorly positioned. High utilization with incorrect answers suggests the context itself is wrong or misleading.
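A crude but cheap proxy for utilization is lexical overlap between the answer and the retrieved context; a judge model gives a better signal, but the sketch below is often enough to spot context that is being ignored entirely.
def context_utilization(answer: str, context_chunks: list[str]) -> float:
    """Fraction of answer words that also appear in the retrieved context (rough lexical proxy)."""
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(context_chunks).lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)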
Does the model's answer faithfully reflect the provided context, or does it hallucinate beyond what the context supports? See RAG Evaluation for detailed metrics including faithfulness, answer relevance, and context precision.
Context assembly adds latency. Retrieval, summarization, tool execution, and context formatting all take time. Measure end-to-end latency and identify which context assembly steps are bottlenecks. For optimization techniques, see Cost Optimization and Inference Optimization.
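A minimal way to attribute latency is to time each assembly step with the same structured-logging style used later in this article; the helper below is an illustrative sketch (logger and retrieve_documents are assumed names), not a specific tracing library.
import time
from contextlib import contextmanager

@contextmanager
def timed(step: str, logger):
    """Log the wall-clock duration of one context assembly step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info(step, seconds=round(time.perf_counter() - start, 3))

# Usage (names are illustrative):
# with timed("retrieval", logger):
#     docs = retrieve_documents(query)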
Context assembly can dominate end-to-end latency even when it is a small share of cost: retrieval, summarization, tool execution, and formatting all run before the model generates its first token. Caching can dramatically reduce context assembly time:
Prompt caching: Many providers (Anthropic, OpenAI, Google) offer prompt caching that reduces cost and latency for repeated static prefixes. Structure your context so that the static system prompt is the prefix, followed by dynamic content.
Retrieval caching: Cache retrieval results for identical or similar queries. A semantic cache using embedding similarity can serve results for paraphrased queries without hitting the vector database.
Embedding caching: Cache embeddings for frequently queried strings to avoid repeated embedding computation.
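The sketch below illustrates the retrieval-caching idea: embeddings of past queries are kept in memory, and a sufficiently similar new query is served from the cache. embed() and cosine_sim() are placeholders for whatever embedding model and similarity function your stack already uses.
class SemanticRetrievalCache:
    """Serve cached retrieval results for queries similar to ones seen before."""

    def __init__(self, similarity_threshold: float = 0.95):
        self.entries: list[tuple[list[float], list[dict]]] = []
        self.threshold = similarity_threshold

    def get(self, query: str) -> list[dict] | None:
        query_emb = embed(query)  # placeholder embedding function
        for cached_emb, results in self.entries:
            if cosine_sim(query_emb, cached_emb) >= self.threshold:
                return results  # paraphrased query served without hitting the vector DB
        return None

    def put(self, query: str, results: list[dict]) -> None:
        self.entries.append((embed(query), results))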
When a model produces wrong or unexpected output, the first thing to check is the context. Build observability into your context assembly pipeline:
class ContextAssembler:
    def __init__(self, logger, budget: int = 8000):
        self.logger = logger
        self.budget = budget  # total input-context budget in tokens
def assemble(self, query: str, **kwargs) -> dict:
context = {}
# Log each component
context["system"] = self.build_system_prompt()
self.logger.info("system_prompt", tokens=count_tokens(context["system"]))
context["retrieved"] = self.retrieve(query)
self.logger.info("retrieval", count=len(context["retrieved"]),
tokens=count_tokens(str(context["retrieved"])))
context["history"] = self.get_history()
self.logger.info("history", turns=len(context["history"]),
tokens=count_tokens(str(context["history"])))
total = sum(count_tokens(str(v)) for v in context.values())
self.logger.info("context_assembled", total_tokens=total,
budget_utilization=total / self.budget)
return context
For production observability patterns including tracing, metrics, and alerting on context quality, see Observability.
Context engineering is inherently iterative: assemble context, evaluate the model's outputs against the metrics above, inspect the assembled context when outputs go wrong, and adjust what is included, how much, and in what order.
A common question: should you engineer better context or fine-tune the model?
Context engineering is the right choice when the problem is knowledge: the model needs specific, current, or proprietary information that changes faster than you could retrain, or answers must be attributable to sources.
Fine-tuning (see Fine-Tuning Fundamentals) is the right choice when the problem is behavior: you need consistent style, format, or domain-specific skills that are difficult to specify reliably through instructions and examples.
In practice, the most effective systems combine both: a fine-tuned model for base behavior and style, augmented with dynamic context for specific knowledge and current information.
Context engineering sits at the intersection of several disciplines covered in this series: prompt design, retrieval and RAG, agent architectures, memory, evaluation, and observability.