The design of agent architectures determines how language models reason, plan, and act in complex environments. From the foundational ReAct framework that interleaves reasoning with action, to sophisticated cognitive architectures that incorporate reflection, search, and hierarchical planning, the field has rapidly evolved beyond simple prompt-response patterns. This article provides a deep technical examination of the major agent architecture paradigms, their theoretical foundations, implementation patterns, and practical tradeoffs observed in production systems.
Before examining specific architectures, it is worth understanding the conceptual progression that led to modern agent designs. Chain-of-Thought prompting (Wei et al., 2022) demonstrated that LLMs could perform multi-step reasoning when prompted to "think step by step" (see Article 08: Few-Shot & Chain-of-Thought for a thorough treatment). This was a passive capability -- the model reasoned but could not act on its reasoning.
The key insight of agent architectures is closing the loop: the model reasons about what to do, takes an action (typically a tool call), observes the result, and then reasons again. This observe-think-act cycle, borrowed from classical AI agent theory (Russell & Norvig), becomes the foundation for all modern LLM agent designs.
ReAct (Yao et al., 2022, "ReAct: Synergizing Reasoning and Acting in Language Models") introduced a deceptively simple but powerful pattern: interleave reasoning traces (thoughts) with actions and observations in a single prompt sequence.
The format follows:
Thought: I need to find the population of Tokyo to answer this question.
Action: search("Tokyo population 2024")
Observation: Tokyo's population is approximately 13.96 million in the city proper...
Thought: Now I have the population. The user also asked about the area, so I need to look that up.
Action: search("Tokyo area square kilometers")
Observation: Tokyo covers 2,194 square kilometers...
Thought: I now have both pieces of information to answer the question.
Action: finish("Tokyo has a population of ~14 million across 2,194 km²")
The power of ReAct comes from the synergy between reasoning and acting. The reasoning traces serve multiple purposes: they decompose the task into subgoals, track progress toward the answer, adjust the next action when an observation is unexpected, and leave an interpretable record of why the agent did what it did.
Yao et al. showed that ReAct outperformed both pure reasoning (Chain-of-Thought without actions) and pure acting (actions without reasoning traces) on knowledge-intensive tasks (HotpotQA, FEVER) and interactive decision-making tasks (ALFWorld, WebShop).
A minimal ReAct loop in code:
class ReActAgent:
    def __init__(self, llm, tools, max_steps=10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps

    def run(self, question: str) -> str:
        prompt = self._build_initial_prompt(question)
        for step in range(self.max_steps):
            response = self.llm.generate(prompt)
            # Parse thought and action from response
            thought, action, action_input = self._parse_response(response)
            prompt += f"\nThought: {thought}\nAction: {action}({action_input})"
            if action == "finish":
                return action_input
            # Execute tool and get observation
            tool = self.tools.get(action)
            if tool is None:
                observation = f"Error: Unknown tool '{action}'"
            else:
                observation = tool.execute(action_input)
            prompt += f"\nObservation: {observation}\n"
        return "Max steps exceeded without reaching an answer."
ReAct's step-by-step nature means it struggles with tasks requiring long-horizon planning. Each thought considers only the immediate next action, without a global plan. This leads to redundant tool calls, agents getting stuck in local loops, and poor performance on tasks whose steps must be coordinated in advance.
Plan-and-Execute agents address ReAct's myopia by separating planning from execution into distinct phases:
class PlanAndExecuteAgent:
    def __init__(self, planner_llm, executor_llm, tools):
        self.planner = planner_llm
        self.tools = tools  # kept so the planner can list available tools
        self.executor = ReActAgent(executor_llm, tools)

    def run(self, task: str) -> str:
        # Phase 1: Create plan
        plan = self.planner.generate(
            f"Create a step-by-step plan to accomplish: {task}\n"
            f"Available tools: {[t.name for t in self.tools]}\n"
            f"Return a numbered list of steps."
        )
        steps = self._parse_plan(plan)
        results = []
        # A while loop (rather than a for) lets replanned steps take effect
        i = 0
        while i < len(steps):
            step = steps[i]
            # Phase 2: Execute each step
            context = (f"Overall task: {task}\n"
                       f"Previous results: {results}\n"
                       f"Current step: {step}")
            result = self.executor.run(context)
            results.append({"step": step, "result": result})
            # Phase 3: Check if replanning is needed
            if self._should_replan(task, steps[i+1:], results):
                remaining_steps = self._replan(task, results, steps[i+1:])
                steps = steps[:i+1] + remaining_steps
            i += 1
        return self._synthesize_answer(task, results)
Plan-and-Execute provides several benefits: the planner sees the whole task at once, which reduces myopic detours; execution steps can run on a smaller, cheaper model; and the explicit plan gives a natural unit for progress tracking and debugging.
The most challenging aspect of Plan-and-Execute is knowing when and how to replan. Too much replanning degenerates into step-by-step execution (essentially ReAct). Too little means the agent stubbornly follows a plan that has become invalid.
Effective replanning heuristics include: replan when a step fails or returns an error, when an observation contradicts an assumption the plan relies on, or at fixed checkpoints in long plans; otherwise keep executing. A minimal check is sketched below.
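As a concrete illustration, here is a minimal sketch of the `_should_replan` method referenced in the code above, assuming the planner exposes a generate(prompt) method; the failure keyword and the VALID/INVALID protocol are illustrative choices, not a prescribed interface:

def _should_replan(self, task: str, remaining_steps: list, results: list) -> bool:
    """Replan when the last step failed or invalidated the remaining plan."""
    if not results:
        return False
    last = results[-1]
    # Heuristic 1: explicit failure signal from the executor
    if "error" in str(last["result"]).lower():
        return True
    # Heuristic 2: ask the model whether the remaining plan is still valid
    verdict = self.planner.generate(
        f"Task: {task}\n"
        f"Latest result: {last['result']}\n"
        f"Remaining plan: {remaining_steps}\n"
        f"Is the remaining plan still valid? Answer VALID or INVALID."
    )
    return "INVALID" in verdict.upper()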
Reflexion (Shinn et al., 2023, "Reflexion: Language Agents with Verbal Reinforcement Learning") introduces a meta-cognitive layer: after completing a task (successfully or not), the agent generates a natural language reflection on what went well and what went wrong. These reflections are stored and provided as context in subsequent attempts.
class ReflexionAgent:
    def __init__(self, actor, evaluator, reflector, memory):
        self.actor = actor          # The agent that attempts the task
        self.evaluator = evaluator  # Determines success/failure
        self.reflector = reflector  # Generates reflections
        self.memory = memory        # Stores reflections

    def run(self, task: str, max_trials: int = 3) -> str:
        for trial in range(max_trials):
            # Retrieve relevant past reflections
            reflections = self.memory.retrieve(task)
            # Attempt the task with reflection context
            trajectory = self.actor.run(task, reflections=reflections)
            # Evaluate the result
            success, feedback = self.evaluator.evaluate(task, trajectory)
            if success:
                return trajectory.final_answer
            # Generate and store reflection
            reflection = self.reflector.generate(
                f"Task: {task}\n"
                f"Attempt: {trajectory}\n"
                f"Feedback: {feedback}\n"
                f"What went wrong and what should be done differently?"
            )
            self.memory.store(task, reflection)
        return trajectory.final_answer  # Return the final (unsuccessful) attempt
Reflexion is notable because it achieves learning without gradient updates. The "reinforcement" is entirely verbal -- stored as natural language reflections that modify the agent's behavior through in-context learning. Shinn et al. demonstrated significant improvements on coding tasks (reaching 91.0% pass@1 on HumanEval), decision-making (ALFWorld), and reasoning tasks through iterative reflection.
This approach is particularly valuable in production systems where fine-tuning is impractical. The reflection memory acts as a lightweight, interpretable form of experience that can be inspected, edited, and shared across agent instances. For a deeper exploration of how agents store and retrieve experiences across sessions, see Article 28: Agent Memory.
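A minimal sketch of such a reflection memory follows, assuming an embed(text) function that returns a vector; the class name and the similarity-based retrieval are illustrative, not the paper's exact implementation. It matches the store/retrieve interface used by ReflexionAgent above:

import math

class ReflectionMemory:
    def __init__(self, embed):
        self.embed = embed
        self.entries: list[tuple[list[float], str]] = []

    def store(self, task: str, reflection: str) -> None:
        self.entries.append((self.embed(task), reflection))

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        """Return the k reflections whose source tasks are most similar."""
        q = self.embed(task)

        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / ((na * nb) or 1.0)

        ranked = sorted(self.entries, key=lambda e: cos(q, e[0]), reverse=True)
        return [reflection for _, reflection in ranked[:k]]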
LATS (Zhou et al., 2023) combines the LLM agent paradigm with Monte Carlo Tree Search (MCTS), treating agent trajectories as a tree to be explored systematically rather than a single path.
The key insight is that agent decision-making at each step can be viewed as a search problem: multiple possible actions exist, and the agent should explore promising branches while being willing to backtrack from unpromising ones.
                 [Initial State]
                /       |       \
       [Action A]  [Action B]  [Action C]
        /      \        |           |
    [A→D]    [A→E]    [B→F]       [C→G]
      ↓        ↓        ↓           ↓
   Score:   Score:   Score:      Score:
     0.3      0.7      0.5         0.8  ← Expand C→G further
The algorithm adapts the classic MCTS phases to LLM agents: select a promising node using an upper-confidence (UCT) rule, expand it by sampling several candidate actions from the LLM, evaluate the resulting states with an LLM-based value function (optionally running rollouts to completion), backpropagate the scores up the tree, and reflect on failed trajectories so later expansions avoid the same mistakes.
This approach is computationally expensive (many LLM calls per decision) but excels on tasks where exploration is critical, such as complex reasoning problems and multi-step planning.
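The selection step is typically implemented with the UCT rule from MCTS. The sketch below shows one plausible form; the node fields and exploration constant are illustrative assumptions, and LATS's actual state evaluation is LLM-based rather than a simple running average:

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    value_sum: float = 0.0  # accumulated evaluation scores
    visits: int = 0
    children: list["Node"] = field(default_factory=list)

def uct_select(node: Node, c: float = 1.0) -> Node:
    """Pick the child maximizing exploitation + exploration."""
    def uct(child: Node) -> float:
        if child.visits == 0:
            return float("inf")  # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=uct)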
LangGraph (part of the LangChain ecosystem) represents agent logic as a directed graph where nodes are computation steps and edges are transitions. This addresses a fundamental limitation of linear agent loops: real-world agent workflows often involve branching, parallel execution, cycles, and conditional logic that don't fit neatly into a sequential framework. For a comprehensive treatment of LangGraph's primitives, patterns, and deployment options, see the dedicated LangGraph article.
LangGraph has matured significantly since its introduction, with built-in support for persistence, human-in-the-loop approval gates, and the LangGraph Platform for deployment. The core abstraction remains the StateGraph, but the ecosystem now includes prebuilt components for common patterns like tool-calling agents and ReAct loops:
from typing import Annotated

from typing_extensions import TypedDict
from langchain_core.messages import AnyMessage
from langgraph.graph import StateGraph, END, START
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]

def call_model(state: AgentState) -> AgentState:
    """Invoke the LLM with current messages and available tools."""
    response = model.invoke(state["messages"])  # `model` is a chat model bound to `tools`
    return {"messages": [response]}

# Build the graph with prebuilt tool handling
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode(tools))
graph.add_edge(START, "agent")
graph.add_conditional_edges("agent", tools_condition)  # Route to tools or END
graph.add_edge("tools", "agent")  # Return to agent after tool execution

# Compile with persistence for conversation memory and time-travel debugging
memory = MemorySaver()
agent = graph.compile(checkpointer=memory)

# Invoke with a thread ID for persistent state
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Research topic X"}]},
    config={"configurable": {"thread_id": "session-1"}},
)
For more complex workflows, LangGraph supports subgraph composition, enabling teams to build modular agent components that can be tested independently and composed into larger systems. A research agent subgraph might be embedded within a broader customer service workflow, for example.
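As a sketch of what that composition can look like, a compiled subgraph can be added as a node in a parent graph when the two share state keys; here `research_graph` and the node names are illustrative assumptions:

research_agent = research_graph.compile()  # a subgraph built on AgentState

parent = StateGraph(AgentState)
parent.add_node("research", research_agent)  # compiled graph used as a node
parent.add_node("respond", call_model)
parent.add_edge(START, "research")
parent.add_edge("research", "respond")
parent.add_edge("respond", END)
workflow = parent.compile()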
At the opposite end of the complexity spectrum from LangGraph, OpenAI's Swarm (2024) offers a minimalist reference implementation for multi-agent handoffs. Swarm's core abstraction is intentionally simple: agents are defined as objects with instructions and a list of functions, and one agent can hand off to another by returning a new agent from a function call.
from swarm import Swarm, Agent

triage_agent = Agent(
    name="Triage",
    instructions="Determine the user's intent and hand off to the appropriate specialist.",
    functions=[transfer_to_sales, transfer_to_support],
)

sales_agent = Agent(
    name="Sales",
    instructions="Help the user with purchasing decisions.",
    functions=[check_inventory, create_quote],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I want to buy..."}],
)
Swarm is explicitly not a production framework -- OpenAI describes it as an educational and experimental tool. Its value lies in demonstrating that multi-agent coordination does not require heavy infrastructure. The handoff pattern, where one agent explicitly delegates to another, is a powerful primitive that can be implemented in any framework. For more on multi-agent patterns, see Article 27: Multi-Agent Systems.
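A framework-agnostic sketch of that primitive, assuming agents expose a respond(messages) method (an illustrative API, not Swarm's): a function that returns an agent object signals a handoff, and the driver loop swaps the active agent:

def transfer_to_sales():
    return sales_agent  # returning an agent object signals a handoff

def run_with_handoffs(agent, messages, max_turns=10):
    """Drive the conversation, swapping the active agent on each handoff."""
    for _ in range(max_turns):
        result = agent.respond(messages)  # assumed agent API
        if isinstance(result, Agent):     # a handoff occurred
            agent = result
            continue
        return result                     # a normal reply ends the loop
    return None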
AgentState is a well-defined data structure, making it easy to persist, resume, and debug. Beyond LangGraph, the state machine pattern is broadly applicable. Key states in a robust agent system:
[IDLE] → [PLANNING] → [EXECUTING] → [OBSERVING] → [DECIDING]
             ↑                                        |
             |         [REFLECTING] ←─────────────────+
             |              |                         |
             +←─────────────+                         |
                                                      ↓
                                           [RESPONDING] → [IDLE]
Each state has clear entry/exit conditions, valid transitions, and error handling. This formalism prevents agents from entering undefined states and makes the system easier to reason about.
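A minimal sketch of how these transitions can be enforced in code, using the states from the diagram above; the transition table mirrors the diagram but is otherwise illustrative:

from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    PLANNING = auto()
    EXECUTING = auto()
    OBSERVING = auto()
    DECIDING = auto()
    REFLECTING = auto()
    RESPONDING = auto()

# Each state's set of legal successors
VALID_TRANSITIONS = {
    State.IDLE: {State.PLANNING},
    State.PLANNING: {State.EXECUTING},
    State.EXECUTING: {State.OBSERVING},
    State.OBSERVING: {State.DECIDING},
    State.DECIDING: {State.REFLECTING, State.RESPONDING},
    State.REFLECTING: {State.PLANNING},
    State.RESPONDING: {State.IDLE},
}

def transition(current: State, nxt: State) -> State:
    """Reject transitions the state machine does not allow."""
    if nxt not in VALID_TRANSITIONS[current]:
        raise ValueError(f"Invalid transition: {current.name} -> {nxt.name}")
    return nxt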
AutoGPT (Significant Gravitas, 2023) popularized the concept of fully autonomous agents that can set their own goals, create tasks, and execute them without human intervention. While the original implementation had significant reliability issues, the architectural pattern influenced subsequent work: a user-supplied objective, an LLM that generates and prioritizes its own task list, an execution loop that works through the queue, and long-term memory of intermediate results.
The primary lesson from AutoGPT was negative: fully autonomous agents without proper guardrails tend to loop, waste resources, and produce unreliable results. Modern systems incorporate more structure, human oversight, and bounded autonomy.
BabyAGI introduced a cleaner version of the autonomous agent pattern focused on task management:
class TaskManager:
    def __init__(self, llm):
        self.task_queue: list[str] = []  # ordered list of pending tasks
        self.completed_tasks = []
        self.llm = llm

    def create_tasks(self, objective, last_result):
        new_tasks = self.llm.generate(
            f"Objective: {objective}\n"
            f"Last completed task result: {last_result}\n"
            f"Existing tasks: {self.task_queue}\n"
            f"Create new tasks if needed."
        )
        for task in parse_tasks(new_tasks):
            self.task_queue.append(task)

    def prioritize(self, objective):
        # Ordering is delegated to the LLM, so a plain list suffices as the queue
        reprioritized = self.llm.generate(
            f"Reprioritize these tasks for objective '{objective}':\n"
            f"{self.task_queue}"
        )
        self.task_queue = parse_prioritized_queue(reprioritized)
Park et al. (2023, "Generative Agents: Interactive Simulacra of Human Behavior") demonstrated a cognitive architecture for simulating believable human behavior. Their architecture includes a memory stream of timestamped observations, a retrieval function that scores memories by recency, importance, and relevance, periodic reflection that synthesizes observations into higher-level insights, and planning that turns those insights into daily schedules of actions.
While designed for simulation, this architecture has influenced practical agent design, particularly the memory and reflection components.
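Park et al.'s retrieval function scores each memory as a combination of recency, importance, and relevance. A sketch under simplifying assumptions (equal weights, an illustrative decay rate, and an assumed memory object with last_accessed, importance, and embedding fields):

import time

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / ((norm_a * norm_b) or 1.0)

def retrieval_score(memory, query_embedding, now=None, decay=0.995):
    """Combine recency, importance, and relevance into one retrieval score."""
    now = now if now is not None else time.time()
    hours_since_access = (now - memory.last_accessed) / 3600
    recency = decay ** hours_since_access  # exponential decay
    importance = memory.importance / 10    # LLM-rated 1-10, normalized
    relevance = cosine_similarity(memory.embedding, query_embedding)
    return recency + importance + relevance  # equal weights for simplicity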
The emergence of models with built-in chain-of-thought capabilities -- OpenAI's o1 and o3 series, DeepSeek-R1, Claude with extended thinking -- fundamentally changes the landscape of agent design. These models perform multi-step reasoning internally before producing output, which means the explicit Thought/Action/Observation loop of ReAct becomes partially redundant. The model is already thinking step by step; forcing it through an additional external reasoning scaffold can be counterproductive.
In a reasoning-native architecture, the agent loop simplifies. Rather than parsing interleaved thought and action traces, the agent issues a task to the model, the model reasons internally (often producing a hidden or visible chain-of-thought), and returns either a final answer or a structured tool call. The orchestrator manages the tool execution loop, but the reasoning happens inside the model rather than being prompted externally.
class ReasoningNativeAgent:
    """Agent design for models with built-in chain-of-thought (o1, R1, etc.)."""

    def __init__(self, reasoning_model, tools, max_steps=15):
        self.model = reasoning_model
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps

    def run(self, task: str) -> str:
        messages = [{"role": "user", "content": task}]
        for step in range(self.max_steps):
            # The model reasons internally -- no explicit "Thought:" prompting needed
            response = self.model.generate(
                messages=messages,
                tools=list(self.tools.values()),
                reasoning_effort="high",  # Provider-specific parameter
            )
            if response.tool_calls:
                # Record the assistant turn so the tool results have context
                messages.append({"role": "assistant", "tool_calls": response.tool_calls})
                for call in response.tool_calls:
                    result = self.tools[call.name].execute(call.arguments)
                    messages.append({"role": "tool", "content": result,
                                     "tool_call_id": call.id})
            else:
                return response.content
        return "Max steps reached."
The decision is not binary. Reasoning-native models excel at tasks requiring deep analytical thinking -- mathematical proofs, complex code generation, multi-constraint planning -- where the internal chain-of-thought can explore the problem space without the overhead of external tool calls. For these tasks, wrapping a reasoning model in a ReAct loop adds latency and cost without improving quality.
Explicit agent loops remain preferable when the task is primarily about information gathering rather than reasoning. A ReAct agent using a fast, inexpensive model to search databases, call APIs, and aggregate results will outperform a reasoning model on both cost and speed. The reasoning model's extended thinking is wasted on tasks like "look up X, then look up Y, then combine them."
The hybrid approach is often strongest in practice: use a reasoning model for the planning phase (where deep thinking about task decomposition adds value) and a faster model for execution steps that are primarily tool invocation. This mirrors the Plan-and-Execute pattern but with the planner benefiting from built-in reasoning capabilities. See Article 08: Few-Shot & Chain-of-Thought for the foundational prompting techniques that reasoning-native models internalize.
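A sketch of this hybrid, reusing the ReActAgent from earlier; the plan format and the model interfaces are assumptions:

class HybridAgent:
    def __init__(self, reasoning_model, fast_model, tools):
        self.planner = reasoning_model                 # deep thinking, called once
        self.executor = ReActAgent(fast_model, tools)  # cheap, called per step

    def run(self, task: str) -> str:
        # One reasoning-heavy call produces the plan; no tool loop needed here
        plan = self.planner.generate(
            f"Decompose this task into a numbered list of tool-executable steps:\n{task}"
        )
        results = []
        for step in plan.splitlines():
            if not step.strip():
                continue
            results.append(self.executor.run(
                f"Task: {task}\nStep: {step}\nPrior results: {results}"
            ))
        return results[-1] if results else ""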
Standard RAG (retrieval-augmented generation) follows a fixed pipeline: embed the query, retrieve top-k documents, stuff them into the prompt, generate an answer. This works for simple factual questions but breaks down when the initial query is ambiguous, the retrieved documents are insufficient, or the task requires synthesizing information across multiple retrieval passes.
Agentic RAG places an agent in control of the retrieval process. The agent decides what to search for, evaluates whether the retrieved results are sufficient, reformulates queries when they are not, and determines when it has gathered enough information to answer. This transforms retrieval from a single-shot operation into an iterative, adaptive process.
class AgenticRAGAgent:
    def __init__(self, llm, retriever, max_retrievals=5):
        self.llm = llm
        self.retriever = retriever
        self.max_retrievals = max_retrievals

    def run(self, question: str) -> str:
        gathered_context = []
        for attempt in range(self.max_retrievals):
            # Agent decides what to search for
            search_decision = self.llm.generate(
                f"Question: {question}\n"
                f"Information gathered so far: {gathered_context}\n"
                f"What specific information do you still need? "
                f"Generate a search query, or respond SUFFICIENT if you have enough."
            )
            if "SUFFICIENT" in search_decision:
                break
            # Retrieve and evaluate
            results = self.retriever.search(search_decision)
            evaluation = self.llm.generate(
                f"Question: {question}\n"
                f"Search query: {search_decision}\n"
                f"Results: {results}\n"
                f"Are these results relevant and sufficient? "
                f"Extract useful information or explain what is missing."
            )
            gathered_context.append({
                "query": search_decision,
                "results": results,
                "evaluation": evaluation,
            })
        # Synthesize final answer from all gathered context
        return self.llm.generate(
            f"Question: {question}\n"
            f"All gathered information: {gathered_context}\n"
            f"Provide a comprehensive answer."
        )
Query routing. The agent first classifies the incoming question to determine which retrieval source to use. A technical question might route to a code documentation index, while a policy question routes to an internal knowledge base. The routing decision itself can involve reasoning about the question's domain and intent.
Retrieval evaluation. After each retrieval step, the agent assesses whether the returned documents actually address the question. This is where agentic RAG diverges most sharply from static RAG -- a standard pipeline blindly uses whatever was retrieved, while an agentic system can recognize when results are off-topic, outdated, or insufficiently detailed and issue a refined query.
Multi-hop retrieval. Some questions require chaining information across documents. "What was the revenue impact of the product launched by the team that the new VP leads?" requires first identifying the VP, then their team, then the product, then the revenue data. An agentic retriever decomposes this naturally through iterative search.
Adaptive chunking and re-ranking. The agent can request different chunk sizes, apply re-ranking models to candidate results, or expand its search to adjacent chunks when a retrieved passage seems relevant but incomplete.
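A sketch of the re-ranking step, assuming a scorer with a score(query, passage) method such as a cross-encoder; the names are illustrative:

def rerank(query: str, candidates: list[str], scorer, top_k: int = 5) -> list[str]:
    """Order candidate passages by a relevance model and keep the top k."""
    scored = [(scorer.score(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]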
Agentic RAG has become one of the most common production agent patterns because it addresses the primary failure mode of naive RAG -- retrieval quality -- while remaining relatively bounded in scope and cost. For deeper coverage of retrieval mechanisms, see Article 16: Retrieval Strategies and Article 17: Advanced RAG.
An agent without robust termination logic is a liability. It may loop indefinitely, accumulating costs while producing no useful output. It may give up too early when the task is nearly complete. It may produce a degraded result without communicating the degradation to the caller. Termination strategy is not an afterthought -- it is a core architectural decision.
The most straightforward termination mechanism is a cost ceiling. Track cumulative token usage and tool invocation costs, and halt the agent when a budget is reached:
from dataclasses import dataclass

@dataclass
class AgentResult:  # assumed result type; fields mirror the usage below
    answer: str
    status: str
    cost: float

class BudgetedAgent:
    def __init__(self, agent, max_cost_usd=1.0, max_steps=20, max_tokens=100_000):
        self.agent = agent
        self.max_cost = max_cost_usd
        self.max_steps = max_steps
        self.max_tokens = max_tokens

    def run(self, task: str) -> AgentResult:
        accumulated_cost = 0.0
        accumulated_tokens = 0
        for step in range(self.max_steps):
            result = self.agent.step(task)
            accumulated_cost += result.cost
            accumulated_tokens += result.tokens_used
            if result.is_complete:
                return AgentResult(answer=result.answer, status="complete",
                                   cost=accumulated_cost)
            if accumulated_cost >= self.max_cost:
                return AgentResult(answer=self._best_partial_answer(),
                                   status="budget_exceeded",
                                   cost=accumulated_cost)
            if accumulated_tokens >= self.max_tokens:
                return AgentResult(answer=self._best_partial_answer(),
                                   status="token_limit_reached",
                                   cost=accumulated_cost)
        return AgentResult(answer=self._best_partial_answer(),
                           status="max_steps_reached", cost=accumulated_cost)
Rather than terminating on resource limits alone, quality gates evaluate whether the agent's current output meets a minimum quality threshold. This can be implemented as a separate evaluator model that scores the agent's answer, a set of heuristic checks (does the answer address all parts of the question? does it contain supporting evidence?), or a confidence assessment from the agent itself.
Quality gates are particularly important in systems where a partial but high-quality answer is preferable to a complete but unreliable one. The gate acts as a checkpoint: if the answer passes, return it; if not, either continue or escalate.
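A minimal quality-gate sketch, assuming an evaluator LLM with a generate(prompt) method; the scoring prompt and threshold are illustrative:

def passes_quality_gate(llm, question: str, answer: str, threshold: float = 0.7) -> bool:
    """Score the answer with an evaluator model and compare to a threshold."""
    verdict = llm.generate(
        f"Question: {question}\nAnswer: {answer}\n"
        f"Score 0-1: does the answer address every part of the question "
        f"with supporting evidence? Respond with only the number."
    )
    try:
        return float(verdict.strip()) >= threshold
    except ValueError:
        return False  # an unparseable verdict fails closed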
Well-designed agents communicate their limitations rather than silently producing poor output. A graceful degradation strategy includes returning the best partial answer together with an explicit status (as in the BudgetedAgent above), flagging which parts of the task were not completed, reporting the agent's own confidence, and escalating to a human when the quality gate cannot be passed within budget.
These strategies interact with agent evaluation -- see Article 30: Agent Evaluation for metrics that capture termination quality, and Article 39: Cost Optimization for the economics of budget-based termination.
A significant evolution in agent capability is the ability to interact with graphical interfaces directly. Anthropic's computer use pattern enables agents to observe screenshots, identify UI elements, and execute mouse and keyboard actions. This represents a departure from the API-centric tool use model -- instead of calling structured functions, the agent operates at the human interaction layer.
The architecture for computer use agents follows a perception-action cycle: capture a screenshot of the current screen, let the model identify the relevant UI elements and choose an action, execute that action (mouse movement, click, or keystrokes), then capture a new screenshot and repeat until the goal is reached.
This pattern is inherently fragile -- UI layouts change, elements load asynchronously, and pixel coordinates are resolution-dependent. Production implementations require robust retry logic and state verification at each step.
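A sketch of this cycle, with hypothetical take_screenshot() and execute_action() helpers and an illustrative response schema (not Anthropic's actual API):

def computer_use_loop(model, goal: str, max_steps: int = 20) -> str:
    """Run the perception-action cycle until the model reports completion."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        screenshot = take_screenshot()  # perception
        messages.append({"role": "user",
                         "content": [{"type": "image", "data": screenshot}]})
        response = model.generate(messages=messages)  # reasoning
        if response.action is None:  # model reports the goal is reached
            return response.content
        execute_action(response.action)  # e.g. click or type
        messages.append({"role": "assistant", "content": str(response.action)})
    return "Max steps reached without completing the goal."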
Anthropic's recommended agent pattern for tool use involves a tight loop where the model calls tools, observes results, and critically evaluates whether the result matches expectations before proceeding. This self-correction pattern is distinct from Reflexion's trial-level reflection; it operates at the individual step level within a single execution.
class SelfCorrectingToolLoop:
    def __init__(self, model, tools, max_retries_per_step=2):
        self.model = model
        self.tools = tools
        self.max_retries = max_retries_per_step  # assumed to be enforced inside _execute_tool

    def execute_with_correction(self, messages: list, max_turns: int = 20) -> str:
        # A turn cap prevents the correction loop from running forever
        for _ in range(max_turns):
            response = self.model.generate(messages=messages, tools=self.tools)
            if not response.tool_calls:
                return response.content
            for tool_call in response.tool_calls:
                result = self._execute_tool(tool_call)
                messages.append({"role": "tool", "content": result,
                                 "tool_call_id": tool_call.id})
            # The model evaluates whether the tool result is usable on its
            # next turn: if the result indicates an error or unexpected
            # output, it can retry with modified parameters
        return "Turn limit reached without a final answer."
The key insight is that the model itself serves as the error detector. When a tool returns an unexpected result -- an API error, empty data, a malformed response -- the model recognizes the problem and either retries with modified parameters, tries an alternative tool, or adjusts its plan. This self-correction capability is what separates robust production agents from brittle prototypes.
For the foundational mechanics of how tools are defined and invoked, see Article 25: Function Calling & Tool Integration. For how these patterns extend to multi-agent coordination, see Article 27: Multi-Agent Systems.
| Architecture | Best For | Complexity | Reliability | Cost |
|---|---|---|---|---|
| ReAct | Simple tool-use tasks, Q&A | Low | Medium | Low |
| Plan-and-Execute | Multi-step workflows | Medium | Medium-High | Medium |
| Reflexion | Tasks where learning from failure matters | Medium | High | Medium-High |
| LATS | Complex reasoning, exploration-heavy tasks | High | High | Very High |
| Reasoning-Native | Deep analysis, complex planning | Low-Medium | High | Medium-High |
| Agentic RAG | Knowledge-intensive Q&A, research | Medium | High | Medium |
| LangGraph/State Machine | Production systems, complex workflows | Medium-High | High | Varies |
| Autonomous (AutoGPT-style) | Open-ended exploration | High | Low | Very High |
Choose your architecture based on: task horizon and complexity (short tool-use tasks suit ReAct; long workflows suit Plan-and-Execute or a state machine), reliability requirements, latency and cost budgets, and how much human oversight and auditability the deployment demands.
Each architecture requires carefully designed prompts. Key principles: use consistent, unambiguous markers (Thought:, Action:, Plan:) that the parser can reliably extract; describe each tool precisely, including argument formats; and constrain the output format so malformed responses are rare.

Agent runs can take many seconds or minutes to complete. Streaming intermediate results is essential for user experience:
async def stream_agent_execution(agent, task):
    async for event in agent.astream_events(task):
        if event["type"] == "thought":
            yield f"Thinking: {event['content']}\n"
        elif event["type"] == "tool_call":
            yield f"Using tool: {event['tool_name']}\n"
        elif event["type"] == "tool_result":
            yield f"Got result from {event['tool_name']}\n"
        elif event["type"] == "final_answer":
            yield f"\nAnswer: {event['content']}\n"
This article connects to several related topics covered elsewhere in this series: Article 08 (Few-Shot & Chain-of-Thought) for the reasoning techniques agents build on, Article 25 (Function Calling & Tool Integration) for tool mechanics, Article 27 (Multi-Agent Systems) for coordination patterns, Article 28 (Agent Memory) for persistence across sessions, Article 30 (Agent Evaluation) for measuring agent quality, and Article 39 (Cost Optimization) for the economics of agent execution.