โ† all lessons/๐Ÿค– Phase 4 ยท Agents & Orchestration/#103
Lesson 11 of 19 in Phase 4 ยท Agents & Orchestration

LATS: Tree Search Agents with Monte-Carlo Tree Search

๐Ÿค– Phase 4 ยท Agents & OrchestrationIntermediate~8 min read
Recommended prerequisite:#102 Plan-and-Execute Agents: Decompose First, Then Act
โ† PreviousPlan-and-Execute Agents: Decompose First, Then ActNext โ†’Supervisor Multi-Agent Lab: Routing Work Across Specialist Workers

Most agent loops are greedy: they commit to one trajectory, and a single bad reasoning step or tool call sends the whole run off a cliff with no way back. LATS โ€” Language Agent Tree Search โ€” replaces that single committed path with a search over many candidate trajectories. It borrows Monte-Carlo Tree Search (MCTS) from game-playing AI and adapts the four-phase loop (select, expand, evaluate, backpropagate) to language agents, using the LLM itself as both the action generator and the value function. Zhou et al. introduced it in "Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models" (ICML 2024, arXiv:2310.04406), where it lifted HotPotQA exact-match from ReAct's 32% to 71% and pushed HumanEval pass-rate to 94.4% with GPT-4. This lesson is the code-first companion: a runnable LATS in this repo's agents-lab/ package, paired with the broader agent architectures survey.

Mental Model

The mental model for LATS is search a tree of trajectories instead of committing to one โ€” you buy deliberation and backtracking at the price of many more LLM calls. A plain ReAct agent is a depth-first walk that never reconsiders: each step is final. LATS keeps a tree of partial trajectories, scores them, and spends its next LLM call wherever the score-vs-uncertainty math says the most promising-yet-underexplored branch is. When a branch fails, it doesn't just die โ€” its low value flows back up the tree (and, in the full algorithm, a verbal reflection is attached as context), so the next iteration is steered away from the same mistake.

The deeper idea is that LATS unifies three things that earlier methods kept separate. Reasoning (chain-of-thought "thoughts"), acting (tool calls and observations, exactly as in ReAct), and planning (the MCTS deliberation over futures) all live inside one search. ReAct gives you the action space; MCTS gives you a principled way to explore it instead of greedily autoregressing through it. Whether the search actually succeeds is judged on the trajectory, which is squarely an agent evaluation problem.

The MCTS Loop, Adapted to Language Agents

Classic MCTS runs four phases per iteration over a search tree where each node is a game state. LATS keeps the phases but swaps the domain: a node is now a partial trajectory (a candidate answer, or a reasoning/acting step), the moves are LLM-generated, and the reward comes from a value function plus any environment feedback.

Selection โ€” UCT balances exploit vs explore

Starting at the root, LATS walks down the tree, at each level picking the child that maximizes the UCT (Upper Confidence bounds applied to Trees) score:

UCT(node) = q(node) + c * sqrt( ln N(parent) / N(node) )

The first term q is the node's average value (exploitation โ€” go where things have looked good). The second term grows for nodes with few visits relative to their parent (exploration โ€” try branches you haven't probed). The constant c tunes the trade-off. A node with zero visits is treated as infinitely attractive, so every freshly expanded child is guaranteed at least one look. This is the same formula AlphaGo-style search uses; the only difference is what a "state" means.

Expansion โ€” the LLM proposes candidate next steps

Once selection lands on a leaf, LATS asks the LLM to sample n distinct candidate continuations from that node โ€” candidate final answers, or candidate next reasoning/acting steps in a longer-horizon task. This is the key adaptation: instead of one greedy next token-sequence, you draw a small fan of genuinely different children. The branching factor n is your main cost/quality dial.

Evaluation โ€” a value function scores each node

Each new child gets a scalar score in [0, 1]. LATS's default trick is to reuse the LLM as a value function: prompt it to rate how well a candidate solves the task (e.g. on a 0โ€“10 scale, normalized), avoiding any separately-trained heuristic. In tasks with an environment, this LM self-eval is combined with external feedback (test pass/fail, tool results). Crucially the value function is a clean injection point โ€” you can swap in a deterministic scorer for tests or a learned model in production.

Backpropagation โ€” update visits and values up the tree

After scoring a child, its value flows back along the path to the root. Each ancestor's visit count increments and its running value updates:

N(node)      <- N(node) + 1
value(node)  <- (reward + N_old * value_old) / N(node)

So a node's q becomes the average reward of all trajectories that passed through it. High-value subtrees accumulate high q and get revisited; dead ends accumulate low q and get starved by the UCT math. (The original paper also runs a simulation/rollout to a terminal state before backpropagating; the lab keeps a one-step variant for clarity and cost.)

Reflection โ€” learning from failed branches

When a trajectory ends in failure, full LATS prompts the LLM to write a short verbal reflection summarizing what went wrong, and stores it as extra context for later iterations. This is the same self-improvement signal you build in the Reflexion lab โ€” LATS layers it on top of the value signal so the tree learns semantically (in words), not just numerically (in q-values).

Tradeoffs: When Tree Search Earns Its Cost

LATS is not a free upgrade. The math is unforgiving: per iteration you pay roughly n expansion calls plus n evaluation calls, multiplied across iterations and tree depth. A run that a ReAct agent finishes in 5 LLM calls can cost LATS 30โ€“50+. That is latency and money.

The payoff shows up specifically on hard, verifiable tasks โ€” competitive coding, multi-hop QA, web navigation, math โ€” where (a) a single greedy path has a real chance of going wrong, and (b) there's a reliable way to score progress (unit tests, a checker, a trustworthy LLM judge). There, the backtracking and deliberation convert extra compute into a large accuracy jump. The paper's 32% โ†’ 71% on HotPotQA is the canonical example.

It is not worth it when the task is easy enough that one pass usually succeeds, when latency budgets are tight (interactive chat), or โ€” most dangerously โ€” when your value function is noisy. MCTS amplifies whatever signal the value function gives it; a bad scorer makes the search confidently explore the wrong subtree. Treat the quality of value_fn as the load-bearing assumption. Practical guidance: reach for LATS only after a ReAct or Reflexion baseline has plateaued, and keep n_expand and iterations small until you've confirmed the value signal is trustworthy.

Run it

The runnable LATS lives at agents-lab/agents_lab/lats.py. It implements the MCTS loop directly (a search, not a state machine, so it is not modeled as a LangGraph graph) with an injectable value function. DeepSeek is the only paid API โ€” and because value_fn is injectable, the example below runs with zero API calls by passing a deterministic scorer.

python
from agents_lab.lats import run_lats

# value_fn: (task, candidate) -> float in [0, 1]. Deterministic here, so no LLM calls.
res = run_lats(
    "pick the best answer",
    value_fn=lambda task, cand: 0.95 if "good" in cand else 0.2,
    n_expand=2,     # candidates proposed per expansion (branching factor)
    iterations=4,   # MCTS rounds before giving up
)
print(res.answer, res.value, res.nodes_explored)

run_lats returns a LatsResult with three fields:

  • answer โ€” the highest-q node found.
  • value โ€” that node's averaged value, in [0, 1].
  • nodes_explored โ€” total candidates expanded and scored (your cost proxy).

The search stops early the moment a node's value reaches solved_threshold (default 0.9), otherwise it runs the full iterations budget. Omit value_fn and LATS falls back to LLM self-evaluation with DeepSeek โ€” the model scores each candidate 0โ€“10 and normalizes to [0, 1]:

python
res = run_lats("Write a one-line Python function to reverse a string.")
print(res.answer)        # best candidate the tree found
print(res.value)         # self-eval value of that node
print(res.nodes_explored)

Tuning knobs map straight onto the algorithm:

python
res = run_lats(
    task="...",
    n_expand=3,            # wider fan: more diversity, more cost
    iterations=8,          # deeper search: more backtracking
    c=1.4,                 # UCT exploration constant (higher = explore more)
    solved_threshold=0.9,  # early-exit value
)

From the command line:

bash
uv run python -m agents_lab.cli lats "pick the best answer"

Raise n_expand and iterations together and watch nodes_explored climb โ€” that number is the deliberation budget you're spending, and the whole point of LATS is deciding when buying more of it is worth the cost.

Where to Go Next

LATS is the heaviest, most deliberate point on the autonomy-vs-control spectrum laid out in agent architectures. Build up to it: start from the greedy baseline in the ReAct lab, add failure-memory with the Reflexion lab, and reach for tree search only once you've confirmed a single trajectory genuinely isn't enough and you have a value function you trust. Judge all three the same way โ€” on the agent evaluation of the full trajectory, not just the final answer.

โ† PreviousPlan-and-Execute Agents: Decompose First, Then ActNext โ†’Supervisor Multi-Agent Lab: Routing Work Across Specialist Workers