Lesson 13 of 19 in Phase 4 · Agents & Orchestration

Tree of Thoughts: Deliberate Search Over Reasoning Steps

🤖 Phase 4 · Agents & OrchestrationIntermediate~8 min read

← PreviousSupervisor Multi-Agent Lab: Routing Work Across Specialist Workers Next →Self-Refine: Iterative Refinement with Self-Feedback

Chain-of-Thought asks a model to think out loud, but it does so left-to-right and commits to a single path: one bad intermediate step and the whole answer is sunk, with no way to reconsider. Tree of Thoughts (ToT) breaks that constraint. Instead of one reasoning trajectory it grows a tree of them — at each step it proposes several candidate next thoughts, evaluates how promising each resulting partial solution is, and searches the tree (keeping the best few branches) so it can look ahead and prune dead ends. Yao et al. introduced it in "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (NeurIPS 2023, arXiv:2305.10601), where on the Game of 24 it lifted GPT-4 from 4% (chain-of-thought) to 74% success. This lesson is the code-first companion to that paper: a runnable ToT in this repo's agents-lab/ package, paired with the broader agent architectures survey.

Mental Model

The mental model for ToT is turn reasoning into deliberate search: propose a fan of thoughts, score them, keep the best breadth, and repeat — buying lookahead and backtracking at the price of many more LLM calls. Chain-of-Thought is a single greedy walk down the tree; it never reconsiders, so its accuracy is hostage to the first plausible-but-wrong step it takes. ToT keeps several partial paths alive at once, asks a value heuristic which ones still look winnable, and spends its next round of proposals only on the survivors. A thought that reaches the goal can short-circuit the whole search.

The three verbs are the entire algorithm. Propose generates k distinct next thoughts from a frontier state (the branching factor). Evaluate maps each candidate partial path to a scalar in [0, 1] — the paper's trick is to reuse the LLM itself as the value function, prompting it to rate how promising the reasoning is. Search is the controller that decides which states to expand next; here it is breadth-first / beam search keeping the top breadth states by value. Whether a finished trajectory is actually correct is, as always, an agent evaluation question.

Propose → Evaluate → Search

Propose — k candidate thoughts

A "thought" is a coherent intermediate step — a sentence of reasoning, a partial plan, an equation. From each state on the frontier, ToT prompts the model to emit n_propose distinct next thoughts. This is the fan-out: with breadth survivors and n_propose thoughts each, you generate breadth * n_propose candidates per depth. The lab's prompt also tells the model to prefix a thought with ANSWER: when it believes it has reached the final answer, which is how the search knows it can stop.

python

PROPOSE_SYSTEM = (
    "Propose {n} distinct next reasoning steps that move toward solving the task. "
    "One step per line, no numbering. If a step reaches the final answer, prefix "
    "it with 'ANSWER: '."
)

Evaluate — a value heuristic

Each new partial path is scored by a value function (task, partial_path) -> float in [0, 1]. The default in the lab reuses the LLM as a judge: it prompts the model to rate the reasoning 0–10 and normalizes to [0, 1]. This is the paper's core idea — no separately trained heuristic, the same model both proposes and evaluates. Because the value function is the search's only signal, it is also the cleanest injection point: pass your own value_fn to make tests deterministic and offline, or to plug in a learned scorer or a verifier in production.

python

VALUE_SYSTEM = (
    "Rate how promising this partial reasoning is for solving the task, 0-10. "
    "Reply with the number only."
)

Search — keep top `breadth`, BFS / beam

The controller drives everything. The lab implements breadth-first beam search: at each depth, expand every frontier state, score all candidates, sort by value, and keep the top breadth. That fixed-width cut is the pruning — low-value branches are discarded and never expanded again, so compute concentrates on promising regions. The loop runs for at most max_depth levels; the best-scoring path seen becomes the answer if nothing fires ANSWER: first.

python

for _ in range(max_depth):
    candidates: list[list[str]] = []
    for path in frontier:
        for thought in _propose(model, task, path, n_propose):
            new_path = [*path, thought]
            if _is_answer(thought):                       # early stop
                return ToTResult(_answer_text(thought), new_path, 1.0, nodes)
            candidates.append(new_path)
    scored = sorted(((value(task, p), p) for p in candidates),
                    key=lambda x: x[0], reverse=True)
    frontier = [p for _, p in scored[:breadth]]            # beam: keep top breadth

The original paper also describes a DFS variant with explicit backtracking and bounded lookahead; the BFS/beam controller here is the simplest member of the same family and the easiest to reason about cost-wise.

ToT vs Chain-of-Thought vs LATS

Versus Chain-of-Thought. CoT is the degenerate case of ToT where n_propose = 1 and breadth = 1: a single un-evaluated path. It has no notion of "this branch looks bad, try another." ToT adds the value heuristic and the beam, so it can explore alternatives and abandon losing lines — which is exactly why it wins on search-heavy tasks like Game of 24 where the first plausible step is often a trap.

Versus LATS. Both grow a tree and reuse the LLM as a value function, so it is easy to conflate them. The difference is the search controller:

ToT is beam search + a value heuristic. It expands by depth, keeps a fixed-width frontier, and (in this implementation) never revisits a pruned branch. Deterministic, simple, cheaper.
LATS is Monte-Carlo Tree Search + backpropagation. It selects nodes by UCT (exploit vs explore), expands one promising leaf at a time, and backpropagates values up the tree so visit counts steer future selection — plus it folds in environment feedback and reflection. It explores and backtracks more aggressively, at a higher LLM-call cost.

In short: ToT prunes by a static width cut; LATS allocates its next call wherever the explore/exploit math points. See the LATS lab for the MCTS version, and the ReAct lab for the greedy single-trajectory baseline both of these improve on.

Run it

The runnable module is agents-lab/agents_lab/tree_of_thoughts.py. solve_tot takes the task and the three search knobs and returns a ToTResult with answer, path, value, and nodes_evaluated.

python

from agents_lab.tree_of_thoughts import solve_tot

# Injectable value_fn: (task, partial_path) -> float in [0, 1].
# Returning a constant makes the search deterministic and offline (great for tests).
res = solve_tot(
    "task",
    value_fn=lambda task, path: 0.9,
    n_propose=2,    # k candidate thoughts proposed per frontier state
    breadth=2,      # beam width: top-N partial paths kept each depth
    max_depth=3,    # max reasoning steps before returning the best path
)
print(res.answer, res.value)

Omit value_fn and ToT falls back to LLM self-evaluation (the 0–10 prompt above, normalized). Omit llm and it uses the lab's default model. A proposed thought prefixed with ANSWER: stops the search early and returns immediately with value 1.0 — so a confident model can finish in one depth.

From the command line:

bash

uv run python -m agents_lab.cli tot "Use 4, 9, 10, 13 and +,-,*,/ to make 24."

DeepSeek is the only paid API this lab calls; everything else (the proposal parsing, the beam controller, the ValueFn injection point) runs locally, so you can exercise the full search with a stubbed value_fn before spending a token. This is the runnable companion to the agent architectures overview — read that for where deliberate-search agents sit in the wider design space.

Knobs that matter

n_propose (branching factor k) — more candidates per state means a wider net but a linear increase in proposal calls. The paper uses small values (often 1–5).
breadth (beam width) — how many partial paths survive each cut. breadth = 1 collapses ToT toward CoT; larger values buy robustness against an early wrong turn at the cost of more evaluations.
max_depth — the horizon. Set it to the number of reasoning steps the task genuinely needs; too shallow and the search returns its best incomplete path.
value_fn quality — the search is only as good as its heuristic. A noisy value function prunes good branches; this is the first thing to harden (with a verifier or task-specific scorer) when ToT underperforms.

Total LLM calls scale roughly as max_depth * breadth * (n_propose + n_propose) — proposals plus evaluations — which is the price of deliberation over the single call CoT makes. Tune the three knobs against that budget.

Sources: arXiv:2305.10601, princeton-nlp/tree-of-thought-llm

Continue Learning

← PreviousSupervisor Multi-Agent Lab: Routing Work Across Specialist Workers Next →Self-Refine: Iterative Refinement with Self-Feedback

On this page

Tree of Thoughts: Deliberate Search Over Reasoning Steps

Mental Model

Propose → Evaluate → Search

Propose — k candidate thoughts

Evaluate — a value heuristic

Search — keep top breadth, BFS / beam

ToT vs Chain-of-Thought vs LATS

Run it

Knobs that matter

Search — keep top `breadth`, BFS / beam