Lesson 15 of 19 in Phase 4 · Agents & Orchestration

ReWOO and LLMCompiler: Efficient Tool Planning

Phase 4 · Agents & Orchestration · Intermediate · ~8 min read

Recommended prerequisite: #106 Self-Refine: Iterative Refinement with Self-Feedback

The default agent loop is wasteful. In ReAct, the model reasons, calls a tool, waits for the observation, then re-reads the entire growing transcript to decide the next step — paying for every prior token on every turn, and serializing tool calls that could have run at the same time. Two papers attack this from different angles. ReWOO plans the whole chain of tool calls up front, before any observation comes back, so the model reasons once instead of once per step. LLMCompiler goes further: it plans a DAG of tasks and runs the independent ones in parallel. This lesson is the code-first companion to two runnable modules — agents_lab/rewoo.py and agents_lab/llm_compiler.py — and a sibling to agent architectures and plan-and-execute.

Mental Model

Reason once, not once per observation, and parallelize the calls that don't depend on each other. ReAct interleaves thought and observation: each tool result is fed back into the context and the model re-reasons over the whole history. That is the source of both the token blowup (the context grows with every step, so the cumulative prompt tokens grow quadratically in the number of steps) and the latency (strictly sequential round-trips). ReWOO decouples reasoning from observations: the planner writes a complete plan with placeholder variables for results it cannot yet see, a worker fills the placeholders by executing tools, and a solver reads the filled-in evidence once. LLMCompiler keeps the up-front plan but treats it as a dependency graph, dispatching every independent branch concurrently. Both are extensions of plain tool use; the difference is when you decide what to call, and how many tools you call at a time.

ReWOO: Planner → Worker → Solver

ReWOO (Reasoning WithOut Observation) splits the agent into three roles. The Planner emits the entire plan as a list of steps, each binding a result variable:

Plan: multiply 6 by 7.
#E1 = calculator[6 * 7]
Plan: add 1 to that.
#E2 = calculator[#E1 + 1]

Critically, the planner has never seen a tool output. It commits to the full sequence blind, using #E1, #E2 … as forward references to results that don't exist yet. This is the decoupling: reasoning happens once, in one prompt, instead of being re-triggered after every observation.
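The plan format above is regular enough to parse with one regular expression. The following is a minimal sketch of such a parser, not the code in agents_lab/rewoo.py: the names `STEP_RE` and `parse_plan` are hypothetical, and the pattern assumes every step line has exactly the `#E<n> = tool[argument]` shape shown in the example.

```python
import re

# Hypothetical pattern for lines such as "#E1 = calculator[6 * 7]".
# The real parser in agents_lab/rewoo.py may differ.
STEP_RE = re.compile(r"^#(E\d+)\s*=\s*(\w+)\[(.*)\]\s*$")


def parse_plan(text: str) -> list[tuple[str, str, str]]:
    """Return (variable, tool_name, argument) for every step line."""
    steps = []
    for line in text.splitlines():
        m = STEP_RE.match(line.strip())
        if m:  # "Plan: ..." lines are commentary and are skipped
            steps.append((m.group(1), m.group(2), m.group(3).strip()))
    return steps


plan_text = """Plan: multiply 6 by 7.
#E1 = calculator[6 * 7]
Plan: add 1 to that.
#E2 = calculator[#E1 + 1]"""

print(parse_plan(plan_text))
# [('E1', 'calculator', '6 * 7'), ('E2', 'calculator', '#E1 + 1')]
```

Note that `#E1` survives inside the second argument untouched: the parser only records the forward reference, and resolving it is left to the execution stage.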

The Worker executes the plan top to bottom, substituting earlier results into later inputs:

python

# from agents_lab/rewoo.py — the worker loop
evidence: dict[str, str] = {}
for var, tool_name, arg in plan:
    resolved = _VAR_RE.sub(lambda m: evidence.get(m.group(0)[1:], m.group(0)), arg)
    tool = TOOLS_BY_NAME.get(tool_name)
    evidence[var] = tool.invoke(resolved) if tool else f"error: unknown tool {tool_name}"

When the worker reaches #E2 = calculator[#E1 + 1], it looks up E1 in the evidence dict (now "42") and calls the calculator tool with "42 + 1". No LLM call happens during execution; the worker is pure substitution and dispatch.
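To see the substitution step run end to end, here is a self-contained sketch of the same loop with stand-ins for the pieces the excerpt does not show. `VAR_RE` is an assumed shape for the module's `_VAR_RE`, `TOOLS` stands in for `TOOLS_BY_NAME`, and the toy `calculator` is a plain function rather than whatever tool object the lab actually uses.

```python
import ast
import operator
import re

VAR_RE = re.compile(r"#E\d+")  # assumed shape of the module's _VAR_RE

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expr: str) -> str:
    """Toy stand-in tool: evaluate +, -, *, / over numbers without eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(ev(ast.parse(expr, mode="eval").body))


TOOLS = {"calculator": calculator}  # stand-in for TOOLS_BY_NAME


def run_worker(plan: list[tuple[str, str, str]]) -> dict[str, str]:
    """Execute steps in order, replacing #E<n> with earlier results."""
    evidence: dict[str, str] = {}
    for var, tool_name, arg in plan:
        # Unknown variables are left as-is so the failure stays visible.
        resolved = VAR_RE.sub(lambda m: evidence.get(m.group(0)[1:], m.group(0)), arg)
        tool = TOOLS.get(tool_name)
        evidence[var] = tool(resolved) if tool else f"error: unknown tool {tool_name}"
    return evidence


plan = [("E1", "calculator", "6 * 7"), ("E2", "calculator", "#E1 + 1")]
print(run_worker(plan))  # {'E1': '42', 'E2': '43'}
```

An unknown tool name does not raise; it is recorded as an error string in the evidence, so a later stage still receives a complete evidence block.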

The Solver sees the original task plus the gathered evidence block and writes the final answer:

python

evidence_block = "\n".join(f"#{k} = {v}" for k, v in evidence.items())
answer = model.invoke([
    SystemMessage(SOLVER_SYSTEM),
    HumanMessage(f"Task: {task}\nEvidence:\n{evidence_block}"),
]).content

Why this is cheaper

In ReAct, the prompt for step n contains the system prompt, the task, and all n−1 prior thought/action/observation triples. Each prompt therefore grows linearly with n, the total prompt-token cost grows quadratically in the number of steps, and every step is a fresh LLM round-trip. ReWOO makes exactly two LLM calls regardless of plan length: one to plan, one to solve. The worker's tool dispatch costs no LLM tokens when the tools are local, as they are here. The paper reports roughly 5× token efficiency and a ~4% accuracy gain on HotpotQA versus ReAct, plus better robustness when a tool fails (a bad observation can't derail subsequent reasoning, because the reasoning already happened). The trade-off: the planner commits without feedback, so it cannot adapt the plan to a surprising result. ReWOO suits tasks whose structure is knowable in advance.
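The growth rates are easier to see with numbers. The sketch below counts prompt tokens only, under a deliberately simple model: every ReAct step re-reads a fixed base (system prompt plus task) and all earlier triples, while ReWOO pays for one planner prompt and one solver prompt that carries all the evidence. The function names and the sizes 500, 150, and 100 are illustrative assumptions, not measurements from the paper or the lab modules.

```python
def react_prompt_tokens(steps: int, base: int, per_step: int) -> int:
    """Sum of prompt sizes when step n re-reads base + (n - 1) earlier triples."""
    return sum(base + (n - 1) * per_step for n in range(1, steps + 1))


def rewoo_prompt_tokens(steps: int, base: int, per_evidence: int) -> int:
    """One planner prompt plus one solver prompt that carries all the evidence."""
    planner = base
    solver = base + steps * per_evidence
    return planner + solver


# Illustrative sizes only; not measurements from the paper or the lab modules.
BASE, PER_STEP, PER_EVIDENCE = 500, 150, 100

for n in (2, 5, 10):
    print(n, react_prompt_tokens(n, BASE, PER_STEP), rewoo_prompt_tokens(n, BASE, PER_EVIDENCE))
# 2 1150 1200
# 5 4000 1500
# 10 11750 2000
```

Under these assumptions the two approaches are roughly even at two steps, and the gap widens quickly after that because the ReAct total has a term proportional to n(n−1)/2 while the ReWOO total is linear in n.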

LLMCompiler: Planner → Executor (parallel waves) → Joiner

ReWOO plans a linear chain — each #E step runs after the previous one even when it doesn't have to. LLMCompiler keeps the up-front planning but recognizes that a plan is really a DAG: tasks only need to wait for tasks they actually reference. Borrowing from compiler design, it has a Function Calling Planner, a Task Fetching Unit, and a parallel Executor.

The Planner emits numbered tasks; arguments reference earlier results with $k:

1. calculator[2 * 3]
2. calculator[10 - 4]
3. calculator[$1 + $2]

Tasks 1 and 2 share no dependency, so they can run simultaneously. Task 3 depends on both. The parser turns each line into a Task with an explicit dependency set:

python

# from agents_lab/llm_compiler.py
tid, tool, arg = int(m.group(1)), m.group(2).lower(), m.group(3).strip()
deps = {int(d) for d in _DEP_RE.findall(arg)}   # $1, $2 -> {1, 2}
tasks[tid] = Task(tid, tool, arg, deps)
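The excerpt above relies on a match object, a dependency regex, and a `Task` type that are not shown. A self-contained sketch of how those pieces might fit together follows; `TASK_RE`, `DEP_RE`, the `Task` field names, and `parse_tasks` are all assumptions made for illustration, not the module's actual definitions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns; the module's own regexes are not shown in the lesson.
TASK_RE = re.compile(r"^(\d+)\.\s*(\w+)\[(.*)\]\s*$")
DEP_RE = re.compile(r"\$(\d+)")


@dataclass
class Task:
    tid: int
    tool: str
    arg: str
    deps: set[int] = field(default_factory=set)


def parse_tasks(text: str) -> dict[int, Task]:
    """Turn numbered plan lines into Task objects keyed by task id."""
    tasks: dict[int, Task] = {}
    for line in text.splitlines():
        m = TASK_RE.match(line.strip())
        if not m:
            continue  # ignore anything that is not a task line
        tid, tool, arg = int(m.group(1)), m.group(2).lower(), m.group(3).strip()
        deps = {int(d) for d in DEP_RE.findall(arg)}  # $1, $2 -> {1, 2}
        tasks[tid] = Task(tid, tool, arg, deps)
    return tasks


plan_text = """1. calculator[2 * 3]
2. calculator[10 - 4]
3. calculator[$1 + $2]"""

tasks = parse_tasks(plan_text)
for t in tasks.values():
    print(t.tid, t.tool, sorted(t.deps))
# 1 calculator []
# 2 calculator []
# 3 calculator [1, 2]
```

Extracting the dependency set at parse time means the scheduler never has to inspect argument strings; it only compares sets of task ids.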

The Executor, which in this module also plays the role of the paper's Task Fetching Unit, repeatedly selects every task whose dependencies are already satisfied and runs that whole group in one parallel wave:

python

with ThreadPoolExecutor(max_workers=max_workers) as pool:
    while pending:
        ready = [tid for tid in pending if tasks[tid].deps <= results.keys()]
        if not ready:        # cycle or dangling ref — bail out defensively
            break
        ready.sort()
        wave_results = list(pool.map(lambda tid: _run_task(tasks[tid], results), ready))
        for tid, res in zip(ready, wave_results, strict=True):
            results[tid] = res
        waves.append(ready)
        pending -= set(ready)

For the plan above, the executor produces waves == [[1, 2], [3]]: tasks 1 and 2 fire together in wave 1, and task 3 runs in wave 2 once its inputs exist. Wall-clock time scales with the number of waves (the DAG's depth) rather than the number of tasks, provided the pool has enough workers for each wave. Three sequential calls collapse to two waves.
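The wave grouping can be studied in isolation, without threads or tools. The sketch below takes only a mapping from task id to dependency set and returns the waves; `schedule_waves` is a hypothetical helper, and unlike the excerpt above it raises on a cycle or dangling reference instead of breaking out of the loop silently.

```python
def schedule_waves(deps: dict[int, set[int]]) -> list[list[int]]:
    """Group task ids into waves; each wave only needs results from earlier waves."""
    done: set[int] = set()
    pending = set(deps)
    waves: list[list[int]] = []
    while pending:
        ready = sorted(t for t in pending if deps[t] <= done)
        if not ready:
            # Nothing can run: the remaining tasks wait on each other
            # or on an id that does not exist.
            raise ValueError(f"cycle or dangling reference among tasks {sorted(pending)}")
        waves.append(ready)
        done.update(ready)
        pending.difference_update(ready)
    return waves


print(schedule_waves({1: set(), 2: set(), 3: {1, 2}}))  # [[1, 2], [3]]
print(schedule_waves({1: set(), 2: {1}, 3: {2}}))       # [[1], [2], [3]]
```

The second call shows the degenerate case: a strict chain yields one task per wave, so the schedule is as long as the plan.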

The Joiner then composes the final answer from all results — and in the full paper can also decide to replan if the results are insufficient, making LLMCompiler usable for dynamic, multi-step problems rather than only static plans.

python

evidence = "\n".join(f"${tid} = {results[tid]}" for tid in sorted(results))
answer = model.invoke([
    SystemMessage(JOINER_SYSTEM),
    HumanMessage(f"Task: {task}\nResults:\n{evidence}"),
]).content

The paper reports a latency speedup of up to 3.7×, cost savings of up to 6.7×, and an accuracy improvement of up to ~9% over ReAct, by automatically identifying which calls are independent.

ReWOO vs LLMCompiler

Both decouple planning from execution, and as implemented in the lab modules both make a fixed number of reasoning LLM calls (plan + compose); the paper's LLMCompiler joiner can add calls by replanning. The difference is the shape of the plan and how it executes:

Aspect	ReWOO	LLMCompiler
Plan shape	linear chain (`#E1` → `#E2` → …)	DAG (`$k` references)
Execution	sequential worker substitution	parallel waves (dependency-ready groups)
Wins on	token cost, tool-failure robustness	latency on independent calls
Adaptivity	none (commits blind)	joiner can replan
Best when	steps are inherently sequential	many sub-calls are independent

Put simply: ReWOO is linear-decoupled, LLMCompiler is parallel-DAG. If your task is "look up three unrelated facts then combine them," LLMCompiler wins — those three lookups belong in one wave. If your task is a strict chain where each step feeds the next, the DAG has depth equal to its length and you get no parallelism, so ReWOO's simpler two-call structure is the cleaner choice. Neither beats ReAct on tasks that genuinely need to see an observation before deciding the next move — that adaptivity is what you trade away for efficiency.
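The chain-versus-fan-out contrast comes down to one number, the length of the longest dependency path. A small sketch that computes it for two hypothetical four-task plans is shown here; `dag_depth` and both example plans are illustrative, and the function assumes the graph has no cycles.

```python
from functools import lru_cache


def dag_depth(deps: dict[int, set[int]]) -> int:
    """Longest dependency path, counted in tasks (assumes the graph is acyclic)."""
    @lru_cache(maxsize=None)
    def depth(tid: int) -> int:
        return 1 + max((depth(d) for d in deps[tid]), default=0)

    return max((depth(t) for t in deps), default=0)


fan_out = {1: set(), 2: set(), 3: set(), 4: {1, 2, 3}}  # three lookups, then combine
chain = {1: set(), 2: {1}, 3: {2}, 4: {3}}              # each step feeds the next

for name, plan in (("fan-out", fan_out), ("chain", chain)):
    print(name, "tasks:", len(plan), "depth:", dag_depth(plan))
# fan-out tasks: 4 depth: 2
# chain tasks: 4 depth: 4
```

Both plans contain four tasks, but only the fan-out plan has a depth smaller than its task count, which is the only situation in which parallel dispatch can shorten the run.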

Run it

Both modules expose a single function and a CLI. DeepSeek is the only paid API used (configure your key per the lab README); the calculator and other tools are local.

python

from agents_lab.rewoo import run_rewoo

r = run_rewoo("compute (6*7)+1")
print(r.answer)     # -> "43"
print(r.plan)       # parsed steps: [("E1", "calculator", "6 * 7"),
                    #                 ("E2", "calculator", "#E1 + 1")]
print(r.evidence)   # {"E1": "42", "E2": "43"}

The plan lines the planner emits look like:

#E1 = calculator[6 * 7]
#E2 = calculator[#E1 + 1]

Note #E2 references #E1 — the worker substitutes 42 before calling the tool. Two LLM calls total (plan, solve); the arithmetic happens in tools, costing no tokens.

Now LLMCompiler on a task with independent sub-calls:

python

from agents_lab.llm_compiler import run_llm_compiler

r = run_llm_compiler("compute (2*3) + (10-4)")
print(r.answer)     # -> "12"
print(r.results)    # {1: "6", 2: "6", 3: "12"}
print(r.waves)      # [[1, 2], [3]]  <- tasks 1 and 2 ran in parallel

The planner emits a DAG like:

1. calculator[2 * 3]
2. calculator[10 - 4]
3. calculator[$1 + $2]

r.waves == [[1, 2], [3]] is the payoff made visible: the independent multiplication and subtraction ran together in wave 1, and the dependent sum ran in wave 2. Two waves for three tasks: the latency tracks the DAG depth, not the task count.

From the command line:

bash

uv run python -m agents_lab.cli rewoo "compute (6*7)+1"
uv run python -m agents_lab.cli llm-compiler "compute (2*3) + (10-4)"

Run the LLMCompiler example with a few more independent lookups and watch the first wave grow wider while the wave count stays flat — that widening is exactly the parallelism the DAG buys you, and the reason its wall-clock latency stays low as the fan-out increases.
