Lesson 16 of 19 in Phase 4 · Agents & Orchestration

Multi-Agent Debate: Convergence Through Critique

🤖 Phase 4 · Agents & Orchestration · Intermediate · ~8 min read
Recommended prerequisite: #107 ReWOO and LLMCompiler: Efficient Tool Planning
← Previous: ReWOO and LLMCompiler: Efficient Tool Planning · Next →: Generative Agents: The Memory Stream

Ask one language model a hard question and you get one chain of reasoning, with all its blind spots baked in. Ask three models the same question independently, then let each one read the others' answers and revise, and something different happens: the agents that were wrong for idiosyncratic reasons tend to get pulled toward the answer that the majority can independently justify. This is multi-agent debate: propose, critique, revise, repeat, then take the majority. It is the runnable companion to multi-agent systems, grounded in Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (ICML 2024, arXiv:2305.14325), which showed the technique beats zero-shot chain-of-thought and self-reflection across six reasoning and factuality benchmarks using one identical prompt loop for every task. This lesson is the code-first lab for agents-lab/agents_lab/debate.py.

Mental Model

Debate is peers converging through mutual critique, not a boss handing out work. Every agent answers the same question, sees the same dissent, and revises on equal footing; there is no router, no specialist roles, no shared scratchpad of who-did-what. The only structure is the round: a synchronized barrier where all answers from round r become the context for every agent's round r+1.

Contrast this sharply with the supervisor lab. A supervisor is a router: one LLM decides which specialist acts next and when to stop, and the workers never talk to each other. Debate is the opposite topology: homogeneous peers, all-to-all visibility, no central decision-maker. The supervisor buys you specialization (each worker owns one narrow capability); debate buys you error correction through redundancy (N independent attempts at the same problem, cross-checked). Reach for a supervisor when subtasks differ; reach for debate when one hard question needs a more reliable single answer.

It is also a generalization of self-consistency over sampled chain-of-thought. Self-consistency samples N independent chains and takes the majority vote, but those chains never see each other. Debate adds the missing edge: between sampling and voting, each agent reads the others and gets a chance to change its mind. Self-consistency is debate with zero revision rounds.

The loop: round 0, then R rounds of revision

The whole algorithm is two phases. Phase one is round 0: every agent answers the raw question alone, with no knowledge of the others. These are the independent samples, the diversity that makes the rest work.

python
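# Assumption: SystemMessage and HumanMessage are LangChain's message classes.
from langchain_core.messages import HumanMessage, SystemMessage

SOLO_SYSTEM = "Answer the task concisely. State only your answer."

# Round 0: each agent answers alone; model, task, and n_agents come from the caller.
current = [
    str(model.invoke([SystemMessage(SOLO_SYSTEM),
                      HumanMessage(f"Task: {task}")]).content)
    for _ in range(n_agents)
]
rounds = [current]  # rounds[0] is the independent round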

Phase two is R rounds of revision, where R = n_rounds - 1 because n_rounds counts round 0. In each round, every agent is shown its own previous answer plus the other agents' latest answers, and asked to critique-and-update. The prompt deliberately frames this as adversarial-but-honest: defend if you are right, update if they are more convincing.

python
DEBATE_SYSTEM = (
    "Here are other agents' answers to the task. Use them to critique and "
    "improve your own answer. If they are more convincing, update; otherwise "
    "defend. State only your (possibly revised) answer."
)

for _ in range(max(0, n_rounds - 1)):     # n_rounds total, round 0 already done
    revised = [
        str(model.invoke([
            SystemMessage(DEBATE_SYSTEM),
            HumanMessage(
                f"Task: {task}\nYour answer: {current[i]}\n"
                f"Other answers:\n{_others(current, i)}"
            ),
        ]).content)
        for i in range(n_agents)
    ]
    current = revised
    rounds.append(current)

_others(current, i) simply concatenates every agent's answer except agent i's; that is the "exposure to dissent" channel. Note the synchronization: all agents in a round read the same snapshot (current) and write into revised. No agent sees another agent's already-revised answer mid-round. That barrier is what keeps the rounds clean and the process reproducible.

After the last round, the final answer is the majority of the final round, not of all rounds and not of round 0. By the last round the agents have (ideally) converged, so the mode of current is the consensus:

python
from collections import Counter
answer = Counter(current).most_common(1)[0][0]
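One caveat with the one-liner above: `Counter` compares strings exactly, so "Yes." and "yes" count as two different answers, and a tie goes to whichever answer appeared first. The sketch below shows one way to vote over lightly normalized answers. The `normalize` and `majority` names are hypothetical and not part of debate.py, and the normalization rule is only an illustration.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical rule: trim whitespace, drop trailing periods, lowercase.
    return answer.strip().rstrip(".").lower()


def majority(answers: list[str]) -> str:
    """Return the most common answer after normalization.

    Ties go to the answer seen first, because Counter.most_common
    orders equal counts by first occurrence.
    """
    return Counter(normalize(a) for a in answers).most_common(1)[0][0]


print(majority(["Yes.", "yes", "No"]))
```

Whether normalization is worth it depends on the task: for free-form answers even this may not be enough to make equivalent answers compare equal.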

Why exposure to dissent reduces hallucination

A single model's hallucinations are correlated with its own reasoning path: once it commits to a wrong premise, every subsequent token reinforces it. Debate breaks that self-reinforcement. When agent A confidently states a wrong fact, agents B and C, which sampled different paths, usually did not make the same mistake, and their answers appear in A's next prompt as concrete counter-evidence. A model is far more willing to abandon a claim when shown a specific competing answer than when merely asked "are you sure?"

The asymmetry is the key insight from Du et al.: correct answers tend to be defensible under scrutiny, while hallucinations tend not to be. When the majority can independently arrive at and re-justify the same answer, a lone dissenter usually folds. When agents genuinely disagree because the question is hard, the rounds surface that disagreement instead of papering over it, which is itself useful signal. This is why debate improves factuality specifically, not just reasoning: factual errors are exactly the kind of mistake that one agent makes and the others do not, so they get voted out.

Cost: it scales with agents × rounds

Debate is not free, and the cost is blunt: you make roughly n_agents × n_rounds model calls for a single question. A 3-agent, 2-round debate is 6 calls; bumping to 5 agents and 3 rounds is 15 calls. That is 6–15× the calls of a single answer, and the token bill grows faster still, because the revision-round prompts are longer than round 0: each one carries every other agent's answer in its context.
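To make that arithmetic concrete, here is a small budgeting sketch. `debate_calls` follows the call count stated above. `estimate_input_tokens` is a rough, hypothetical estimate that ignores system prompts and assumes every answer has the same length; the token figures in the example are made up for illustration.

```python
def debate_calls(n_agents: int, n_rounds: int) -> int:
    """Model calls for one debate; n_rounds counts round 0."""
    return n_agents * n_rounds


def estimate_input_tokens(n_agents: int, n_rounds: int,
                          task_tokens: int, answer_tokens: int) -> int:
    """Rough input-token estimate, ignoring system prompts (an assumption).

    Round-0 prompts carry only the task. Each revision prompt carries the
    task plus all n_agents answers (the agent's own and its peers').
    """
    round0 = n_agents * task_tokens
    per_revision_prompt = task_tokens + n_agents * answer_tokens
    revisions = n_agents * (n_rounds - 1) * per_revision_prompt
    return round0 + revisions


for n_agents, n_rounds in [(3, 2), (5, 3)]:
    print(n_agents, n_rounds,
          debate_calls(n_agents, n_rounds),
          estimate_input_tokens(n_agents, n_rounds,
                                task_tokens=50, answer_tokens=20))
```

Under these assumptions the call count grows linearly in each knob, while the revision-round input grows with the square of the agent count, since every agent reads every answer.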

So treat the two knobs as a budget you spend deliberately:

  • Adding agents widens the independent-sample base: better majority signal, more diverse dissent. Diminishing returns set in fast; 3 is a sensible default, and beyond ~5 you rarely earn back the cost.
  • Adding rounds gives convergence more time. The Du et al. results show most of the gain by round 2–3; additional rounds mostly burn tokens once the agents have already agreed.

The honest framing: debate trades compute for reliability. Use it on the questions where a wrong answer is expensive (factual claims, multi-step math, anything you would otherwise verify by hand), not on every call. Measure the lift with agent evaluation before you pay for it in production; if a 3×2 debate does not beat a single call on your task, the extra five calls are waste.

Diversity needs temperature > 0

Debate has a silent failure mode: if all agents share one model at temperature 0, round 0 produces N identical answers. There is no dissent to expose, every revision round is a no-op, and you have paid for N calls to reproduce one deterministic answer. The entire mechanism depends on the round-0 samples actually differing.

In production, the agents in debate.py share one DeepSeek model, so you get diversity by sampling with temperature > 0 (0.7 is a reasonable starting point); that is what makes the N independent answers genuinely independent. In the lab and in tests the model is injected (run_debate(..., llm=fake_model)) so runs stay deterministic and free; the temperature concern only bites when you wire in the real API. DeepSeek is the only paid API in this lab, and each of the n_agents × n_rounds calls hits it.
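The failure mode is easy to reproduce without any API. The sketch below is a self-contained stand-in for the debate loop, not the code in debate.py: `toy_debate`, `Ask`, and `deterministic_model` are hypothetical names, and the fake model is a plain function of its prompt, which is how a temperature-0 model behaves in the ideal case.

```python
from typing import Callable

# Hypothetical model interface: (system prompt, user prompt) -> answer.
Ask = Callable[[str, str], str]


def toy_debate(ask: Ask, task: str,
               n_agents: int = 3, n_rounds: int = 2) -> list[list[str]]:
    """Stand-in for the debate loop; returns the answers per round."""
    current = [ask("solo", f"Task: {task}") for _ in range(n_agents)]
    rounds = [current]
    for _ in range(max(0, n_rounds - 1)):
        revised = []
        for i in range(n_agents):
            peers = "\n".join(a for j, a in enumerate(current) if j != i)
            user = (f"Task: {task}\nYour answer: {current[i]}\n"
                    f"Other answers:\n{peers}")
            revised.append(ask("debate", user))
        current = revised
        rounds.append(current)
    return rounds


def deterministic_model(system: str, user: str) -> str:
    # Mimics temperature 0: the same prompt always yields the same answer.
    prefix = "Your answer: "
    for line in user.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]  # no dissent to react to, so it defends
    return "42"


rounds = toy_debate(deterministic_model, "What is 6 * 7?")
print(len(set(rounds[0])))       # number of distinct round-0 answers
print(rounds[0] == rounds[-1])   # True when no agent changed its answer
```

A cheap guard in the same spirit is to check `len(set(round0_answers)) > 1` after round 0 and skip the revision rounds when every agent already agrees.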

Run it

The runnable module lives at agents-lab/agents_lab/debate.py. The public entry point is run_debate:

python
from agents_lab.debate import run_debate

res = run_debate("Is 1729 special?", n_agents=3, n_rounds=2)
print(res.answer)   # last-round majority answer
print(res.rounds)   # answers per round, per agent

run_debate returns a DebateResult with two fields:

  • res.answer: the final answer, computed as the majority (mode) of the last round's agent answers.
  • res.rounds: a list[list[str]] where rounds[r][i] is agent i's answer in round r. rounds[0] holds the independent round-0 answers; the last entry holds the answers the majority vote was taken over. Inspect this to watch convergence happen: compare res.rounds[0] against res.rounds[-1] to see which agents changed their minds and whether dissent collapsed toward agreement.
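That comparison can be automated. A minimal sketch, assuming only the list[list[str]] shape described above: `changed_agents` and `agreement` are hypothetical helpers that are not part of debate.py, and the trace is hand-written for illustration, not real model output.

```python
from collections import Counter


def changed_agents(rounds: list[list[str]]) -> list[int]:
    """Indices of agents whose final answer differs from their round-0 answer."""
    return [
        i for i, (first, last) in enumerate(zip(rounds[0], rounds[-1]))
        if first != last
    ]


def agreement(answers: list[str]) -> float:
    """Fraction of agents that gave the most common answer in one round."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


# Hand-written trace in the shape of res.rounds.
rounds = [
    ["It is a taxicab number", "It is odd", "It is a taxicab number"],
    ["It is a taxicab number", "It is a taxicab number", "It is a taxicab number"],
]
print(changed_agents(rounds))
print([agreement(r) for r in rounds])
```

Agreement that rises from round to round suggests convergence; agreement that stays flat suggests the agents genuinely disagree, which is worth surfacing to the caller instead of hiding behind the majority vote.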

For 1729 (the Hardy–Ramanujan number, the smallest positive integer expressible as a sum of two positive cubes in two distinct ways: 1³ + 12³ = 9³ + 10³), you can watch an agent that opened with a vague answer get pulled toward the precise one once it sees a peer state the property explicitly.

From the CLI

bash
uv run python -m agents_lab.cli debate "Is 1729 special?"

The CLI runs the default 3-agent, 2-round debate against the live DeepSeek model and prints the answer along with the per-round trace, so you can see the same rounds structure from your shell.

What to take away

  • Debate = independent round 0 → R rounds of peer-conditioned revision → majority of the last round. Each agent revises after seeing the others' answers.
  • It reduces hallucination because factual errors are usually uncorrelated across independent agents, so dissent votes them out; correct answers survive scrutiny.
  • It is peers converging, not a supervisor routing: homogeneous agents with all-to-all visibility and no central decider. It generalizes self-consistency (sampled chain-of-thought plus majority voting) by adding revision between sampling and voting.
  • Cost is n_agents × n_rounds calls, a real budget tradeoff. Spend it on high-stakes questions and confirm the lift with agent evaluation.
  • In production you need temperature > 0 for the round-0 diversity the whole method depends on.

This is the runnable companion to multi-agent systems: read that for the theory, run debate.py to feel the convergence in your hands.
