Most ways to make a model "learn" cost a training run: collect a reward signal, compute gradients, update weights. Reflexion (Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS 2023, arXiv:2303.11366) does something cheaper and stranger: it lets the agent learn within an inference session by writing notes to itself. After a failed attempt, the agent composes a short natural-language reflection on what went wrong, files it in memory, and retries with that reflection in its context. The weights never move; the policy that changes is the prompt. This lesson is the runnable companion to the Reflexion section of agent architectures, and it lives in this repo's agents-lab/ package so you can run the loop yourself.
Mental Model
Reflexion is gradient descent done in English: the loss is "did the evaluator say I failed?", and the update is a sentence you append to your own context. A normal agent that fails just fails. A Reflexion agent treats each failure as a labeled training example, but instead of backpropagating, it verbalizes the lesson and carries it into the next trial. The "reinforcement" is the accumulating stack of reflections, and the learning rate is how sharply the model conditions on them.
Two properties fall out of this. First, you need a real success/failure signal: without a trustworthy evaluator, the reflection reinforces noise. Second, you need episodic memory: reflections must persist across trials, or every attempt starts cold. Reflexion is therefore as much an agent memory pattern as a reasoning one.
The Three Roles
Reflexion factors the agent into three distinct roles. In the paper these are three separate models; in a minimal implementation they can be three different prompts to the same model.
- Actor. Generates the attempt (text, code, or a sequence of actions) conditioned on the task and any reflections accumulated so far. This is the policy being improved.
- Evaluator. Produces a reward signal for the attempt. Crucially, this signal can come from outside the model entirely: a unit-test runner, a compiler, a string match against a known answer, or an environment that returns done/not-done. When no external signal exists, an LLM judge stands in.
- Self-Reflection. Takes the failed attempt plus the evaluator's feedback and writes a concise, actionable reflection: not "I was wrong" but "I assumed the input was sorted; next time validate ordering first." This verbal lesson is what gets stored.
The loop is: Actor attempts → Evaluator scores → on failure, Self-Reflection writes a lesson → the lesson is appended to episodic memory → Actor retries with memory in context. Repeat until success or a trial budget is exhausted.
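The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab's implementation: the function name reflexion_loop and the three toy callables are hypothetical stand-ins, and a real system would replace the actor and reflector with LLM calls.

```python
from typing import Callable, List


def reflexion_loop(
    task: str,
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str, str], bool],
    reflect: Callable[[str, str], str],
    max_trials: int = 3,
) -> dict:
    """Attempt -> evaluate -> reflect -> retry, until success or budget."""
    reflections: List[str] = []  # episodic memory: lessons from failed trials
    answer = ""
    for trial in range(1, max_trials + 1):
        answer = actor(task, reflections)      # Actor conditions on memory
        if evaluator(task, answer):            # Evaluator gives pass/fail
            return {"answer": answer, "success": True,
                    "trials": trial, "reflections": reflections}
        # Self-Reflection turns the failure into a stored lesson.
        reflections.append(reflect(task, answer))
    return {"answer": answer, "success": False,
            "trials": max_trials, "reflections": reflections}


# Toy stand-ins (hypothetical): a real system would call an LLM here.
def toy_actor(task: str, reflections: List[str]) -> str:
    # "Improves" only once a lesson about the base case is in memory.
    if any("base case" in r for r in reflections):
        return "fib(0) = 0"
    return "fib(0) = 1"


def toy_evaluator(task: str, answer: str) -> bool:
    return answer == "fib(0) = 0"


def toy_reflect(task: str, answer: str) -> str:
    return "I got the base case wrong; fib(0) must be 0."


result = reflexion_loop("define fib(0)", toy_actor, toy_evaluator, toy_reflect)
print(result)
```

With these toy callables the first trial fails, one lesson is stored, and the second trial succeeds, which is the smallest possible instance of the pattern.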
Verbal Reinforcement Learning
The phrase "verbal reinforcement learning" is precise, not metaphor. Classical RL parameterizes a policy with weights and nudges them via a scalar reward. Reflexion parameterizes the policy as the agent's memory plus a fixed LLM, and the "reward" is converted into language (the self-reflection), which is then fed back as context. There are no gradients, no optimizer, no checkpoint to save. The entire learning signal is text the agent wrote and re-read.
This buys three things fine-tuning can't:
- Zero training infrastructure. It works on a frozen, API-only model. (The lab uses DeepSeek for exactly this reason; see Run it.)
- Interpretability. The "learned" knowledge is a readable list of reflections. You can inspect it, edit it, delete a bad lesson, or seed the memory with hand-written ones.
- Sample efficiency on a single task. A weight update needs many examples to move the loss meaningfully; a verbal lesson can fix a specific failure on the very next attempt.
The cost is that the learning is episodic and shallow: it lives in the context window, so it disappears when the session ends unless you persist it, and it competes for context budget with everything else.
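Both costs can be handled with a little plumbing. The sketch below is one possible approach, not something the lab ships: the helper names, the JSON file layout, and the character budget (a crude proxy for a token budget) are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_reflections(path: Path) -> List[str]:
    """Return previously saved lessons, or an empty list on the first run."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_reflections(path: Path, reflections: List[str]) -> None:
    """Persist lessons so they survive the end of the session."""
    path.write_text(json.dumps(reflections, indent=2), encoding="utf-8")


def render_memory(reflections: List[str], max_chars: int = 1000) -> str:
    """Render the newest lessons that fit within a character budget."""
    kept: List[str] = []
    used = 0
    for lesson in reversed(reflections):   # walk newest first
        line = f"- {lesson}"
        if used + len(line) + 1 > max_chars:
            break                          # budget exhausted: drop older lessons
        kept.append(line)
        used += len(line) + 1              # +1 accounts for the newline
    return "\n".join(reversed(kept))       # restore chronological order


# Round-trip through a temporary file, then render under a tight budget.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "reflections.json"
    save_reflections(store, ["validate ordering first", "fib(0) must return 0"])
    restored = load_reflections(store)

block = render_memory(restored, max_chars=30)
print(block)
```

Keeping the newest lessons first is a design choice, on the assumption that recent failures are the most relevant; a production system might rank by relevance to the current task instead.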
Why the Evaluator Is the Linchpin
Reflexion's improvements are bounded by the quality of the reward signal. If the evaluator says "pass" when the answer is wrong, the agent stops too early; if it says "fail" on a correct answer, the agent reflects on phantom mistakes and degrades. Garbage verdicts produce garbage reflections: the loop amplifies whatever signal you give it.
This is why the strongest results come from tasks with a programmatic, ground-truth evaluator. On code generation, the evaluator is a test suite: it is cheap, deterministic, and only as fallible as the tests themselves. The paper reports Reflexion reaching 91% pass@1 on HumanEval, above the GPT-4 baseline of ~80%, largely because unit tests give an unambiguous fail signal to reflect on. On decision-making in ALFWorld, the environment itself returns success/failure per episode. On reasoning (HotPotQA), an exact-match check against the gold answer plays the evaluator role.
When no programmatic check exists, you fall back to an LLM-as-judge evaluator, a separate model call that grades the attempt. This is the territory of agent evaluation: the judge has its own error rate and biases, so for anything high-stakes you want a deterministic check wherever you can manufacture one. The design lesson: make the evaluator as objective as the task allows, because every downstream reflection inherits its reliability.
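As an illustration of manufacturing a deterministic check, here is a sketch of a normalized exact-match evaluator for question answering. The normalization rules (lowercasing, stripping punctuation and English articles) and the names normalize and make_exact_match_evaluator are assumptions for this example; they are not taken from the paper or the lab.

```python
import re
import string
from typing import Callable, Dict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def make_exact_match_evaluator(
    gold_answers: Dict[str, str],
) -> Callable[[str, str], bool]:
    """Build a (task, answer) -> bool evaluator from a task -> gold answer map."""
    def evaluator(task: str, answer: str) -> bool:
        gold = gold_answers.get(task)
        if gold is None:
            # Refuse to guess: a missing gold answer is a setup error, not a fail.
            raise KeyError(f"no gold answer for task: {task!r}")
        return normalize(answer) == normalize(gold)
    return evaluator


gold = {"Who wrote Hamlet?": "William Shakespeare"}
em = make_exact_match_evaluator(gold)
print(em("Who wrote Hamlet?", "william shakespeare."))   # matches after normalization
print(em("Who wrote Hamlet?", "Christopher Marlowe"))    # does not match
```

Exact match is strict: a correct answer phrased differently ("Shakespeare") still fails, so the agent may reflect on a formatting problem rather than a reasoning one. That trade-off is usually preferable to a judge whose errors are harder to see.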
Run it
This repo ships a minimal, runnable Reflexion loop in agents-lab/agents_lab/reflexion.py. It is built on LangGraph (actor → evaluate → reflect nodes with a conditional edge back to the actor) and uses DeepSeek under the hood, the only paid API the lab depends on. If you have worked through the ReAct lab, the actor node will look familiar; Reflexion wraps a second loop around that kind of attempt.
The key design choice is an injectable evaluator. You pass a (task, answer) -> bool function (a unit-test runner, a grader, or a plain equality check), and the loop uses it as the reward signal. Omit it and the agent self-evaluates with an LLM judge.
from agents_lab.reflexion import run_reflexion
# Deterministic evaluator: the reward signal is a programmatic check, not a judge.
final = run_reflexion(
    "solve it",
    evaluator=lambda task, ans: ans == "correct answer",
    max_trials=3,
)
print(final["answer"]) # the final attempt (best effort if never solved)
print(final["reflections"]) # the verbal lessons accumulated across trials
print(final["success"]) # True if the evaluator ever returned True
print(final["trials"]) # how many attempts were spent
Three things worth tracing in the source:
- Reflections accumulate across trials. The state field reflections is reduced with (a or []) + (b or []), so each failed trial appends a new lesson rather than overwriting. On trial N, the actor sees the lessons from trials 1..N-1 rendered as a bulleted "Lessons from past attempts" block. That accumulating stack is the episodic memory buffer.
- The evaluator is the only source of the fail signal. evaluate_node calls your function (or the default LLM judge) and sets success. The router sends the loop to reflect only when success is false, so a flaky evaluator directly produces noisy reflections, exactly as the agent evaluation discussion above warns.
- max_trials bounds the loop. Without a budget a stubborn task would reflect-and-retry forever, burning tokens. The router returns END once trials >= max_trials, returning the best attempt so far. (The compiled graph also sets a recursion_limit of 3 * max_trials + 5 as a hard backstop.)
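The three mechanisms can be modeled as small pure functions, independent of LangGraph. This is a sketch under assumptions, not a copy of reflexion.py: the function names, the END placeholder string, the exact wording of the lessons block, and the order of checks in the router are all hypothetical. Only the reducer expression, the trials >= max_trials condition, and the 3 * max_trials + 5 formula come from the description above.

```python
from typing import List, Optional

END = "END"  # stand-in for the graph's END sentinel (assumption for this sketch)


def merge_reflections(a: Optional[List[str]], b: Optional[List[str]]) -> List[str]:
    """Reducer: append new lessons to existing ones, tolerating None."""
    return (a or []) + (b or [])


def render_lessons(reflections: List[str]) -> str:
    """Render memory as a bulleted block for the actor's prompt."""
    if not reflections:
        return ""
    bullets = "\n".join(f"- {r}" for r in reflections)
    return "Lessons from past attempts:\n" + bullets


def route(state: dict, max_trials: int) -> str:
    """Decide the next step after evaluation."""
    if state["success"]:
        return END                 # solved: stop
    if state["trials"] >= max_trials:
        return END                 # budget exhausted: stop with best effort
    return "reflect"               # failed with budget left: write a lesson


def recursion_limit(max_trials: int) -> int:
    """Hard backstop on graph steps, using the formula quoted above."""
    return 3 * max_trials + 5


memory = merge_reflections(None, ["validate ordering first"])
memory = merge_reflections(memory, ["fib(0) must return 0"])
print(render_lessons(memory))
print(route({"success": False, "trials": 1}, max_trials=3))
print(recursion_limit(3))
```

Writing the router as a pure function of state makes the termination logic easy to unit-test without running a model.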
To watch it self-improve, supply an evaluator that the first attempt is likely to fail and read the reflections that get generated. A self-evaluating run (no evaluator argument) lets the model judge itself; this is useful when there is no ground truth, but remember the judge can be wrong.
Run it from the CLI:
uv run python -m agents_lab.cli reflexion "write a python function that returns the nth fibonacci number"
# Programmatic evaluator example: reward = "the produced code passes these tests"
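import os, subprocess, sys, tempfile, textwrap
from agents_lab.reflexion import run_reflexion

def tests_pass(task: str, answer: str) -> bool:
    # Assumes answer is plain Python source defining fib (no markdown fences).
    src = textwrap.dedent(answer) + "\nassert fib(10) == 55\nassert fib(0) == 0\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(src)
        path = f.name
    try:
        # Run with the current interpreter; the timeout guards against infinite loops.
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)  # clean up the temporary file

final = run_reflexion(
    "write a python function fib(n) returning the nth Fibonacci number",
    evaluator=tests_pass,
    max_trials=4,
)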
Here the evaluator is a real unit-test run. A first attempt that bungles the base cases fails the assertion; the Self-Reflection node writes a lesson like "fib(0) must return 0, not 1; check the base cases"; the next attempt reads that lesson and fixes it. That is the entire mechanism: failure → words → a better retry, no weights touched.
Where Reflexion Fits
Reach for Reflexion when three conditions hold: the task is retryable (a second attempt is cheap and meaningful), a reliable verdict exists (tests, a verifier, or a trustworthy judge), and the failure modes are correctable in language (the model knows better but slipped, rather than fundamentally lacking the capability). It shines on code generation and bounded decision-making for exactly this reason.
It is the wrong tool when there is no success signal to reflect on, when a single attempt is expensive and you cannot afford retries, or when the knowledge needs to persist across thousands of tasks. At that scale, distilling reflections into weights (fine-tuning) or a durable retrieval store (agent memory) beats re-deriving them in-context every session. For the broader map of how Reflexion sits alongside ReAct, Plan-and-Execute, and tree search, see agent architectures.