Lesson 14 of 19 in Phase 4 · Agents & Orchestration

Self-Refine: Iterative Refinement with Self-Feedback

🤖 Phase 4 · Agents & Orchestration · Intermediate · ~7 min read
Recommended prerequisite: #105 Tree of Thoughts: Deliberate Search Over Reasoning Steps

Most ways to improve a model's answer add machinery: a second model to critique it, a test harness to grade it, a retrieval step to ground it. Self-Refine (Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023, arXiv:2303.17651) throws all of that out. One model does everything: it generates a first draft, critiques its own draft, and refines the draft using that critique, looping until the feedback says "good enough" or a budget runs out. No fine-tuning, no reward model, no external evaluator. Just the same model, prompted three different ways, talking to itself. This lesson is the runnable companion to the refinement section of agent architectures, and it lives in this repo's agents-lab/ package so you can run the loop yourself.

Mental Model

Self-Refine is a writer who edits their own work: draft, read it critically, rewrite the same draft, repeat — never starting over, never asking anyone else. The insight is that an LLM is usually better at criticizing an answer than at producing a perfect one in a single shot. Generation and evaluation are different cognitive jobs, and splitting them across turns lets the model catch mistakes it couldn't avoid while it was busy generating. The "refinement" is not a fresh attempt; it's a surgical rewrite of the existing draft conditioned on the model's own critique.

Two properties fall out of this. First, the feedback has to be specific and actionable: "make it better" does nothing, while "the second example is wrong because X, fix it" drives a real edit. Second, you need a stopping signal: the feedback step must be able to say "I cannot meaningfully improve this," or the loop keeps rewriting until the budget runs out, often degrading the answer along the way. Both live entirely inside the prompts.

One Model, Three Jobs

The whole method is three prompts pointed at the same model. The lab makes them explicit as system messages:

python
GENERATE_SYSTEM = "Produce a first attempt at the task. Be concise."
FEEDBACK_SYSTEM = (
    "Critique the answer for correctness, completeness and clarity. If it cannot "
    "be meaningfully improved, reply exactly 'STOP'. Otherwise give specific, "
    "actionable feedback."
)
REFINE_SYSTEM = "Rewrite the answer to address the feedback. Output only the improved answer."

There is no separate generator, critic, and editor — those are roles, not models. The same weights play all three. What changes between turns is the system prompt and what gets carried in the user message: the feedback step sees the task and the current answer; the refine step sees the task, the current answer, and the feedback it just produced.

The Loop

The control flow is small enough to read end to end. Generate once, then alternate feedback and refine until a satisfaction marker appears or the budget is spent:

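python
# Assumed import: the snippet uses LangChain-style message classes.
# `model` is any chat model exposing .invoke(); `task` and `max_iters` are inputs.
from langchain_core.messages import HumanMessage, SystemMessage

# One generate call produces the initial draft.
answer = model.invoke(
    [SystemMessage(GENERATE_SYSTEM), HumanMessage(f"Task: {task}")]
).content

history = []  # (answer, feedback) pairs, one per feedback turn
for _ in range(max_iters):
    # Critique the latest draft.
    feedback = model.invoke(
        [SystemMessage(FEEDBACK_SYSTEM),
         HumanMessage(f"Task: {task}\nAnswer: {answer}")]
    ).content
    history.append((answer, feedback))
    # Exit before spending a refine call if the critique signals satisfaction.
    if _satisfied(feedback):
        break
    # Rewrite the same draft, conditioned on the critique just produced.
    answer = model.invoke(
        [SystemMessage(REFINE_SYSTEM),
         HumanMessage(f"Task: {task}\nAnswer: {answer}\nFeedback: {feedback}")]
    ).content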

Notice that the loop critiques first and refines second. Each iteration critiques the latest answer before deciding whether to spend a refine call, so a draft that is already good costs only one feedback call on top of the initial generation before the loop exits.
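To watch that ordering without calling an API, here is a minimal sketch of the same control flow with scripted stand-ins for the three roles. The function `self_refine` and the `fake_*` callables are hypothetical names for illustration, not the lab's code, and the stop check here is a plain exact match to keep the sketch short.

```python
def self_refine(task, generate, feedback, refine, satisfied, max_iters=3):
    """Same control flow as the loop above, with the model calls injected."""
    answer = generate(task)
    history = []
    iterations = 0  # counts refine passes only
    for _ in range(max_iters):
        fb = feedback(task, answer)
        history.append((answer, fb))
        if satisfied(fb):
            break
        answer = refine(task, answer, fb)
        iterations += 1
    return answer, iterations, history


calls = []  # records which role was invoked, in order


def fake_generate(task):
    calls.append("generate")
    return "draft v0"


def fake_feedback(task, answer):
    calls.append("feedback")
    # Scripted critic: satisfied once the draft reaches version 2.
    return "STOP" if answer.endswith("v2") else "Tighten the second sentence."


def fake_refine(task, answer, fb):
    calls.append("refine")
    version = int(answer[-1]) + 1
    return f"draft v{version}"


answer, iterations, history = self_refine(
    "demo task",
    fake_generate,
    fake_feedback,
    fake_refine,
    satisfied=lambda fb: fb.strip().upper() == "STOP",
    max_iters=5,
)
print(answer, iterations, len(history))
print(calls)
```

The recorded call order alternates feedback and refine after the single generate call, and the final feedback call is the one that ends the loop without a matching refine.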

The stopping check is deliberately forgiving — the model rarely emits a bare STOP, so the marker test scans for any of a few satisfaction phrases:

python
_SATISFIED = ("stop", "no further", "no issues", "looks good", "no changes")

def _satisfied(feedback: str) -> bool:
    low = feedback.strip().lower()
    return any(m in low for m in _SATISFIED)

If none of those appear, the loop runs until max_iters is spent and returns the latest refined draft regardless. The budget is the hard backstop, and that final draft is returned without a further critique pass.
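One caveat with substring matching is that a real critique can contain a marker word by accident, for example feedback that talks about when a training loop should "stop". The sketch below reproduces the lab's check as shown above and compares it with a hypothetical guard, `satisfied_short`, that only trusts a marker when the feedback is brief. The 12-word threshold is an arbitrary assumption for illustration, not something the lab or the paper prescribes.

```python
_SATISFIED = ("stop", "no further", "no issues", "looks good", "no changes")


def satisfied_substring(feedback: str) -> bool:
    """The forgiving check from the lesson: any marker anywhere in the text."""
    low = feedback.strip().lower()
    return any(m in low for m in _SATISFIED)


def satisfied_short(feedback: str, max_words: int = 12) -> bool:
    """Hypothetical stricter variant: a marker only counts in short feedback."""
    low = feedback.strip().lower()
    return len(low.split()) <= max_words and any(m in low for m in _SATISFIED)


critique = (
    "The explanation never says when to stop the training loop, and the "
    "second example uses the wrong learning rate, so the reader cannot "
    "reproduce it."
)

print(satisfied_substring(critique))  # marker word appears inside a real critique
print(satisfied_short(critique))      # long feedback is treated as a critique
print(satisfied_short("Looks good. No changes needed."))
```

Whether a guard like this is worth it depends on the task: it trades a few missed early stops for fewer critiques that get silently discarded.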

The Contrast with Reflexion

Self-Refine looks superficially like the Reflexion lab, and the two are easy to confuse. They are different patterns solving different problems:

| | Self-Refine | Reflexion |
| --- | --- | --- |
| Evaluator | The model critiques itself; no external evaluator | An external / programmatic evaluator (tests, judge) decides pass/fail |
| What changes | The same draft is edited in place | Each trial retries from scratch |
| Memory | None; feedback is consumed in the next turn and discarded | Verbal reflections persist across episodes in memory |
| Signal | "Can this be improved?" (soft, self-judged) | "Did I succeed?" (hard, externally graded) |
| Best when | Quality is subjective and self-assessable: prose, code style, explanations | Success is checkable: passing tests, correct math, completing a task |

The line that matters: Self-Refine has no ground-truth signal and refines one artifact; Reflexion has a real success/failure signal and starts each attempt over, carrying lessons forward. If you have a reliable evaluator, reach for Reflexion. If you only have the model's own taste, Self-Refine is the honest tool — but it inherits the model's blind spots, since a mistake it can't recognize is a mistake it can't fix.

Run it

The lab exposes one function. Give it a task and a budget; it returns the final answer, how many refine passes it actually spent, and the full (answer, feedback) trace:

python
from agents_lab.self_refine import run_self_refine

res = run_self_refine("write a haiku about gradient descent", max_iters=3)
print(res.answer)      # the final, refined draft
print(res.iterations)  # number of refine passes actually taken (0..max_iters)
print(res.history)     # list of (answer, feedback) tuples, one per feedback turn

iterations counts refine passes, so it can be less than max_iters when the feedback stops the loop early. When an early stop fires, len(res.history) is exactly one more than iterations, because the last feedback entry is the one that triggered the break; when the budget runs out instead, the two are equal. Walking history is the best way to see the loop work: each tuple pairs a draft with the model's critique of it, and the next tuple's draft shows what changed in response.
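A small helper makes the trace easier to read. The sketch below uses a hypothetical `RefineResult` dataclass as a stand-in with the three fields described above (the lab's actual return type may differ), and the trace values are hand-written for illustration rather than real model output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RefineResult:
    """Hypothetical stand-in for the lab's result object."""
    answer: str
    iterations: int
    history: List[Tuple[str, str]]


def stopped_early(res: RefineResult) -> bool:
    # An early stop leaves one feedback turn with no refine pass after it.
    return len(res.history) == res.iterations + 1


def format_trace(res: RefineResult) -> str:
    lines = []
    for turn, (draft, feedback) in enumerate(res.history, start=1):
        lines.append(f"turn {turn} draft:    {draft}")
        lines.append(f"turn {turn} feedback: {feedback}")
    return "\n".join(lines)


# Illustrative trace: one refine pass, then the critic is satisfied.
res = RefineResult(
    answer="second draft",
    iterations=1,
    history=[
        ("first draft", "The second line has the wrong syllable count."),
        ("second draft", "STOP"),
    ],
)

print(format_trace(res))
print("stopped early:", stopped_early(res))
```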

There's also a CLI for quick experiments:

bash
uv run python -m agents_lab.cli self-refine "explain async/await to a beginner"

DeepSeek is the only paid API the lab calls, and each completion is cheap, but count the calls correctly: with max_iters=3 a run makes between two completions (generate plus one feedback call that stops the loop) and seven (generate plus three feedback/refine pairs). Watch the cost characteristic: every feedback and refine call re-sends the task and the full current draft, so longer drafts and higher budgets multiply tokens quickly.
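Counting calls directly from the loop as written in this lesson gives a rough cost model. The sketch below is a back-of-envelope estimate under stated assumptions: the function name `estimate_cost` is hypothetical, token sizes are treated as constant across iterations, and system-prompt and output tokens are ignored.

```python
def estimate_cost(task_tokens, draft_tokens, feedback_tokens,
                  refine_passes, stopped_early):
    """Return (completions, approximate input tokens) for one run.

    Assumes constant draft and feedback sizes; ignores system prompts
    and output tokens, so treat the result as a lower-bound estimate.
    """
    # An early stop adds one feedback call that has no refine after it.
    feedback_calls = refine_passes + (1 if stopped_early else 0)
    completions = 1 + feedback_calls + refine_passes
    input_tokens = (
        task_tokens                                            # generate
        + feedback_calls * (task_tokens + draft_tokens)        # feedback
        + refine_passes * (task_tokens + draft_tokens + feedback_tokens)  # refine
    )
    return completions, input_tokens


# Best case: the first draft is accepted immediately.
print(estimate_cost(20, 200, 80, refine_passes=0, stopped_early=True))
# Worst case for max_iters=3: the budget is exhausted.
print(estimate_cost(20, 200, 80, refine_passes=3, stopped_early=False))
```

The draft is sent twice per full iteration, once to the critic and once to the refiner, which is why input tokens grow faster than the completion count suggests.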

When It Helps, When It Hurts

Self-Refine shines on tasks where the model can see its own mistakes after the fact: tightening prose, catching an off-by-one in code it just wrote, adding a missing edge case to an explanation. It hurts when the failure mode is invisible to the model — a factual error it's confident about will survive every refine pass untouched, because the feedback step shares the same blind spot as the generate step. It can also over-edit: without a hard stopping signal, a model will keep "improving" a perfectly good answer into something worse, which is exactly why the satisfaction marker and max_iters backstop both exist.

For deciding whether the loop actually helps on your task, treat it like any other agent change and measure it — see agent evaluation for setting up A/B comparisons between a single-shot baseline and the refined output. The honest test is whether iteration moves your metric, not whether the drafts feel more polished.
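As a starting point, a paired comparison can be sketched in a few lines. Everything here is a placeholder: `compare` is a hypothetical helper, the canned outputs stand in for real single-shot and refined runs, and exact match against a reference is only one possible metric.

```python
from statistics import mean


def compare(tasks, baseline, refined, score):
    """Score both systems on the same tasks and summarize the paired deltas."""
    deltas = [score(t, refined(t)) - score(t, baseline(t)) for t in tasks]
    wins = sum(d > 0 for d in deltas)
    losses = sum(d < 0 for d in deltas)
    return {
        "mean_delta": mean(deltas),
        "wins": wins,
        "losses": losses,
        "ties": len(deltas) - wins - losses,
    }


# Toy data: (task id, reference answer). Outputs are canned, not model output.
tasks = [("q1", "4"), ("q2", "9"), ("q3", "7")]
baseline_out = {"q1": "4", "q2": "8", "q3": "7"}
refined_out = {"q1": "4", "q2": "9", "q3": "6"}

report = compare(
    tasks,
    baseline=lambda t: baseline_out[t[0]],
    refined=lambda t: refined_out[t[0]],
    score=lambda t, answer: float(answer == t[1]),  # exact match vs. reference
)
print(report)
```

In this toy data refinement fixes one answer and breaks another, so the mean delta is zero even though individual drafts changed, which is the kind of result that polish alone would hide.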
