Lesson 18 of 19 in Phase 4 · Agents & Orchestration

Agent Persistence & Human-in-the-Loop: Making Agents Shippable

Intermediate · ~9 min read

Recommended prerequisite: #109 Generative Agents: The Memory Stream


This is a code-first companion to four production-infra modules in the agents-lab/ Python package: persistence.py, human_in_the_loop.py, streaming.py, and tracing.py. The agent architectures elsewhere in this roadmap — the ReAct lab, the supervisor, the planner — answer "can the agent solve the task?" These four answer a different question: "can you run that agent in production without it losing work, doing something irreversible behind your back, or going dark for thirty seconds?" The reasoning loop is the engine; these are the chassis, seatbelts, and dashboard. Everything below maps onto runnable code, built on LangGraph and langgraph-checkpoint-sqlite — both open source. The only paid API anywhere in the lab is DeepSeek for the LLM calls, and none of these four modules even need it: the demo graphs are deterministic so persistence, approval, streaming, and tracing are observable and testable without a key.

Mental Model

Autonomous is not the same as unsupervised: an agent becomes shippable when its state is durable, its consequential actions wait for approval, and its progress is visible. A bare reasoning loop holds everything in memory and runs to completion or death. That is fine in a notebook and unacceptable in production, where processes get redeployed mid-task, humans step away, networks stall, and on-call engineers need to know what the thing actually did at 3am. The fix is three properties bolted onto the same graph. Durability means state survives a crash or restart, so you resume instead of restarting. Approval means the agent can stop before an irreversible action and hand the decision to a person. Visibility means every step is observable as it happens and every span is recorded after. LangGraph supplies the first two through one shared primitive, the checkpointer, and the third through .stream() and the standard callbacks config. They compose: human-in-the-loop is literally just persistence plus a pause.

1. Persistence: durable state through a checkpointer

A checkpointer is the foundation everything else stands on. Compile a graph with one and pass a thread_id, and LangGraph saves a checkpoint after every super-step (readable as a StateSnapshot via graph.get_state(config)) and reloads the latest one on the next invoke. The thread_id is the persistent cursor: reuse it and the run continues from where it left off; use a fresh one and you start from a blank state. The lab's demo graph is a deterministic accumulator: each invoke ticks a counter and appends a marker, so you can watch state survive across calls.

python

from agents_lab.persistence import (
    build_accumulator, memory_checkpointer, sqlite_checkpointer, step,
)

# Ephemeral: lives only as long as the process.
graph = build_accumulator(memory_checkpointer())

step(graph, thread_id="alice")   # {'count': 1, 'items': ['tick 1']}
step(graph, thread_id="alice")   # {'count': 2, 'items': ['tick 1', 'tick 2']}
step(graph, thread_id="bob")     # {'count': 1, 'items': ['tick 1']}  — separate thread

State accumulates per thread because the reducers on the state schema merge each step's update into the saved snapshot, not because the demo holds anything in a Python variable. alice and bob are fully isolated. The only thing that changes between ephemeral and durable is which checkpointer you hand to build_accumulator:

python

# Durable: SQLite on disk, survives process restarts.
ckpt = sqlite_checkpointer("agents_lab.sqlite")
graph = build_accumulator(ckpt)

step(graph, thread_id="alice")   # {'count': 1, ...}
# ... process exits, machine redeploys, you come back tomorrow ...

# New process, new graph object, SAME file + SAME thread_id:
graph = build_accumulator(sqlite_checkpointer("agents_lab.sqlite"))
step(graph, thread_id="alice")   # {'count': 2, ...}  — resumed from disk

InMemorySaver (from langgraph.checkpoint.memory) is for tests and demos; SqliteSaver (from the open-source langgraph-checkpoint-sqlite package) writes to a file and survives restarts. The interface is identical — the lab's sqlite_checkpointer just opens a connection with check_same_thread=False (so it works across LangGraph's worker threads) and calls saver.setup() to create the tables. In production you would swap in the Postgres saver the same way. The point is that durability is a one-line decision, and it wraps any agent graph in the lab, not just the accumulator.
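To make the mechanism concrete, here is a stdlib-only toy model of what a thread-keyed checkpointer does. It is a sketch under stated assumptions, not LangGraph's implementation: the table layout and the names open_store, load_state, save_state, and toy_step are hypothetical, and the two "reducers" (add for count, list concatenation for items) are hard-coded to mirror the accumulator demo.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (or create) a toy checkpoint store; all names here are hypothetical."""
    # check_same_thread=False allows the connection to be used from other threads.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def load_state(conn, thread_id):
    """Return the saved state for a thread, or a blank state for a new thread."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {"count": 0, "items": []}


def save_state(conn, thread_id, state):
    """Overwrite the single stored snapshot for this thread."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    conn.commit()


def toy_step(conn, thread_id):
    """One 'invoke': load, compute an update, merge it with reducers, save."""
    state = load_state(conn, thread_id)
    update = {"count": 1, "items": [f"tick {state['count'] + 1}"]}
    merged = {
        "count": state["count"] + update["count"],   # reducer: addition
        "items": state["items"] + update["items"],   # reducer: concatenation
    }
    save_state(conn, thread_id, merged)
    return merged


# Simulate a restart: close the connection, reopen the same file, reuse the thread_id.
db_path = os.path.join(tempfile.mkdtemp(), "toy.sqlite")
conn = open_store(db_path)
first = toy_step(conn, "alice")
conn.close()

conn = open_store(db_path)
second = toy_step(conn, "alice")   # continues from the saved snapshot
other = toy_step(conn, "bob")      # a fresh thread starts blank
conn.close()
```

Nothing survives in a Python variable between the two alice calls; the second call sees count 1 only because it was read back from the file. Unlike this toy, which keeps one row per thread, a real checkpointer keeps a history of checkpoints.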

2. Human-in-the-loop: pause for approval before acting

Some actions are irreversible: spending money, sending an email, running a destructive command. For those, the agent should stop and ask. LangGraph's interrupt() does exactly that: called inside a node, it suspends the graph, persists the paused state through the checkpointer (which is why HITL requires one), and surfaces a request to the caller. When you resume with Command(resume=decision), the node re-runs from its first line and interrupt() returns the human's decision as its value, so anything the node does before calling interrupt() executes twice and should be free of side effects or idempotent.

python

from agents_lab.human_in_the_loop import build_approval_graph, start, resume
from agents_lab.persistence import sqlite_checkpointer

def transfer_funds(request: str) -> str:
    return f"EXECUTED: {request}"

graph = build_approval_graph(sqlite_checkpointer(":memory:"), action=transfer_funds)

# start() runs until the interrupt, then returns the pending request — the action
# has NOT run yet; the paused state is persisted under the thread_id.
pending = start(graph, request="wire $5,000 to vendor", thread_id="t1")
print(pending)   # interrupt payload: {'approve_action': 'wire $5,000 to vendor'}

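At this point the run is frozen in the checkpointer. In this demo that is an in-memory SQLite database (":memory:"), so pass a file path instead if the pause has to survive a process restart. The caller inspects the request (show it in a UI, post it to Slack, page a human) and only then resumes. Approval runs the action; rejection skips it and records "rejected":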

python

final = resume(graph, decision="approve", thread_id="t1")
print(final["result"])   # 'EXECUTED: wire $5,000 to vendor'

# A different thread that gets rejected:
start(graph, request="delete prod database", thread_id="t2")
final = resume(graph, decision="reject", thread_id="t2")
print(final["result"])   # 'rejected'  — action never called

The critical production rule lives in the thread_id: you must resume with the same one you started with, because that is how the checkpointer knows which frozen state to restore. The lab's _APPROVE set treats True, "approve", "yes", "y", and "ok" as approval; anything else is a reject, so the resume value can come straight from a button click or a chat reply. Because the pause is just a persisted checkpoint, the human can answer in five seconds or five hours; with a durable checkpointer the graph waits indefinitely and holds no running process or compute while idle, only the stored checkpoint. This is the approval-gate building block, and you wrap any agent's tool/act node with it. See LangGraph human-in-the-loop for the full interrupt API and patterns like editing state or routing to different branches on resume.
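The pause-and-resume semantics described above can be modeled in a few lines of plain Python. This is a toy sketch, not LangGraph's interrupt(): the class ToyApprovalGate, the Paused exception, and the pending dict standing in for the checkpointer are all hypothetical, and the APPROVE set simply copies the values listed above.

```python
_MISSING = object()  # sentinel: "no human decision supplied yet"
APPROVE = {True, "approve", "yes", "y", "ok"}  # copied from the lesson text


class Paused(Exception):
    """Raised by the toy interrupt() to suspend the node (hypothetical)."""

    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload


class ToyApprovalGate:
    def __init__(self, action):
        self.action = action
        self.pending = {}     # thread_id -> saved request; stands in for a checkpointer
        self.node_runs = 0    # counts how often the node body executes

    def _node(self, request, resume_value):
        self.node_runs += 1   # code before interrupt() runs again on resume

        def interrupt(payload):
            if resume_value is _MISSING:
                raise Paused(payload)   # first pass: suspend and surface the request
            return resume_value         # second pass: return the human's decision

        decision = interrupt({"approve_action": request})
        return self.action(request) if decision in APPROVE else "rejected"

    def start(self, request, thread_id):
        try:
            self._node(request, _MISSING)
        except Paused as pause:
            self.pending[thread_id] = request   # "persist" the paused state
            return pause.payload

    def resume(self, decision, thread_id):
        request = self.pending.pop(thread_id)   # KeyError for an unknown thread_id
        return {"result": self._node(request, decision)}


calls = []

def transfer_funds(request):
    calls.append(request)
    return f"EXECUTED: {request}"

gate = ToyApprovalGate(transfer_funds)
pending = gate.start("wire $5,000 to vendor", thread_id="t1")
calls_while_paused = list(calls)                      # action has not run yet
approved = gate.resume("approve", thread_id="t1")

gate.start("delete prod database", thread_id="t2")
rejected = gate.resume("reject", thread_id="t2")
```

The node_runs counter makes the re-run visible: one start plus one resume executes the node body twice, which is why anything placed before the interrupt call should be safe to repeat.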

3. Streaming: surface progress as it happens

An agent can churn for thirty seconds across many LLM and tool calls. A frozen spinner tells the user nothing and tells your logs less. LangGraph graphs expose .stream(); the lab wraps it into a plain generator of (node_name, update) events that a CLI, an SSE endpoint, or a test can consume one step at a time.

python

from agents_lab.streaming import iter_updates, build_demo_pipeline

graph = build_demo_pipeline()   # 2-node text pipeline, no LLM

for node, update in iter_updates(graph, {"text": "hello"}):
    print(node, "->", update)
# upper   -> {'text': 'HELLO', 'log': ['upper']}
# exclaim -> {'text': 'HELLO!', 'log': ['exclaim']}

iter_updates runs the graph with stream_mode="updates", so each yielded chunk is the delta produced by one node — exactly what you want to render as a progress line or push down a websocket. Pass config={"configurable": {"thread_id": ...}} as the third argument and streaming composes with persistence: you get live progress and a durable record. The same loop wraps any agent in the lab — point it at the ReAct lab graph and you'll see each reason/act/observe step arrive as it completes instead of after the whole run. For richer modes — token-by-token messages streaming, custom events, or streaming through interrupts — see LangGraph streaming & observability.
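As an illustration of the per-node delta shape and of how an SSE endpoint might frame it, here is a stdlib-only sketch. The functions toy_iter_updates and to_sse are hypothetical stand-ins (the lab's iter_updates wraps a real LangGraph .stream() call), and the two nodes imitate the demo pipeline rather than reproduce its source.

```python
import json


def upper(state):
    return {"text": state["text"].upper(), "log": ["upper"]}


def exclaim(state):
    return {"text": state["text"] + "!", "log": ["exclaim"]}


PIPELINE = [("upper", upper), ("exclaim", exclaim)]


def toy_iter_updates(pipeline, inputs):
    """Yield (node_name, update) after each node, before the next one starts."""
    state = {"text": inputs["text"], "log": list(inputs.get("log", []))}
    for name, node in pipeline:
        update = node(state)
        # Merge the delta into the running state: overwrite text, append to log.
        state["text"] = update["text"]
        state["log"] = state["log"] + update["log"]
        yield name, update


def to_sse(node, update):
    """Format one event as a Server-Sent Events frame (event + data + blank line)."""
    return f"event: {node}\ndata: {json.dumps(update)}\n\n"


events = list(toy_iter_updates(PIPELINE, {"text": "hello"}))
frames = [to_sse(node, update) for node, update in events]
```

Because the function is a generator, a consumer receives the first event before the second node has run, which is the property that turns a frozen spinner into a progress line.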

4. Tracing: structured spans, no SaaS

Streaming is for live progress; tracing is the recorded flight log you read after the fact, and it is the heart of agent observability. Production tracing often means sending spans to a hosted SaaS backend. The lab's JsonlTracer is a stdlib-only LangChain callback handler: it records a timed span for every LLM and tool call (name, inputs, output, wall-clock latency, token usage) and appends each as a JSON line. Attach it to any .invoke() or .stream() through the standard callbacks config, so it layers on top of every other agent in the lab without touching the agent's code.

python

from agents_lab.tracing import JsonlTracer

tracer = JsonlTracer(path="trace.jsonl")

result = agent.invoke(
    {"input": "..."},
    config={"callbacks": [tracer]},
)

# Each completed span was appended to trace.jsonl as it finished:
#   {"type": "tool", "name": "search", "output": "...", "duration_ms": 12.4}
#   {"type": "llm",  "name": "deepseek-chat", "token_usage": {...}, "duration_ms": 803.1}

print(tracer.summary())
# {'llm': {'count': 3, 'total_ms': 2410.7}, 'tool': {'count': 5, 'total_ms': 61.2}}

The handler keys spans by run_id: on_*_start stamps a perf_counter() and records the name; on_*_end computes duration_ms, attaches the output or token usage, and either buffers the span or writes it to JSONL if you passed a path. summary() aggregates counts and total latency by span type, which answers the first question you have in an incident: where did the time go. The second, how many tokens did this cost, comes from the token_usage recorded on each LLM span. Because the spans are plain JSON, you pipe the file into any viewer; because the handler is just LangChain's BaseCallbackHandler, you extend the on_* methods to emit OpenTelemetry when you outgrow JSONL. The default stays dependency-free and fully inspectable: no external service, no vendor lock-in.
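The start/end bookkeeping can be sketched without LangChain at all. The class below is a hypothetical model of the span logic only, not the lab's JsonlTracer: it is not a callback handler, it always keeps spans in memory so summary() works whether or not a path is given, and total_tokens assumes each token_usage dict carries an integer total_tokens key, which the lesson does not confirm.

```python
import json
import time
import uuid
from collections import defaultdict


class ToySpanRecorder:
    """Hypothetical stand-in for the span bookkeeping described above."""

    def __init__(self, path=None):
        self.path = path
        self.spans = []
        self._open = {}  # run_id -> (span_type, name, start_time)

    def start(self, run_id, span_type, name):
        self._open[run_id] = (span_type, name, time.perf_counter())

    def end(self, run_id, output=None, token_usage=None):
        span_type, name, started = self._open.pop(run_id)
        span = {
            "type": span_type,
            "name": name,
            "duration_ms": round((time.perf_counter() - started) * 1000, 3),
        }
        if output is not None:
            span["output"] = output
        if token_usage is not None:
            span["token_usage"] = token_usage
        self.spans.append(span)
        if self.path:  # append one JSON object per line
            with open(self.path, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(span) + "\n")
        return span

    def summary(self):
        out = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
        for span in self.spans:
            out[span["type"]]["count"] += 1
            out[span["type"]]["total_ms"] += span["duration_ms"]
        return dict(out)


def total_tokens(spans):
    # Assumption: token_usage looks like {"total_tokens": int, ...}.
    return sum(s.get("token_usage", {}).get("total_tokens", 0) for s in spans)


rec = ToySpanRecorder()
tool_id, llm_id = uuid.uuid4(), uuid.uuid4()
rec.start(llm_id, "llm", "deepseek-chat")
rec.start(tool_id, "tool", "search")          # spans may overlap; run_id keeps them apart
rec.end(tool_id, output="3 results")
rec.end(llm_id, token_usage={"total_tokens": 120})
report = rec.summary()
```

Keying by run_id rather than by name is what lets two overlapping calls, or two calls to the same tool, end up as separate spans with their own durations.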

Putting it together

These four modules are deliberately orthogonal and deliberately composable. Persistence is the substrate; human-in-the-loop is persistence with a pause; streaming is the same run observed live; tracing is the same run recorded for later. You attach them to a graph by configuration, not by rewriting the agent — a checkpointer at compile time, a thread_id and callbacks at invoke time — which is exactly why they wrap any architecture in the lab interchangeably. The reasoning loop decides what is possible; this layer is what makes it shippable. For how these patterns fit into larger systems, see agent orchestration; for the broader LangGraph runtime they sit on, LangGraph.
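The invoke-time half of that composition fits in one config dict. The helper below, run_config, is a hypothetical convenience rather than part of the lab; it only assembles the two keys used earlier in this lesson, configurable.thread_id for persistence and callbacks for tracing.

```python
def run_config(thread_id, callbacks=()):
    """Build a config dict carrying both the thread cursor and any tracers."""
    return {
        "configurable": {"thread_id": thread_id},  # selects the saved state
        "callbacks": list(callbacks),              # e.g. a tracer instance
    }


config = run_config("alice")

# Intended use, assuming a compiled graph and a tracer already exist:
#   graph.invoke(inputs, config=run_config("alice", [tracer]))
#   for chunk in graph.stream(inputs, config=run_config("alice", [tracer])): ...
```

The checkpointer itself is the compile-time half and is not part of this dict, so the same config can be handed to any graph that was compiled with one.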

Sources: LangGraph Interrupts docs, Mastering Persistence in LangGraph.
