Back to Autonomous agents

Autonomous Agents — Audio Guide

🎧 10 min listen · 8 chapters · autonomous agents, narrated — how far to let the model steer, and the checkpoints, pauses, and guardrails that make that safe. Every concept is paired with how the agentic-sales platform implements it in production.

01. The Autonomy Spectrum

There is a spectrum from workflows to autonomous agents. At one end, workflows have predetermined code paths designed to operate in a certain order. At the other end, agents are dynamic and define their own processes and tool usage. Workflows give predictability. Agents give flexibility for open ended problems. Many real systems sit somewhere in between. Consider a sales platform. Most of its fleet uses fixed single shot classification and enrichment workflows. These are reliable and efficient. Model directed loops are reserved for the few places they earn their cost. Two examples are open ended search and judging hard low confidence cases. In open ended search, the platform uses a cost aware tool hierarchy. The glob tool returns only paths, capped at three hundred results. The grep tool returns file and line matches, capped at two hundred. The read tool returns full files with line numbers, capped at three hundred lines. This balances cost and depth. For hard low confidence cases, an agent decides the next best action. This flexibility is worth the extra complexity. The platform blends both approaches. It uses workflows where predictability is key. It uses agents where open ended problems require dynamic decisions. This spectrum shows how to match method to need. Workflows provide a solid foundation. Agents handle the unexpected. Together, they cover the full range of tasks.

02. An Agent Loop Deployed

A production agent loop works like this. The search agent runs a fixed number of turns. Each turn, the model picks one tool to call or decides it is done. Every tool call is recorded in its own trace. On the final turn, the agent forces a fallback answer from whatever evidence it gathered. That way, the loop can never run forever, and it never returns nothing.

Why is the turn budget a parameter? Because different tasks need different depths. Searching a large codebase might need more turns than a simple lookup. The loop tracks its cost in tokens and dollars. So you can see exactly what each run used. Unlike a single call to the model, this multi-turn loop costs more per request. But it ensures the agent does not stop without an answer. The budget keeps the loop bounded. It is a trade-off between spending more on deeper searches and guaranteeing a result. The loop also records the total number of steps and whether it hit the limit. That makes production monitoring straightforward. You know exactly when an agent exhausted its turns or found a clean answer.

03. Durable Execution

Imagine an agent that can run for hours, handling complex tasks. It might crash or need to wait for a response. How does it survive? The answer is a checkpointer. This system saves the graph’s state after every step. It uses a thread identifier to track each conversation. If the process dies mid-run, it can resume exactly where it left off. The platform relies on a serverless database. Checkpoints are written over the network for durability. But this power comes with a trade-off. Durable execution requires every step to be replayable. The same input must always produce the same output. Otherwise, resuming from a checkpoint could give inconsistent results. To keep storage from growing forever, checkpoint history is compacted. Only the most recent milestones are kept. Older intermediates are pruned. This saves space without losing the ability to resume. So the agent can survive crashes, restarts, and long waits. It uses a reliable system that balances storage against resilience. Every step is designed to be rerun safely. That is the price of a long-running memory.

04. Pausing For A Human

The outreach pipeline includes a human approval gate. This gate collects every drafted message. It raises an interrupt before anything is sent. But if auto confirm was requested, it skips the interrupt. The system pauses the process and waits for a human reviewer. That reviewer sees only the actions that need a decision. The resume command carries the human choices. Only approved drafts proceed. The compose graph itself never sends anything. That responsibility lies elsewhere in the system. The pause can last indefinitely. It costs nothing because the state is checkpointed. The system saves the entire state. Then it can resume later without losing any work. This design keeps the workflow safe. Every draft is reviewed before it goes out. The reviewer only sees items that truly need approval. This avoids wasted effort. The checkpoints prevent infinite waiting. You can resume any time after the decision. The human gate is a key safety layer. It balances automation with control. The result is a reliable, auditable process.

05. Long Term Memory

The platform remembers what it learns from email conversations. It stores short distilled facts instead of raw text. These facts are organized under namespaces built from the contact, the company, and the recipient. When recalling information, it reads the most specific namespace available. A known contact beats a company-level guess, and a company guess beats a fallback by recipient name. That keeps memories from bleeding across different people. At most two searches happen per run, never a global query.

Why only distilled facts? Raw inbound email could hide prompt injections. So a language model extracts neutral, third-person facts in a separate step. Those facts are tagged with their source. Facts from our own outbound emails are first-party and safe to inject. Facts from inbound replies are marked inbound unverified and excluded from auto-sent drafts. Only distilled text ever reaches memory.

Every memory operation is fail open. A missing store, a slow call, or an authentication error simply degrades to no memory. The draft stays byte identical to baseline. The system never breaks. This is the core safety pattern: let the run complete successfully even when memory fails. The written count stays zero, and nothing is retried. So recalling past context is a best-effort add-on, never a requirement.

06. Reflection Writes Memory

A separate graph turns finished email threads into long-term memory. It does this after the email is sent and the recipient's reply is classified, not during drafting. This way, the drafting process stays fast and free from extra work. The graph first distills the conversation using a call to a large language model. That model produces up to four short factual statements about the contact or company. These facts are neutral third-person summaries, not raw text from the email. Then a persistence step writes them to a store called mem zero, scoped to the contact or company. The design is deliberately safe. Raw inbound text never reaches the store. Only the distilled facts are saved. Facts from an inbound reply are tagged with a special label that prevents them from being used in auto-sent messages. If the extraction step fails for any reason, the graph writes nothing at all. This fail open approach means a faulty model call or a network issue cannot pollute the store with garbage. The trade off is that memory is slightly delayed, but the drafting path stays simple and resistant to injection attacks. Each fact also gets a short hash key to help avoid duplicates. The whole process runs in the background, fired and forgotten, so the main user request never waits on it.

07. Many Agents One System

Multi-agent architectures solve hard problems. A cheap classifier runs first for simple decisions. When it has low confidence, the task goes to a debate panel. The panel uses several large language model reasoners. Each one returns a verdict and a confidence score. They can also see each other's reasons in a second round. Then a judge combines the majority vote. The judge writes a short rationale. The minority view is kept as dissent. This takes more model calls. But it gives confidence exactly where confidence is scarce. That is the trade off. Another pattern uses a router agent. The router sends each task to the matching specialist. It is stateless, so every request needs a new routing call. Handoffs work differently. Control passes directly between peer agents. The first agent stays active across turns. This saves about half the calls on repeat requests. Subagents start fresh each time. That gives strong isolation but repeats the full flow. Each pattern has its own cost and strength. The debate panel is for hard cases. It preserves dissenting opinions. This keeps the final verdict grounded in all views.

08. Keeping Autonomy Honest

Several controls limit autonomous behaviour in production. The search agent has a turn budget that stops it from running forever. This prevents runaway costs, but it means the agent may return without a complete answer if the budget runs out.

Inbound email and scraped pages are wrapped in clear delimiters with a note that the text is data, not instructions. The large language model treats it as information to read, not commands to follow. This is the primary defence at the model level. There is also a fallback that strips hidden control characters, so subtle injection tricks do not work.

Sending an email or running a structured query language write query requires a human to approve the action first. The agent pauses and waits for a decision. That gate keeps irreversible actions from happening without review, even if a malicious prompt slips through.

The memory store uses a fail-open design. If the infrastructure wobbles, the run degrades gracefully instead of crashing. A write that fails simply returns false, and the system continues.

The core idea is that autonomy is earned by adding guardrails, not by removing them. Each layer buys a little more trust, so the system can act independently without becoming reckless.