Lesson 10 of 10 in Phase 2 · Prompting & Structured Output

Spec-Driven Development: Specs as the Durable Contract for Coding Agents

Intermediate · ~19 min read · ~14 min listen


Spec-Driven Development (SDD) is the professional response to the chaos of unsupervised AI code generation. Instead of prompting an agent ("create me a button") and iterating against whatever it produces, you decouple the specification — the what and the why — from the implementation — the how. The spec becomes a permanent technical artifact and a contract between humans, and between humans and the agent. This lesson distills the DeepLearning.AI / JetBrains course Spec-Driven Development with Coding Agents into the engineering principles that make it work, and connects them to the rest of Phase 2: a project constitution is a system prompt for an entire codebase, a feature spec is a structured-output contract for generated code, and the whole discipline is the intent-fidelity principle from prompt engineering fundamentals applied at project scale. The course teaches one opinionated workflow; the back half of this lesson widens the lens — the competing tool landscape, a maturity taxonomy, and a deliberately skeptical reading drawn from Birgitta Böckeler's Thoughtworks survey Understanding Spec-Driven Development — so you can judge when SDD earns its overhead rather than adopt it by reflex.

Mental Model

The mental model for SDD is a control system with a small, durable setpoint driving a large, disposable output. Vibe coding is open-loop: a high-level prompt produces code, you eyeball it, you correct the agent in a long conversational dialogue whose history is never saved, and technical debt mounts because nothing authoritative records what was intended. SDD closes the loop by promoting intent to a versioned file. A few sentences in a spec — "use SQLite with Prisma ORM" — amplify into hundreds of lines of implementation; change one sentence to "MongoDB" and the same amplification regenerates downstream. Because the setpoint is tiny relative to the output, the cognitive overhead of supervising an ultra-fast coding agent collapses from "read every diff" to "review the spec, then spot-check the amplification." This is why the spec, not the code, is the primary engineering artifact: it has the highest leverage-to-size ratio in the system, it survives across agent sessions, and it is the only place where human architectural judgment is recorded once and reused indefinitely. The agent is a stateless amplifier; the spec is the memory.

Why Specs Beat Vibe Coding

Vibe coding is fast for a button and catastrophic for a project. You write a prompt, hope for the best, point out what is wrong, and repeat. The result is disposable code and an unsaved dialogue — there is no artifact a teammate, or a future agent session, can read to learn what the system is supposed to do. SDD trades a small amount of upfront writing for three compounding benefits that the course states explicitly.

First, leverage: large code changes are controlled by small spec changes. One clause about look and feel can translate to hundreds of lines of CSS, so editing the spec is dramatically more efficient than editing code by hand.

Second, context durability: specs eliminate context decay between sessions. Agents are stateless — every session boots fresh — so the highest-quality context must be loaded at boot time. A spec is exactly that durable boot payload, which is why this is the same discipline as context engineering: you are deciding what the model must know the instant it starts.

Third, intent fidelity: when you define the problem, success criteria, and constraints precisely, the agent can elaborate a fuller plan that matches what you actually need rather than what a terse prompt implied. One effective authoring technique from the course is to converse with an agent (Claude Code, Gemini, Codex) to make the key architectural trade-offs using your own judgment, then have the agent summarize the agreed decisions into the spec.

The Constitution: A System Prompt for the Whole Project

Before any feature, SDD establishes a Constitution — project-level decisions formalized into three documents the course names directly: mission.md (the why: vision, audience, scope), tech-stack.md (the common understanding of development and deployment technologies and constraints), and roadmap.md (a living, phased sequence of features). A Constitution is agent-agnostic and more structured than a single top-level AGENTS.md, and it captures the agreement on key decisions both between the human and the agent and between humans on the team.

text

# mission.md
AgentClinic — a place for AI agents to get relief from their humans.
Audience: developers learning agentic workflows. Scope: web app, parody
of PetClinic. Non-negotiable: every feature ships behind a feature spec.

# tech-stack.md
Runtime: Node + TypeScript (strict). Web: Hono. DB: SQLite + plain SQL
migrations. Tests required for validation. Deploy: single container.

# roadmap.md
Phase 1: Hello Hono (placeholder home page)
Phase 2: Agents & Ailments (CRUD over SQLite)
Phase 3: MVP — implement the remaining roadmap

You do not write the Constitution alone — you write it in conversation with the agent, which surfaces architecture patterns you had not considered, existing packages that already do the work, and trade-offs such as speed versus data fidelity. Architecturally this is identical to the discipline taught in system prompts: the Constitution is the highest-priority, lowest-trust-required instruction set that constrains everything the agent generates afterward. It is the codebase's constitution in the same sense a system prompt is an application's constitution — durable, authoritative, and the first thing loaded.

In an enterprise setting the Constitution earns a second job: automatic policy enforcement. Treated as a guardrail it yields four compounding benefits — consistency enforcement (architectural drift across long projects evaporates because every feature references the same standards), compliance documentation (regulatory and security policies become explicit, auditable artifacts), institutional-knowledge capture (hard-won lessons from security teams and architects survive personnel changes), and reduced cognitive load (developers stop manually checking each generated plan against an internal checklist because the agent does it). The categories that scale to enterprise are concrete: technology standards (approved cloud platforms, frameworks, databases), security requirements (auth provider, encryption, secret storage, PII handling), performance targets (p95 latency, concurrency, async thresholds), coding standards (style, coverage floor, documentation), and compliance/governance (audit logging, accessibility, retention). A constitution clause like "All data must be encrypted at rest" means the agent never proposes plaintext storage in a generated plan; "Authenticate all API requests using Microsoft Entra ID tokens" means generated code arrives with the auth attribute already in place. The principle of specific and testable applies the same review pressure to constitutional prose that the eval discipline applies to model output — replace "system should be fast" with "API responses complete within 200 ms for 95% of requests" or you have written a wish, not a constraint.

text

# constitution.md (enterprise excerpt)
## Technology Standards
- All cloud resources hosted on Microsoft Azure.
- Back-end services use .NET 8 or later; secrets via Azure Key Vault.
- Database: Azure SQL Database or Cosmos DB (no on-premises SQL Server).
## Security Requirements
- Authenticate all API requests using Microsoft Entra ID tokens.
- Encrypt data at rest (AES-256) and in transit (TLS 1.2+).
- Never log personally identifiable information (PII).
## Performance and Scalability
- APIs respond within 200 ms for 95th percentile requests.
- Operations exceeding 5 s use asynchronous processing.
## Compliance and Governance
- Audit-log all data modifications; retain ≥ 90 days.
- Accessibility: WCAG 2.1 Level AA minimum.
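
The "specific and testable" rule can be partly mechanized. The sketch below is an illustration rather than anything the course ships: the vague-word list and the rule "a vague word with no number is a wish" are hypothetical heuristics, and the function names are invented for this example.

```python
import re

# Hypothetical word list: terms that usually signal an unmeasurable wish.
VAGUE_WORDS = {"fast", "quick", "scalable", "secure", "robust", "user-friendly"}


def is_testable(clause: str) -> bool:
    """Return False when a clause uses a vague word and states no number."""
    words = set(re.findall(r"[a-z][a-z-]*", clause.lower()))
    has_vague_word = bool(words & VAGUE_WORDS)
    has_number = re.search(r"\d", clause) is not None
    return has_number or not has_vague_word


def lint_constitution(clauses: list[str]) -> list[str]:
    """Return the clauses that read as wishes rather than constraints."""
    return [clause for clause in clauses if not is_testable(clause)]
```

Run over the two phrasings from the paragraph above, the lint flags "System should be fast" and passes "API responses complete within 200 ms for 95% of requests". A heuristic this crude only catches the obvious cases; a human still has to judge whether the number is the right one.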

The Per-Feature Loop: Specify → Plan → Implement → Validate

The four-step loop takes each feature from a blank branch to a merged, validated diff.

Once the Constitution exists, every feature runs through a repeatable loop. The key skill is choosing the right level of detail: treat the agent as a highly capable pair programmer — give it rich context about goals, mission, audience, and constraints, and less about low-level decisions it can figure out itself.

You start each feature with fresh agent context on a dedicated branch. The agent draws what it needs from the authoritative source — the Constitution — then helps produce a feature spec, a task plan, collected requirements, and a validation scorecard. The agent asks clarifying questions; you make the key decisions (pin the framework version, enforce strict TypeScript, choose plain SQL migrations) and watch for conflicts. After implementation you /clear context and review against the spec, focusing on whether the feature works and reflects the spec rather than on which CSS classes were used. A mistake in the code usually traces back to a mistake in the plan, so you correct both the spec and the implementation to keep requirements and validation in sync. This generate-then-verify rhythm is the human-in-the-loop.

Implementing feature-by-feature, with frequent commits, keeps each diff manageable. For areas where small mistakes compound — security, database migrations — run task groups one at a time instead of implementing the whole plan in a single step. The validation scorecard turns "does this look right?" into a checkable contract, which is the structured-output principle applied to feature acceptance.
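
Running risky task groups one at a time can be expressed as a batching rule. This is a minimal sketch under stated assumptions: the `TaskGroup` shape, the tag names, and `plan_batches` are hypothetical, and real plans are markdown that an agent reads rather than Python objects.

```python
from dataclasses import dataclass

# Hypothetical tags for areas where small mistakes compound.
RISKY_TAGS = frozenset({"security", "database"})


@dataclass(frozen=True)
class TaskGroup:
    name: str
    tags: frozenset[str] = frozenset()


def plan_batches(groups: list[TaskGroup]) -> list[list[str]]:
    """Merge consecutive safe groups into one batch; isolate each risky group."""
    batches: list[list[str]] = []
    current: list[str] = []
    for group in groups:
        if group.tags & RISKY_TAGS:
            if current:
                batches.append(current)
                current = []
            batches.append([group.name])  # risky: implement and review alone
        else:
            current.append(group.name)
    if current:
        batches.append(current)
    return batches
```

Each inner list is one implement-then-review step. Order is preserved, so a migration still lands before the code that depends on it.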

Made concrete, the pre-implementation half of the loop has four moves that any tool, or a hand-authored skill, can encode. Specify: draft the spec from the Constitution. Clarify: interactively close underspecified gaps ("Maximum concurrent uploads?" "What does the user see on validation failure?") with multiple-choice options that update the spec, before any code. Plan: produce the technical plan with explicit sections showing how each Constitution principle is satisfied, then decompose it into ordered task groups. Validate: run cross-artifact consistency checks ("plan proposes PostgreSQL but the Constitution requires SQLite"; "spec requires audit logging, plan doesn't describe it"; "tasks omit the migration scripts") and walk a spec-specific checklist of "unit tests for English prose" ("Does every requirement have acceptance criteria?" "Are nonfunctional requirements measurable?"). The clarify and validate moves are where the discipline actually pays: they turn the spec into a reviewable artifact a human can disagree with before any code is written, which is the only place feedback is still cheap. This very site encodes that loop without a vendor framework: a feature is a dated requirements.md / plan.md / validation.md triplet under specs/<family>/, and an Agent Skill runs the clarify step as a fixed three-question interview before writing the files (detailed under Build Your Own Workflow below).
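
The cross-artifact checks quoted above can be sketched as plain comparisons. This is an assumption-heavy illustration: it presumes the Constitution and plan have already been summarized into key/value decisions, and `check_consistency` and its parameters are hypothetical names, not part of any SDD tool.

```python
def check_consistency(
    constitution: dict[str, str],
    spec_requirements: list[str],
    plan_choices: dict[str, str],
    plan_text: str,
) -> list[str]:
    """Return human-readable findings; an empty list means no conflicts found."""
    findings: list[str] = []

    # 1. Every decision the plan makes must match the Constitution.
    for key, required in constitution.items():
        proposed = plan_choices.get(key)
        if proposed is not None and proposed.lower() != required.lower():
            findings.append(
                f"plan proposes {proposed} for {key} "
                f"but the Constitution requires {required}"
            )

    # 2. Every requirement named in the spec must be mentioned in the plan.
    #    (Substring matching is a crude stand-in for a real review.)
    for requirement in spec_requirements:
        if requirement.lower() not in plan_text.lower():
            findings.append(f"spec requires {requirement}, plan doesn't describe it")

    return findings
```

Called with a Constitution of `{"database": "SQLite"}`, a plan that chooses PostgreSQL, and a spec that requires audit logging the plan never mentions, it returns both findings in order. In practice the agent performs this comparison over prose; the value of writing it down is that the checks become a fixed list rather than whatever the reviewer remembers.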

Replanning, the MVP Gamble, and Legacy Projects

Between features you deliberately run slow to run fast. The Constitution is a living document: when you discover a missing testing preference, or the product manager reports that 40% of users are on mobile and you must emphasize responsive design, you make the change on a dedicated replanning branch so you can track which version of the Constitution produced which code. You then instruct the agent to update existing feature specs and implementations to reflect the constitutional change — spec and code evolve together, never independently.
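
Git history already records which Constitution version was current on each branch. As an optional extra, a feature spec could carry a short fingerprint of the Constitution it was written against, so a stale spec is detectable without reading the log. The sketch below is a hypothetical convention, not something the course prescribes; the function names and the 12-character truncation are assumptions.

```python
import hashlib

# The three Constitution documents named earlier in the lesson.
CONSTITUTION_FILES = ("mission.md", "tech-stack.md", "roadmap.md")


def constitution_fingerprint(docs: dict[str, str]) -> str:
    """Hash the three documents in a fixed order and return a short digest."""
    digest = hashlib.sha256()
    for name in CONSTITUTION_FILES:
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so file boundaries cannot blur
        digest.update(docs[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


def is_stale(spec_fingerprint: str, docs: dict[str, str]) -> bool:
    """True when the spec was written against a different Constitution."""
    return spec_fingerprint != constitution_fingerprint(docs)
```

After a replanning branch changes tech-stack.md, every feature spec stamped with the old fingerprint reports as stale, which gives the agent a concrete list of specs to update.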

The MVP gamble is a controlled stress test: a variation of the standard prompt that tells the agent to implement the rest of the roadmap at once, with guidance about existing specs. You only take this risk when you are confident in the quality of your Constitution and specs and can handle the review load. If the result diverges from intent, that is a signal to run a disciplined replanning phase to eliminate whatever led the agent astray — the divergence is diagnostic feedback on your context quality, not just a bug.

SDD is not only for greenfield work. To bring it to a legacy project, start a fresh agent session on main without a specs folder and run the Constitution step against existing artifacts — a README.md, a TODO.md, issue trackers, spreadsheets. The agent explores the codebase through tool calls and reverse-engineers the SDD artifacts: it extracts the file structure, framework versions, and roadmap items from what already exists. The Constitution then aligns future agent changes with what past developers already built.
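
The agent does this extraction through its own tool calls, but one piece of it is deterministic enough to sketch. Assuming the legacy project is a Node project with a package.json (an assumption; the lesson does not say so), pinned framework versions can be lifted into draft tech-stack.md lines. The function name and output format are hypothetical.

```python
import json


def tech_stack_lines(package_json_text: str) -> list[str]:
    """Turn a package.json manifest into draft bullet lines for tech-stack.md."""
    manifest = json.loads(package_json_text)
    lines: list[str] = []
    for section in ("dependencies", "devDependencies"):
        # Sort by package name so repeated runs produce a stable diff.
        for name, version in sorted(manifest.get(section, {}).items()):
            lines.append(f"- {name} {version} ({section})")
    return lines
```

The output is a starting point for the conversation with the agent, not a finished Constitution: it records what is installed, not why it was chosen or whether it should stay.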

Runtime Internals

Understanding why SDD works requires looking at how a coding agent actually runs. An agent is stateless across sessions: each invocation boots with an empty working memory and a finite context budget. There is no persistence of the previous dialogue unless something external recorded it — which is precisely the failure mode of vibe coding. SDD treats the spec set as the agent's externalized memory: at boot, the Constitution and the active feature spec are loaded as the highest-quality context, so the model's limited budget is spent on the next unit of work rather than on reconstructing forgotten intent. Between phases you issue /clear to flush the context window deliberately, ensuring the next feature is driven by the written spec (intent) and not by a stale memory snapshot of the last conversation. Versioning is the other half of the runtime: because specs and code live in git, a replanning branch records exactly which Constitution version produced which implementation, so a regression can be traced to a constitutional change rather than guessed at. The agent discovers artifacts at runtime through tool calls — reading files, exploring directories — which is why the tool-use layer and good repository hygiene matter: the spec is only authoritative if the agent can reliably find and read it. This boot-context-plus-versioned-artifacts model is what converts a fast but forgetful generator into a supervised, auditable engineering process.
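
The boot step described above amounts to assembling a fixed payload in priority order and refusing to start if it does not fit. This is a minimal sketch, assuming character count as a crude stand-in for tokens; `build_boot_context` and `BOOT_ORDER` are hypothetical names, and real agents load files through their own mechanisms.

```python
# Constitution documents, highest priority first.
BOOT_ORDER = ("mission.md", "tech-stack.md", "roadmap.md")


def build_boot_context(
    constitution: dict[str, str], feature_spec: str, budget_chars: int
) -> str:
    """Concatenate the Constitution, then the active feature spec.

    Raises ValueError instead of truncating, because a silently clipped
    spec would defeat the purpose of loading it.
    """
    sections = [f"# {name}\n{constitution[name].strip()}" for name in BOOT_ORDER]
    sections.append(f"# active feature spec\n{feature_spec.strip()}")
    context = "\n\n".join(sections)
    if len(context) > budget_chars:  # characters approximate tokens here
        raise ValueError(
            f"boot context is {len(context)} chars, budget is {budget_chars}"
        )
    return context
```

Failing loudly is a design choice: an oversized payload is a signal to trim the Constitution or split the feature, not something to hide from the reviewer.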

Build Your Own Workflow, and Agent Replaceability

Once the loop is mastered, the repeated prompting ("write these three files, plan, implement, validate") is friction worth automating. The course automates it with an Agent Skill — an open-standard, reusable capability authored with the agent's own skill-creator. Skills can be per-project or global and are invoked through progressive disclosure: the agent reads a skill's description and decides when to call it. Because that judgment degrades as the context window grows, apply the same heuristic as file tagging — if you know you want a skill used, name it explicitly to save thinking tokens. This is the code-agents automation pattern: capturing a repeatable engineering process as a first-class, named artifact.
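
Progressive disclosure can be pictured as two tiers of loading: a cheap catalog of descriptions that is always in context, and full instructions fetched only once a skill is chosen. The sketch below models only that split. The `Skill` shape and both function names are hypothetical, and the agent's actual selection judgment is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short; always visible to the agent
    body: str         # full instructions; loaded only after selection


def catalog(skills: list[Skill]) -> str:
    """What the agent sees up front: one line per skill, no bodies."""
    return "\n".join(f"{skill.name}: {skill.description}" for skill in skills)


def load_skill(skills: list[Skill], name: str) -> str:
    """Fetch the full instructions for a skill named explicitly."""
    for skill in skills:
        if skill.name == name:
            return skill.body
    raise KeyError(f"no skill named {name!r}")
```

Naming the skill explicitly corresponds to calling `load_skill` directly, which skips the step where the agent has to infer relevance from the catalog.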

bash

# The SDD loop, per feature, as a scripted ritual a Skill encodes:
git checkout -b feat/agents-and-ailments
# 1. Specify: agent drafts spec.md, plan.md, validation.md from the Constitution
# 2. Plan:    review task groups; run risky groups (db, security) one at a time
# 3. Implement
# 4. Validate: human-in-the-loop review against validation.md
git add -A && git commit -m "feat: agents & ailments (spec-driven)"
git checkout main && git merge --no-ff feat/agents-and-ailments

This site is its own worked example. SDD here is homegrown — no vendor framework. Each body of work is a family under specs/ (the agentic-sales audio guide, the LlamaIndex content pipeline, the longform-storytelling generator), and every family carries the same three-document Constitution: mission.md, tech-stack.md, roadmap.md. A single meta-constitution at specs/_sdd/constitution.md records the conventions all families share, so a new family inherits the discipline instead of reinventing it. A feature is one dated directory — specs/<family>/YYYY-MM-DD-<slug>/ — holding the requirements.md / plan.md / validation.md triplet, copied from a domain-neutral template at specs/_sdd/_template/. The clarify step is a fixed three-axis interview — scope · decisions · context; the agentic-sales family encodes it as a specialized Agent Skill that adds audio-experience questions, and every other family follows the same ritual by hand against the meta-constitution. Crucially, validation is not prose review — it is make-driven, pass/fail gates named in each validation.md (make audio-gate, make eval-audio-source, build and lint), so "does the amplification match intent?" becomes a checkable contract rather than a judgment call. That is the whole loop, owned end to end, with no tool you have to trust on faith.
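
The directory convention described above is simple enough to sketch. The paths follow the specs/<family>/YYYY-MM-DD-<slug>/ pattern and the three file names from the text; the slug rule (lowercase, non-alphanumerics collapsed to hyphens) and the function names are assumptions for illustration.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# The triplet each feature directory holds.
TRIPLET = ("requirements.md", "plan.md", "validation.md")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def feature_paths(family: str, day: date, title: str) -> list[str]:
    """Return the three file paths for one dated feature directory."""
    folder = PurePosixPath("specs") / family / f"{day.isoformat()}-{slugify(title)}"
    return [str(folder / name) for name in TRIPLET]
```

For example, `feature_paths("agentic-sales", date(2025, 1, 15), "Agents & Ailments")` yields three paths under specs/agentic-sales/2025-01-15-agents-ailments/. The date here is illustrative only.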

The final principle is agent replaceability. Models and agents improve monthly, so you do not want a workflow welded to one vendor. Open standards make agents swappable while the SDD workflow and tools stay put: MCP for external tools, AGENTS.md for rules, Agent Skills for repeatable workflows-plus-context, and ACP (the Agent Client Protocol) for connecting agents to editors. A feature-spec skill authored for Claude Code runs unchanged in Codex once copied to its path; the ACP registry automates discovering, installing, and connecting agents to clients across their lifecycle. SDD moves the work from the how to the what and why, so the how — which specific agent executes the spec — becomes an interchangeable implementation detail. The specs you write today become the memory of your projects tomorrow; keep them sharp, and the agent stays a replaceable driver of an engineering process you own. The same intent-first reasoning underlies prompt optimization: improve the durable instruction, not the disposable output.

The SDD Tool Landscape: Kiro, OpenSpec & Tessl

The course teaches one workflow, but "spec-driven development" is a label several tools wear differently; as Böckeler puts it, SDD "is not just one thing." Three are worth knowing because they bracket the design space. Kiro is the lightweight end: a VS Code-based tool with a fixed Requirements → Design → Tasks flow, one markdown document per step, requirements expressed as "As a…" user stories with GIVEN…WHEN…THEN… acceptance criteria, and a flexible memory bank it calls steering (product.md, structure.md, tech.md). OpenSpec is the assistant-agnostic middle: an open spec format and CLI that treats plain-markdown specs as the source of truth and pushes them into whichever coding agent you run, so the workflow stays put while the agent underneath is swappable. That is the same separation this lesson keeps drawing between the durable what and the disposable how. It pairs naturally with the emerging AGENTS.md convention, a "README for agents" that standardizes where per-repo agent context lives. The useful boundary is that the constitution governs (durable principles, cross-session rules) while AGENTS.md merely informs (the boot context an agent reads); conflating the two is what bloats so many setups. Tessl Framework (private beta, CLI that doubles as an MCP server) is the only one explicitly chasing the deep end: code files carry a // GENERATED FROM SPEC - DO NOT EDIT header, tessl document --code reverse-engineers a spec, @generate/@test tags and an API section pin the exposed interface, and tessl build regenerates the file. The decisive axes that separate them are workflow opinionation, how many artifacts a single spec sprawls into, whether a "memory bank" is optional or mandatory, and the abstraction level the spec sits at. Pick the tool by problem shape, not by brand: the same selection discipline you apply to an agent harness.
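
A generated-file header only protects anything if something checks for it. The sketch below is a hypothetical guard, not part of Tessl's tooling: it looks for the header string quoted above and reports which changed files were hand-edited despite being marked as generated. The function names are invented for this example.

```python
# Header string as quoted in the text above.
GENERATED_HEADER = "// GENERATED FROM SPEC - DO NOT EDIT"


def is_generated(source: str) -> bool:
    """True when the first non-blank line starts with the generated header."""
    for line in source.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.startswith(GENERATED_HEADER)
    return False


def blocked_edits(changed_files: dict[str, str]) -> list[str]:
    """Return the changed paths that should be edited via their spec instead."""
    return sorted(
        path for path, text in changed_files.items() if is_generated(text)
    )
```

A check like this could run before commit so that spec-as-source stays honest: a change to a generated file is redirected to the spec that produced it.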

A Maturity Taxonomy: Spec-First → Spec-Anchored → Spec-as-Source

Böckeler's most useful contribution is a maturity ladder that cuts through the marketing. Spec-first: a well-considered spec is written, used for the task, and then effectively discarded — every change starts a new spec. Spec-anchored: the spec survives the task and is maintained as the feature evolves, edited alongside the code over time. Spec-as-source: the spec is the primary artifact; humans edit only the spec and never touch the (generated) code. Every approach she examined is at least spec-first, but few genuinely reach spec-anchored, and the maintenance strategy over time is usually left vague — some tools branch per spec, which reads more like change-request scope than feature-lifetime anchoring. Her working definition is worth memorizing: a spec is a structured, behavior-oriented artifact — or a set of related artifacts — written in natural language that expresses software functionality and serves as guidance to AI coding agents. Crucially, separate the spec (task-scoped, only relevant to the change it drives) from the memory bank (cross-session rules and product/architecture context relevant to every session) — conflating them is why so many setups feel bloated; this is the same boundary context engineering draws between durable and per-task context. The honest warning attached to the top of the ladder is the Model-Driven Development parallel: spec-as-source is MDD with a natural-language model and an LLM code generator. MDD never took hold for business applications — awkward abstraction level, too much overhead — and LLMs remove the parseable-DSL overhead only by trading it for non-determinism, risking the downsides of both: inflexibility and unpredictability, minus the tool support that once validated specs for completeness.
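
The three rungs reduce to two yes/no questions, which makes the ladder easy to apply to your own setup. The sketch below encodes the definitions given above; the enum and the `classify` function are hypothetical names, and real projects will sit between rungs more often than this tidy version suggests.

```python
from enum import Enum


class Maturity(Enum):
    SPEC_FIRST = "spec-first"
    SPEC_ANCHORED = "spec-anchored"
    SPEC_AS_SOURCE = "spec-as-source"


def classify(spec_maintained_after_task: bool, humans_edit_code: bool) -> Maturity:
    """Place a workflow on the ladder from two observable habits."""
    if not spec_maintained_after_task:
        # The spec drove one task and was then effectively discarded.
        return Maturity.SPEC_FIRST
    if humans_edit_code:
        # The spec lives on and is edited alongside the code.
        return Maturity.SPEC_ANCHORED
    # Humans edit only the spec; the code is generated.
    return Maturity.SPEC_AS_SOURCE
```

Answering the two questions honestly is the useful part: a team that branches per spec and never revisits it is spec-first, whatever the tool's marketing says.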

When SDD Is Over-Engineering

A balanced engineer holds both truths. Spec-first is genuinely valuable ("how do I structure my memory bank?" and "how do I write a good spec for AI?" are among the most-asked practitioner questions), and the elaborate end of SDD can be Verschlimmbesserung: making things worse while trying to make them better. Böckeler's field reports are the cautionary data. Problem-size mismatch: asked to fix a small bug, Kiro inflated it into four user stories and sixteen acceptance criteria, "a sledgehammer to crack a nut"; a 3–5 point feature run through a heavyweight SDD toolchain produced so many files to review that she felt she would have shipped it faster with plain AI-assisted coding and stayed more in control. Review burden: the verbose, repetitive markdown is more tedious to review than the code it describes, so an effective tool needs a great spec-review experience, not just generation. False sense of control: bigger context windows do not mean the agent honors everything in them. It ignored "this already exists" notes and regenerated existing classes as duplicates, and elsewhere it applied a constitution rule too zealously. Functional/technical separation stays slippery, and our profession's track record at keeping requirements free of implementation is poor. The decision is therefore not "adopt SDD" but "match ceremony to the problem": small or well-understood changes favor tight iterative loops (the same control argument as the small-blast-radius logic in adversarial prompting and in code agents); large and well-specified work is where heavyweight SDD pays for its overhead. And because agents interpret specs non-deterministically, the validation scorecard is not optional decoration: it is the eval harness that tells you the amplification matched intent. The term is already semantically diffused (people now say "spec" to mean "a detailed prompt"), so the engineering value is not the label but the judgment of when a durable, right-sized spec beats a fast, disposable one. Treat this lesson's enthusiastic methodology and Böckeler's skepticism as the two error bars on the same measurement.
