
Prompt Engineering Fundamentals: Principles & Patterns

💬 Phase 2 · Prompting & Structured Output · Intermediate · ~22 min read

Prompt engineering has emerged as the primary interface between human intent and large language model capabilities, yet it remains poorly understood as a discipline. This article examines the foundational principles that govern effective prompting, the recurring patterns that practitioners rely on, and the tension between treating prompts as natural language instructions versus structured programs. We draw on empirical findings from recent research to move beyond folklore and toward a principled understanding of how prompts shape model behavior. For the broader discipline of managing everything that goes into a model's context window -- retrieved documents, conversation history, tool outputs, and more -- see Context Engineering.

The Prompt as a Programming Paradigm

The shift from traditional software engineering to prompt-based programming represents a fundamental change in how we specify computation. In conventional programming, developers write explicit instructions in formal languages with well-defined semantics. In prompt engineering, developers write natural language instructions that are interpreted by a probabilistic model whose exact behavior is not fully predictable.

Reynolds and McDonell (2021) characterized this as "prompt programming," arguing that prompts function as programs in a soft, probabilistic language. The key insight is that a prompt does not merely ask a question -- it configures the model's conditional distribution over outputs. Every word in a prompt shifts the probability mass over possible continuations, making prompt construction an exercise in probabilistic steering.

This framing has practical consequences. When a developer writes "You are a senior Python developer. Review the following code for bugs:", they are not simply asking a question. They are:

  1. Conditioning the output distribution on a specific expertise profile
  2. Narrowing the task space to code review rather than generation or explanation
  3. Implicitly specifying output format (bug reports rather than refactored code)
  4. Setting a quality bar ("senior" implies thoroughness and depth)

Understanding prompts as programs helps explain why small changes in wording can produce dramatically different outputs -- a phenomenon that has frustrated many practitioners but becomes more predictable with principled approaches.

Core Principles of Effective Prompting

Principle 1: Specificity Over Ambiguity

The most robust finding across prompt engineering research is that specific instructions outperform vague ones. Mishra et al. (2022) demonstrated in their "Reframing Instructional Prompts" work that decomposing ambiguous instructions into explicit sub-tasks consistently improved model performance across dozens of NLP benchmarks.

Consider the difference:

# Vague
Summarize this article.

# Specific
Summarize this article in 3 bullet points. Each bullet should capture
one key finding. Use plain language accessible to a general audience.
Keep the total summary under 100 words.

The specific version constrains the output along multiple dimensions: format (bullets), content (key findings), style (plain language), and length (100 words). Each constraint reduces the space of acceptable outputs, making the model more likely to produce what the user actually wants.

However, specificity has diminishing returns. Over-specifying can lead to rigid outputs that miss the spirit of the task while satisfying the letter of every constraint. The art lies in specifying the dimensions that matter while leaving room for the model to exercise judgment on dimensions that don't.

Principle 2: Structure Reflects Intent

Well-structured prompts are not just easier for humans to read -- they are processed more effectively by language models. Research from Anthropic's prompt engineering guidelines and OpenAI's best practices documentation converges on the finding that structured prompts with clear delineation between sections produce more reliable outputs.

Effective structural patterns include:

# Clear section delineation
## Context
You are a data analyst working with sales data for a retail company.

## Task
Analyze the following quarterly sales data and identify trends.

## Data
[sales data here]

## Output Requirements
- Format: Markdown table followed by 2-3 paragraph analysis
- Include: year-over-year comparisons, seasonal patterns
- Tone: Professional, suitable for executive summary

The use of headers, labeled sections, and clear boundaries between context, task, data, and output requirements follows what might be called the "separation of concerns" principle borrowed from software engineering. Each section serves a distinct purpose, and the model can attend to each appropriately.

Principle 3: Examples Are Worth a Thousand Words

Providing examples -- few-shot prompting -- remains one of the most effective techniques for steering model behavior. Brown et al. (2020) demonstrated in the GPT-3 paper that few-shot prompting could match or exceed fine-tuned models on many tasks. The mechanism is straightforward: examples demonstrate the input-output mapping more precisely than instructions alone. For a thorough treatment of few-shot design, example selection, and chain-of-thought reasoning, see Article 8: Few-Shot & Chain-of-Thought Prompting.

# Instruction-only (less reliable)
Extract the company name and revenue from each sentence.

# With examples (more reliable)
Extract the company name and revenue from each sentence.

Input: "Acme Corp reported $4.2B in revenue last quarter."
Output: {"company": "Acme Corp", "revenue": "$4.2B"}

Input: "Revenue for TechStart Inc. reached 890 million dollars."
Output: {"company": "TechStart Inc.", "revenue": "$890M"}

Input: "GlobalTrade's annual revenue exceeded twelve billion."
Output:

Examples serve multiple functions simultaneously: they demonstrate the output format, show how to handle edge cases (different ways of expressing monetary values), and implicitly define the extraction scope (company name and revenue only, not time period).

Principle 4: Order Matters

The order in which information appears in a prompt significantly affects model behavior. Liu et al. (2023) showed in "Lost in the Middle" that language models exhibit a U-shaped performance curve -- they make the best use of information at the beginning and end of the context, while information buried in the middle is often overlooked.

This has direct implications for prompt design, made concrete in the sketch after this list:

  • Place the most critical instructions at the beginning and end of the prompt
  • Put bulk reference data in the middle, where reduced attention does the least harm
  • Repeat key constraints if the prompt is long
  • Place the task description immediately before the expected output to maximize its influence on generation
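
One way to operationalize these guidelines is to assemble prompts programmatically, so the layout is enforced rather than remembered each time. The following is a minimal sketch; the section names and the decision to restate key constraints at the end are illustrative assumptions, not a prescribed format.

python
def build_ordered_prompt(task, reference_docs, key_constraints):
    """Illustrative builder: critical instructions at the start and end,
    bulk reference material in the middle, key constraints repeated."""
    parts = [
        "## Task",
        task,
        "## Constraints",
        "\n".join(f"- {c}" for c in key_constraints),
        "## Reference Material",  # middle: bulk data, lowest-attention region
        "\n\n".join(reference_docs),
        "## Reminder",  # end: restate what matters most, right before generation
        task,
        "Remember: " + "; ".join(key_constraints),
    ]
    return "\n\n".join(parts)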

Principle 5: Negative Instructions Are Fragile

Telling a model what not to do is less effective than telling it what to do. This mirrors findings in cognitive psychology about negative instructions ("don't think of a pink elephant") and has been empirically confirmed in prompt engineering practice.

# Fragile (negative instruction)
Summarize this article. Do not include opinions. Do not use jargon.
Do not exceed 200 words. Do not use bullet points.

# Robust (positive instruction)
Summarize this article using only factual claims from the text.
Use plain language accessible to a general audience. Write in
paragraph form. Keep the summary under 200 words.

The positive version achieves the same constraints but frames them as what to do rather than what to avoid. Models tend to follow positive instructions more reliably because the desired behavior is explicitly specified rather than implicitly defined as "everything except what's prohibited."

Common Prompting Patterns

The Role Pattern

Assigning a role or persona to the model is one of the oldest and most widely used prompting patterns. The mechanism works because role descriptions activate relevant knowledge and behavioral patterns from the training data.

You are a senior security engineer with 15 years of experience
in application security. You specialize in identifying OWASP
Top 10 vulnerabilities in web applications.

The role pattern is effective but often misunderstood. The role does not give the model capabilities it lacks -- it shifts the distribution of responses toward those that would be expected from such a persona. A "senior security engineer" role makes the model more likely to identify subtle vulnerabilities, use appropriate terminology, and structure findings in a professional format. But it does not give the model knowledge of vulnerabilities absent from its training data.

Research from Shanahan et al. (2023) in "Role-Play with Large Language Models" explores the theoretical underpinnings of this pattern, arguing that models engage in a form of simulated role-play rather than actually adopting an identity. This distinction matters when pushing the boundaries of the pattern.

The Task Decomposition Pattern

Complex tasks benefit from explicit decomposition into subtasks. This pattern draws on the same principle underlying chain-of-thought prompting but applies it at the prompt design level rather than the reasoning level.

Analyze this codebase for potential performance issues.

Step 1: Identify all database queries and classify them as
        reads or writes.
Step 2: For each query, assess whether it could benefit from
        indexing, caching, or batching.
Step 3: Look for N+1 query patterns in data fetching logic.
Step 4: Check for unnecessary data loading (selecting columns
        that aren't used).
Step 5: Compile findings into a prioritized list, ranked by
        estimated performance impact.

Explicit decomposition serves three purposes: it ensures comprehensive coverage (the model won't skip important aspects), it structures the output in a predictable way, and it reduces the cognitive load on the model by breaking a complex judgment into simpler sub-judgments.

The Format Specification Pattern

Specifying the exact output format is critical for downstream processing and user experience. This pattern has become increasingly important as LLM outputs are consumed by other software components rather than read directly by humans.

Respond with a JSON object matching this schema:
{
  "sentiment": "positive" | "negative" | "neutral",
  "confidence": float between 0 and 1,
  "key_phrases": string[] (max 5 phrases),
  "summary": string (max 50 words)
}

The format specification pattern works best when combined with examples showing the expected output for representative inputs. Schema definitions alone can be ambiguous (what counts as a "key phrase"?), but examples resolve that ambiguity.
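
When the output feeds downstream software, it also pays to pair the schema in the prompt with validation code on the response. Below is a minimal sketch using Pydantic; the field names mirror the schema above, and the parsing helper is an assumption about how an application might consume the result.

python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0, le=1)
    key_phrases: list[str] = Field(max_length=5)
    summary: str  # the 50-word limit is enforced by the prompt, not the schema

def parse_response(raw_output: str) -> SentimentResult | None:
    """Validate the model's JSON output against the prompt's schema."""
    try:
        return SentimentResult.model_validate_json(raw_output)
    except ValidationError:
        return None  # caller can retry, repair, or fall back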

The Context Window Pattern

As context windows have grown from 4K to 128K tokens and beyond, managing what goes into the context has become a design challenge in itself. The context window pattern involves deliberately curating and structuring the information provided to the model.

python
def build_prompt(user_query, relevant_docs, conversation_history):
    """Structure the context window deliberately."""
    # format_docs and format_history are assumed helpers that render
    # documents and chat turns as plain text.
    recent_history = conversation_history[-5:]  # last 5 turns only
    prompt = f"""## System Instructions
You are a technical support agent for CloudDB.

## Relevant Documentation
{format_docs(relevant_docs)}

## Conversation History
{format_history(recent_history)}

## Current Query
{user_query}

## Response Guidelines
- Cite documentation sections when answering
- If unsure, say so rather than guessing
- Suggest next steps when applicable
"""
    return prompt

This pattern treats the context window as a structured data container rather than a simple text dump. Each section has a purpose, and the overall structure helps the model understand the relationships between different pieces of information.

Clarity vs. Verbosity Tradeoffs

A persistent debate in prompt engineering concerns the tradeoff between clarity (which often requires more words) and conciseness (which reduces token usage and can sometimes improve focus). The evidence suggests this is a false dichotomy when prompts are well-structured, but a real concern when additional words add noise rather than signal.

When verbosity helps:

  • Specifying edge cases that the model would otherwise handle inconsistently
  • Providing examples that demonstrate desired behavior
  • Adding context that disambiguates the task
  • Including explicit reasoning steps for complex judgments

When verbosity hurts:

  • Redundant rephrasing of the same instruction
  • Excessive caveats and hedging in instructions
  • Including irrelevant context that competes for attention
  • Over-specifying dimensions that the model handles well by default

A useful heuristic from Anthropic's research: every sentence in a prompt should either provide information the model needs or constrain the output in a way that matters. Sentences that do neither should be removed.

Model-Specific Considerations

Different model families respond differently to the same prompts, and effective prompt engineering accounts for these differences. While the core principles apply broadly, the optimal implementation varies.

Instruction-Following Strength

Models trained with extensive RLHF or constitutional AI methods (Claude, GPT-4) tend to follow explicit instructions more reliably than base models or lightly fine-tuned models. For these models, clear instructions are often sufficient without extensive few-shot examples. For weaker instruction followers, more examples and more explicit structure may be needed.

Context Window Utilization

Models with larger context windows don't automatically use all provided context equally well. Research has shown that performance degrades with context length even within the supported window size. Practitioners should provide relevant context rather than all available context, regardless of window size.

Tokenization Effects

Different tokenizers split text differently, which can affect how models interpret prompts. For example, code-specific models may tokenize programming constructs differently than general-purpose models, affecting how code-related prompts are processed. While this is rarely a primary concern, it can explain unexpected behavior in edge cases. See Article 3: Tokenization for a deep dive into BPE, SentencePiece, and how vocabulary design shapes what models can and cannot represent.

python
# Tokenization can affect prompt interpretation: " indentation" (with a
# leading space) and "indentation" typically encode to different token
# sequences, subtly affecting how the model processes the instruction.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
print(enc.encode(" indentation"))
print(enc.encode("indentation"))

Multimodal Prompting

The expansion of language models into vision-language models (VLMs) introduces a new dimension to prompt engineering: prompting with images alongside text. Models like GPT-4V, Claude's vision capabilities, and Gemini Pro Vision accept interleaved image-text inputs, but the principles for effective multimodal prompting are still being established through practice rather than theory.

The core challenge is that images are dense, ambiguous inputs. A photograph of a whiteboard contains text, diagrams, spatial relationships, and handwriting quality signals simultaneously. Without textual guidance, the model must guess which aspect matters. Effective multimodal prompts constrain the model's visual attention in the same way that task-specific text prompts constrain textual generation.

# Weak multimodal prompt
What do you see in this image?

# Strong multimodal prompt
This image shows a UML class diagram for a payment processing system.
1. List all classes and their attributes.
2. Identify inheritance and composition relationships.
3. Flag any design pattern violations (e.g., God classes, circular dependencies).
Output your findings as a markdown table with columns: Class, Issue, Severity.

Several practices have emerged as reliable across VLMs:

Describe what the image contains before asking about it. Providing a brief textual description of the image type ("This is a chest X-ray," "This is a screenshot of a React component") anchors the model's interpretation and reduces hallucinated content. This mirrors the role pattern for text -- you are conditioning the model's visual processing on domain-relevant priors.

Use spatial references explicitly. When asking about specific regions, use directional language ("in the upper-left quadrant," "the third row of the table") rather than assuming the model will attend to the right area. VLMs process images as patch sequences, and spatial grounding in the prompt helps align textual attention with visual attention.

Annotate images when possible. Adding bounding boxes, arrows, or numbered labels to images before sending them to the model dramatically improves precision. A prompt that says "Explain the error highlighted in the red box" is far more reliable than "Find the error in this code screenshot." For a deeper treatment of VLM architectures and how visual encoders interact with language decoders, see Article 49: Vision-Language Models.

Manage image resolution and token cost. VLMs tokenize images into visual tokens -- often hundreds or thousands per image. Higher-resolution images consume more tokens and increase latency. When fine detail is not required (e.g., classifying a chart type versus reading axis labels), downsizing images before submission saves cost without sacrificing accuracy. Tokenization choices for visual inputs follow analogous tradeoffs to text tokenization (see Article 3: Tokenization for the foundational concepts).
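
A minimal sketch tying these practices together, using the OpenAI-style content-part format; the file name, model choice, and prompt wording are illustrative assumptions rather than a recommended configuration.

python
import base64
from io import BytesIO

from PIL import Image
from openai import OpenAI

client = OpenAI()

# Downsize the image first: reading a diagram's structure does not need
# full resolution, and fewer visual tokens means lower cost and latency.
img = Image.open("class_diagram.png")
img.thumbnail((1024, 1024))
buf = BytesIO()
img.save(buf, format="PNG")
image_b64 = base64.b64encode(buf.getvalue()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            # Describe the image type before asking about it.
            {"type": "text", "text": (
                "This image shows a UML class diagram for a payment processing system.\n"
                "1. List all classes and their attributes.\n"
                "2. Identify inheritance and composition relationships.\n"
                "Output your findings as a markdown table with columns: Class, Issue, Severity."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)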

Model-Specific Formatting Conventions

While the core principles of prompt engineering transfer across models, the formatting conventions that maximize compliance differ significantly between model families. These differences arise from training data composition, fine-tuning procedures, and the specific instruction-following objectives each provider optimizes for.

XML tags for Claude. Anthropic's Claude models respond particularly well to XML-style delimiters for structuring prompts. Wrapping sections in tags like <context>, <instructions>, and <examples> produces more reliable section-boundary recognition than markdown headers or plain-text labels. This likely reflects deliberate training choices -- Claude's instruction-following data emphasizes XML-tagged structure.

xml
<context>
You are reviewing a pull request for a Python web application.
The codebase uses FastAPI with SQLAlchemy ORM.
</context>

<instructions>
Review the code diff below for:
1. Security vulnerabilities (SQL injection, XSS, auth bypasses)
2. Performance issues (N+1 queries, missing indexes)
3. Code style violations against PEP 8
</instructions>

<diff>
{{code_diff}}
</diff>

<output_format>
Return findings as a JSON array of objects with keys:
file, line, severity, category, description
</output_format>

Markdown for GPT models. OpenAI's models are optimized for markdown-structured prompts. Headers (##), bullet lists, and code fences align well with GPT-4's training distribution. Triple-backtick code blocks are parsed with particular reliability for structured output specifications.

Chat template differences. Beyond surface formatting, models differ in how they handle the system/user/assistant message structure. Some models treat system messages as immutable high-priority instructions; others weight them only slightly above user messages. Understanding these differences is critical for production deployments where the system prompt must reliably override user inputs -- a topic explored in depth in Article 9: System Prompt Design.

The practical implication: prompts are not portable across model families without adaptation. A prompt that achieves 95% compliance on Claude may drop to 80% on GPT-4 (or vice versa) simply because of formatting conventions. Teams working across multiple models should maintain model-specific prompt variants and test each variant against model-specific evaluation suites.
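
In practice this often means keeping a small registry of per-model templates and selecting one at call time. A minimal sketch, reusing the XML and markdown conventions described above; the model-name matching is a deliberate simplification.

python
CLAUDE_TEMPLATE = """<context>
{context}
</context>

<instructions>
{instructions}
</instructions>

<input>
{user_input}
</input>"""

GPT_TEMPLATE = """## Context
{context}

## Instructions
{instructions}

## Input
{user_input}"""

def render_prompt(model_name: str, **fields) -> str:
    """Pick the formatting convention that matches the target model family."""
    template = CLAUDE_TEMPLATE if model_name.startswith("claude") else GPT_TEMPLATE
    return template.format(**fields)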

Prompting for Extended Reasoning

The introduction of reasoning-focused models -- OpenAI's o1 and o3 series, DeepSeek-R1, and Claude's extended thinking mode -- has created a fork in prompt engineering strategy. These models allocate internal "thinking tokens" before producing a response, fundamentally changing the role of the prompt from reasoning scaffold to task specification.

With standard models (GPT-4o, Claude Sonnet, Gemini Pro), prompting for complex reasoning requires explicit scaffolding: "Think step by step," chain-of-thought examples, or structured decomposition into sub-problems. These techniques work because the model's reasoning happens in the output tokens, and the prompt must initiate and structure that reasoning process. This family of techniques is covered extensively in Article 8: Few-Shot & Chain-of-Thought Prompting.

With reasoning models, the calculus inverts. The model already reasons internally before responding. Adding "think step by step" to a prompt for o3 is at best redundant and at worst counterproductive -- the model may produce a shallow, prompt-satisfying reasoning trace on top of its deeper internal reasoning, wasting tokens without improving accuracy.

The emerging best practices for reasoning models:

Specify the task, not the reasoning process. Instead of decomposing a problem into steps, describe the desired outcome and let the model determine how to reason toward it. Reasoning models have learned through reinforcement learning when to reason deeply and when a problem is straightforward -- a calibration that manual CoT prompting cannot replicate.

# For standard models (scaffolded reasoning)
Analyze whether this merge sort implementation is correct.
Think through each step:
1. Check the base case
2. Verify the divide step
3. Verify the merge step
4. Consider edge cases (empty array, single element, duplicates)

# For reasoning models (task specification)
Determine whether this merge sort implementation is correct.
If there are bugs, identify them with line numbers and explain
the fix. If it is correct, state why briefly.

Control reasoning depth through problem framing, not explicit instructions. Reasoning models allocate more thinking tokens to harder problems. You can influence reasoning depth by how you frame the task -- asking for a proof versus asking for a quick check will naturally elicit different levels of internal reasoning.

Use reasoning models for verification, not just generation. One of the most effective patterns is using a reasoning model to verify or critique output from a faster standard model. The reasoning model's internal deliberation is well-suited to catching subtle errors that the generating model missed. This verification pattern is becoming a production staple for high-stakes applications.
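
A sketch of this generate-then-verify pattern is below; the model names are illustrative, and the wrapper assumes an OpenAI-compatible chat API.

python
from openai import OpenAI

client = OpenAI()

def generate_and_verify(task: str) -> dict:
    # A fast, inexpensive model produces the first draft.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": task}],
    ).choices[0].message.content

    # A reasoning model checks the draft; it decides how deeply to deliberate.
    review = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": (
            f"Task:\n{task}\n\nProposed answer:\n{draft}\n\n"
            "Verify whether the proposed answer is correct and complete. "
            "List any errors with corrections, or confirm it is correct."
        )}],
    ).choices[0].message.content

    return {"draft": draft, "review": review}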

Prompt Economics

Every token in a prompt has a cost, and at production scale those costs compound into a significant engineering concern. Prompt economics -- the practice of designing prompts with cost, latency, and caching efficiency in mind -- has become an essential skill for teams deploying LLM applications beyond prototyping.

Token cost asymmetry. Most API pricing charges less for input tokens than output tokens (often 2-4x less), but input tokens still dominate total cost when prompts include large system instructions, few-shot examples, or retrieved context. A system prompt of 2,000 tokens served 10 million times per month costs meaningfully more than a 200-token system prompt -- even before considering the output side.
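
To make the asymmetry concrete, a back-of-the-envelope calculation; the $3-per-million-input-token price is an assumed figure for illustration, not any provider's actual rate.

python
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # assumed: $3 per million input tokens
REQUESTS_PER_MONTH = 10_000_000

for system_prompt_tokens in (2_000, 200):
    monthly_cost = system_prompt_tokens * REQUESTS_PER_MONTH * PRICE_PER_INPUT_TOKEN
    print(f"{system_prompt_tokens:>5}-token system prompt -> ${monthly_cost:,.0f}/month")
# 2,000 tokens -> $60,000/month; 200 tokens -> $6,000/month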

Prompt caching changes the calculus. Anthropic's prompt caching, OpenAI's cached completions, and similar features allow providers to cache the prefix of a prompt across requests. When a long system prompt is reused identically across many requests, only the first request pays the full processing cost; subsequent requests benefit from cached KV states. This creates a direct incentive to design prompts with a long, stable prefix (system instructions, examples, schemas) followed by a short, variable suffix (the user's actual input).

python
# Cache-optimized prompt structure
# ---- Cached prefix (stable across requests) ----
SYSTEM_PROMPT = """
You are a medical coding assistant. You help assign ICD-10 codes
to clinical notes.

## Coding Guidelines
[... 1500 tokens of detailed guidelines ...]

## Examples
[... 800 tokens of few-shot examples ...]

## Output Schema
{... JSON schema specification ...}
"""

# ---- Variable suffix (changes per request) ----
def build_prompt(clinical_note):
    return f"{SYSTEM_PROMPT}\n\n## Clinical Note\n{clinical_note}"

This structure maximizes cache hit rates. The guidelines, examples, and schema -- which are the same for every request -- form the cached prefix. Only the clinical note varies. For systematic approaches to reducing prompt cost through automated optimization, see Article 11: Prompt Optimization.

Designing for token efficiency. Several practical techniques reduce token count without sacrificing prompt quality:

  • Abbreviate where the model can infer. Instead of verbose natural language instructions, use concise structured formats. Models trained on code and documentation understand terse specifications.
  • Use references instead of repetition. Rather than restating a complex constraint in three different ways, state it once clearly and reference it elsewhere ("Apply the same format as Example 1 above").
  • Prune few-shot examples. More examples are not always better. Two well-chosen examples that cover the key edge cases often outperform five examples that are redundant. Test with fewer examples before assuming more are needed.
  • Compress retrieved context. When feeding RAG-retrieved documents into a prompt, summarize or extract relevant passages rather than dumping full documents. The "Lost in the Middle" problem (discussed in Principle 4 above) means that excess context actively hurts performance, not just cost.

Latency as cost. Token count affects not just monetary cost but latency. Longer prompts take longer to process, and longer outputs take longer to stream. For interactive applications, prompt economics is also a UX concern -- a 500ms response feels qualitatively different from a 3-second response, and prompt length is one of the levers engineers have to control that difference.

The Instruction Hierarchy

Modern LLM deployments typically involve multiple layers of instructions: system prompts set by developers, user messages from end users, and sometimes tool or function descriptions that further shape behavior. Understanding how these layers interact is essential for production prompt engineering.

The general hierarchy, from highest to lowest priority:

  1. System prompt -- Developer-set instructions that define the application's behavior
  2. Tool/function definitions -- Schemas and descriptions that shape tool use
  3. User message -- The end user's actual request
  4. In-context examples -- Demonstrations of desired behavior
  5. Implicit conventions -- Patterns learned during training

This hierarchy is not absolute -- models can be confused by conflicting instructions across levels -- but it represents the intended priority in most API implementations. Effective prompt engineering respects this hierarchy and avoids placing instructions at inappropriate levels.
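
In chat-style APIs, this hierarchy maps directly onto message roles. A minimal sketch of where each layer lives; the policy wording is illustrative.

python
messages = [
    # System prompt: developer-set policy, highest intended priority
    {"role": "system", "content": (
        "You are a billing support assistant. Never reveal internal "
        "account notes. Answer only billing-related questions."
    )},
    # In-context example: a demonstration of desired behavior
    {"role": "user", "content": "Example: Why was I charged twice?"},
    {"role": "assistant", "content": "I can help with that. Could you share the invoice ID?"},
    # User message: the end user's actual request
    {"role": "user", "content": "My subscription renewed early -- can I get a refund?"},
]
# Tool/function definitions would be supplied separately (e.g., via a
# tools parameter) and sit between the system prompt and user message
# in the intended priority order.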

Building a Prompt Development Workflow

Treating prompts as software artifacts requires adopting development practices borrowed from software engineering:

Version control: Store prompts in version control alongside application code. Track changes over time and maintain the ability to roll back.

Testing: Define evaluation criteria and test prompts against representative inputs before deployment. This includes both automated metrics and human evaluation.

python
# A prompt testing framework sketch; call_llm is an assumed wrapper
# around whatever LLM client the application uses.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    inputs: dict
    evaluate: Callable[[str], bool]

class PromptTest:
    def __init__(self, prompt_template, test_cases):
        self.prompt = prompt_template
        self.tests = test_cases

    def run(self):
        results = []
        for test in self.tests:
            output = call_llm(self.prompt.format(**test.inputs))
            passed = test.evaluate(output)
            results.append({"input": test.inputs, "output": output,
                            "passed": passed})
        return results

# Define test cases with expected properties
tests = [
    TestCase(
        inputs={"text": "Great product, love it!"},
        evaluate=lambda out: json.loads(out)["sentiment"] == "positive"
    ),
    TestCase(
        inputs={"text": "Terrible experience, never again."},
        evaluate=lambda out: json.loads(out)["sentiment"] == "negative"
    ),
]

Iteration: Start with a simple prompt, evaluate it, identify failure modes, and iterate. Each iteration should address specific failure cases without introducing new ones -- a process that mirrors debugging in traditional software development.

Documentation: Document the intent behind each prompt, the failure modes it addresses, and the model version it was tested against. Prompts that work on one model version may not work on the next.
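
One lightweight way to carry this documentation alongside the prompt itself is a metadata record stored next to the template in version control. A sketch with assumed field names and example values:

python
PROMPT_METADATA = {
    "name": "sentiment_extraction_v3",
    "intent": "Extract sentiment, confidence, and key phrases as JSON",
    "tested_against": ["gpt-4o-2024-08-06"],  # model versions verified
    "addresses": [
        "v2 returned prose instead of JSON for sarcastic reviews",
        "v1 exceeded the 5-phrase limit on long inputs",
    ],
    "last_reviewed": "2025-01-15",
}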

Summary and Key Takeaways

  • Prompts function as programs in a probabilistic language; understanding this framing helps explain why small wording changes can produce large output differences.
  • The five core principles -- specificity, structure, examples, ordering, and positive framing -- provide a reliable foundation for prompt design across models and tasks.
  • Common patterns (role, task decomposition, format specification, context window management) encode proven solutions to recurring prompt design challenges.
  • The clarity-verbosity tradeoff resolves in favor of clarity: every sentence should provide necessary information or meaningful constraint.
  • Multimodal prompting extends the same principles to vision-language inputs -- describe the image, constrain visual attention, annotate when possible, and manage visual token cost.
  • Formatting conventions are not cosmetic: XML tags, markdown structure, and chat template choices measurably affect model compliance and should be adapted per model family.
  • Reasoning models require a different prompting strategy: specify the task and desired outcome, not the reasoning process, and avoid redundant chain-of-thought scaffolding.
  • Prompt economics -- token cost, caching, latency -- is a first-class design concern at production scale; cache-friendly prompt architectures with long stable prefixes pay for themselves quickly.
  • Model-specific considerations matter but are secondary to the universal principles; prompts should be tested against the specific model they will be used with.
  • Treating prompts as software artifacts -- with version control, testing, iteration, and documentation -- is essential for production applications.
  • The instruction hierarchy (system > tools > user > examples > implicit) governs how conflicting instructions are resolved, and effective prompt engineering works with this hierarchy rather than against it.

The field of prompt engineering continues to evolve rapidly as models become more capable and as our understanding of how prompts influence model behavior deepens. The principles outlined here represent the current state of knowledge, but practitioners should expect these to be refined as empirical research progresses.
