Manually crafting prompts is labor-intensive, brittle across model versions, and difficult to systematically improve. A growing body of research treats prompt design as an optimization problem, applying search algorithms, gradient-free optimization, gradient-based text differentiation, and compiler-like abstractions to automatically discover prompts that maximize task performance. This article examines the leading frameworks and techniques -- DSPy, APE, OPRO, TextGrad, prompt compression, multi-objective optimization, and meta-prompting -- analyzing how each works, what problems it solves, and how these automated approaches fit into practical prompt development workflows.
Prerequisite reading. This article assumes familiarity with prompting fundamentals covered in Article 07 -- Prompt Engineering Fundamentals and the few-shot and chain-of-thought techniques discussed in Article 08 -- Few-Shot & Chain-of-Thought Prompting. Where prompt optimization intersects with retrieval-augmented generation, see Article 18 -- RAG Evaluation for evaluation methodology that applies equally to optimized retrieval prompts.
Manual prompt engineering suffers from several structural problems:
Non-transferability. A prompt optimized for GPT-4 may perform poorly on Claude or Llama. When models are updated (even minor version bumps), previously effective prompts can degrade. Manual re-optimization is expensive and slow.
Local optima. Human prompt engineers tend to iterate incrementally, making small changes to existing prompts. This hill-climbing approach often gets stuck in local optima -- prompt configurations that are "good enough" but far from the best achievable performance.
Evaluation gaps. Without systematic evaluation, prompt engineers rely on intuition and spot-checking. This misses failure modes that only appear on specific input distributions or edge cases.
Scale limitations. A human can reasonably iterate on a handful of prompts per day. Automated optimization can explore thousands of prompt variants in the same time, systematically mapping the performance landscape.
These problems motivate the development of automated prompt optimization techniques that treat prompts as learnable parameters in a system, analogous to how weights are learnable parameters in neural networks.
DSPy (Khattab et al., 2023, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines"; substantially updated in DSPy 2.x, 2024) represents the most ambitious rethinking of how developers interact with language models. Its central thesis: developers should specify what they want (signatures, modules) rather than how to prompt (specific prompt text). DSPy then "compiles" these specifications into optimized prompts or fine-tuning configurations.
The 2.x release streamlined the API surface, unified the optimizer interface, and introduced MIPROv2 as the recommended default optimizer for most workloads.
The key abstractions in DSPy are:
Signatures define input-output specifications for LM calls:
import dspy
# A signature is a declarative specification
class SentimentAnalysis(dspy.Signature):
"""Classify the sentiment of a product review."""
review: str = dspy.InputField(desc="Product review text")
sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
confidence: float = dspy.OutputField(desc="Confidence between 0 and 1")
Modules are composable building blocks that use signatures:
class ReviewAnalyzer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(SentimentAnalysis)

    def forward(self, review):
        return self.classify(review=review)
Optimizers (called "teleprompters" in DSPy 1.x; the term was retired in favor of "optimizers" in 2.x) automatically optimize the prompts used by modules:
# DSPy 2.x API
# Define a simple metric
def accuracy_metric(example, prediction, trace=None):
return prediction.sentiment == example.sentiment
# Compile the module with optimization
optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=4)
optimized_analyzer = optimizer.compile(
ReviewAnalyzer(),
trainset=train_examples
)
DSPy's optimization process operates at multiple levels:
Few-shot example selection. The BootstrapFewShot optimizer selects demonstrations from the training set that maximize task performance. Unlike random selection, it evaluates each candidate example's contribution to downstream accuracy and selects the combination that performs best.
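A minimal sketch of the bootstrapping idea (illustrative only, not DSPy's actual internals): run the program as its own teacher over training inputs, keep the traces the metric accepts, and reuse them as demonstrations.

def bootstrap_demos(program, trainset, metric, max_demos=4):
    # Illustrative sketch of bootstrapped demonstration selection --
    # a simplification of what BootstrapFewShot does internally
    demos = []
    for example in trainset:
        prediction = program(**example.inputs())
        # Keep only traces the metric accepts; these become demonstrations
        if metric(example, prediction):
            demos.append((example.inputs(), prediction))
        if len(demos) >= max_demos:
            break
    return demos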
Instruction optimization. MIPROv2 (Multi-prompt Instruction Proposal Optimizer v2, the recommended optimizer in DSPy 2.x) generates and evaluates candidate instructions for each module, searching for instruction text that maximizes the evaluation metric. MIPROv2 improved upon the original MIPRO in several ways, including an auto mode that sets hyperparameters based on dataset size and compute budget.
Pipeline optimization. For multi-step pipelines (e.g., retrieve-then-read, chain-of-thought-then-classify), DSPy optimizes each step jointly, accounting for how the output of one step affects the performance of the next. This is where DSPy's advantage over manual prompt engineering is most pronounced -- the interaction effects between stages in a multi-hop pipeline are nearly impossible to reason about by hand (see Article 08 for the chain-of-thought techniques that DSPy can automatically apply and optimize).
Tip: When switching models (e.g., GPT-4 to Claude), don't rewrite prompts manually -- recompile your DSPy program against the new model. The optimizer discovers prompts tailored to the new model's characteristics.
# DSPy 2.x API -- MIPROv2 is the recommended default optimizer
optimizer = dspy.MIPROv2(
metric=accuracy_metric,
auto="medium", # auto-configures based on compute budget
num_threads=4,
)
optimized = optimizer.compile(
ReviewAnalyzer(),
trainset=train_examples,
valset=val_examples,
)
# The optimized module now carries the automatically discovered
# instructions and few-shot examples. Save it, then inspect recent
# LM calls with the module-level helper:
optimized.save("optimized_analyzer.json")
dspy.inspect_history(n=3)
The term "compilation" in DSPy is deliberate. Just as a compiler translates high-level source code into optimized machine code, DSPy translates high-level task specifications into optimized prompts. The developer works at the specification level, and the compiler handles the prompt engineering.
This has profound implications for prompt portability. When switching from GPT-4 to Claude, a DSPy developer does not rewrite prompts -- they recompile their program against the new model. The optimizer discovers new prompts that work well with the new model's specific characteristics. This aligns with the fundamental prompting principle discussed in Article 07: different models respond differently to the same instructions, and what works for one model may be suboptimal for another.
# Same program, different models -- DSPy 2.x API
dspy.configure(lm=dspy.LM("openai/gpt-4o"))
gpt4_optimized = optimizer.compile(analyzer, trainset=data)
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet"))
claude_optimized = optimizer.compile(analyzer, trainset=data)
DSPy shines brightest in multi-stage reasoning pipelines where manual prompt engineering becomes unwieldy:
class MultiHopQA(dspy.Module):
    def __init__(self, num_hops=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
self.generate_query = [
dspy.ChainOfThought("context, question -> search_query")
for _ in range(num_hops)
]
self.generate_answer = dspy.ChainOfThought(
"context, question -> answer"
)
def forward(self, question):
context = []
for hop in range(len(self.generate_query)):
query = self.generate_query[hop](
context=context, question=question
).search_query
passages = self.retrieve(query).passages
            # deduplicate: small helper that drops repeated passages
            context = deduplicate(context + passages)
return self.generate_answer(context=context, question=question)
Optimizing this pipeline manually would require tuning prompts for each query generation step and the final answer generation, accounting for how they interact. DSPy handles this automatically through joint optimization.
Zhou et al. (2023) introduced APE (Automatic Prompt Engineer) in "Large Language Models Are Human-Level Prompt Engineers," demonstrating that LLMs themselves can generate effective prompts through a systematic search process.
APE works in three phases:
Phase 1 -- Proposal generation. Given a set of input-output examples, APE asks a language model to propose instruction candidates that could produce the desired outputs from the given inputs.
I have some input-output pairs. Please generate an instruction
that could have produced these outputs from these inputs.
Input: "The movie was absolutely terrible"
Output: "negative"
Input: "Best purchase I've ever made!"
Output: "positive"
Input: "It works fine, nothing special"
Output: "neutral"
Propose 10 different instructions that would produce these outputs:
Phase 2 -- Evaluation. Each proposed instruction is evaluated on a held-out set of examples. The evaluation uses the proposed instruction as a prompt and measures how often it produces the correct output.
Phase 3 -- Refinement. The best-performing instructions are refined through iterative resampling, taking the top instructions and asking the model to generate variations that might perform even better.
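The loop below leans on a scoring helper. One plausible sketch of evaluate_instruction (a hypothetical helper, not from the APE codebase) measures exact-match accuracy of an instruction on held-out input-output pairs:

def evaluate_instruction(instruction, held_out, model):
    # Hypothetical Phase 2 scorer: how often does the instruction
    # produce the expected output on held-out pairs?
    correct = 0
    for ex in held_out:
        output = model.generate(
            f"{instruction}\n\nInput: {ex.input}\nOutput:"
        )
        correct += int(output.strip().lower() == ex.output.strip().lower())
    return correct / len(held_out)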
def ape_optimize(examples, model, n_proposals=50, n_iterations=3):
    # Phase 1: Generate initial proposals from a few demonstrations
    proposals = model.generate_instructions(
        examples=examples[:10],
        n=n_proposals
    )
    best = None
    for iteration in range(n_iterations):
        # Phase 2: Evaluate each candidate on held-out examples
        scores = []
        for instruction in proposals:
            accuracy = evaluate_instruction(
                instruction, examples[10:], model
            )
            scores.append((instruction, accuracy))
        # Track the best instruction seen so far across all iterations
        round_best = max(scores, key=lambda x: x[1])
        if best is None or round_best[1] > best[1]:
            best = round_best
        if iteration == n_iterations - 1:
            break  # don't generate variations that would never be scored
        # Phase 3: Refine top candidates by resampling variations
        top_k = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
        proposals = []
        for instruction, _ in top_k:
            variations = model.generate_variations(
                instruction, n=n_proposals // 5
            )
            proposals.extend(variations)
    return best
The APE paper contained several surprising results:
LLM-generated prompts matched or exceeded human-written prompts on 24 of 24 NLP tasks tested, including tasks where significant prompt engineering effort had been invested.
The best prompts were often non-obvious. For some tasks, the optimal prompt contained phrasing that a human prompt engineer would not have considered. This supports the idea that the prompt space contains good solutions that human intuition fails to discover.
Zero-shot CoT emerged naturally. On reasoning tasks, APE independently discovered prompts similar to "Let's think step by step" -- the same technique that Kojima et al. (2022) reported as a human insight. This suggests that the space of effective prompting strategies is discoverable through search.
Yang et al. (2023) introduced OPRO (Optimization by PROmpting) in "Large Language Models as Optimizers," taking a different approach to prompt optimization. Instead of using a model to propose and refine instructions, OPRO treats the optimization problem itself as a prompting task.
OPRO maintains a "trajectory" of previous optimization attempts and their scores, then asks the model to propose new candidates that improve upon the best seen so far:
Your task is to generate an instruction for a language model
to perform sentiment classification.
Here are some previous instructions and their accuracy scores:
Instruction: "Classify the sentiment"
Score: 0.72
Instruction: "Determine if the text is positive or negative"
Score: 0.78
Instruction: "Read the text carefully and classify its emotional tone
as positive, negative, or neutral"
Score: 0.85
Generate a new instruction that scores higher than 0.85:
The optimization trajectory provides the model with a gradient-like signal: it can see which instructions performed well and which performed poorly, and it can (implicitly) identify the patterns that distinguish good instructions from bad ones.
| Strength | Description |
|---|---|
| Simplicity | No special tooling needed; the full optimization loop fits in a few dozen lines |
| Interpretability | Trajectory is natural language -- inspect it to understand why prompts work |
| Meta-learning | Later iterations improve as the model accumulates context about what works |
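The loop below assumes a build_meta_prompt helper. A minimal sketch, following the paper's convention of sorting the trajectory ascending by score so the strongest prompts sit last in the context:

def build_meta_prompt(trajectory, top_n=20):
    # Ascending sort: the best-scoring prompts appear last, closest
    # to the generation point
    shown = sorted(trajectory, key=lambda x: x["score"])[-top_n:]
    lines = [
        "Your task is to generate an instruction for a language model.",
        "Here are some previous instructions and their accuracy scores:",
        "",
    ]
    for entry in shown:
        lines.append(f'Instruction: "{entry["prompt"]}"')
        lines.append(f'Score: {entry["score"]:.2f}')
        lines.append("")
    lines.append(
        f'Generate a new instruction that scores higher than '
        f'{shown[-1]["score"]:.2f}:'
    )
    return "\n".join(lines)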
def opro_optimize(task_examples, model, max_iterations=20):
trajectory = []
# Seed with a few initial candidates
initial_prompts = [
"Classify the following text.",
"Analyze the sentiment of this text.",
"What is the sentiment? Respond with positive/negative/neutral."
]
for prompt in initial_prompts:
score = evaluate_prompt(prompt, task_examples, model)
trajectory.append({"prompt": prompt, "score": score})
for i in range(max_iterations):
# Build the meta-prompt with trajectory
meta_prompt = build_meta_prompt(trajectory)
# Generate new candidate
new_prompt = model.generate(meta_prompt)
# Evaluate
score = evaluate_prompt(new_prompt, task_examples, model)
trajectory.append({"prompt": new_prompt, "score": score})
# Early stopping
if score >= 0.95:
break
return max(trajectory, key=lambda x: x["score"])
Meta-prompting extends the idea of using LLMs to optimize prompts into a more general framework where LLMs reason about and improve their own prompting strategies.
Suzgun and Kalai (2024) formalized meta-prompting as a system where a "conductor" model orchestrates multiple "expert" models, each specialized for different aspects of a task. The conductor decides how to decompose a problem, which experts to invoke, and how to synthesize their outputs.
class MetaPromptSystem:
def __init__(self, conductor_model, expert_models):
self.conductor = conductor_model
self.experts = expert_models
def solve(self, problem):
# Conductor analyzes the problem and creates a plan
plan = self.conductor.generate(f"""
Analyze this problem and create a plan for solving it.
Available experts: {list(self.experts.keys())}
Problem: {problem}
For each step, specify which expert to use and what to ask them.
""")
        # Execute the plan (parse_plan is a small helper, omitted here,
        # that turns the conductor's plan text into structured steps)
        context = {"problem": problem}
        for step in parse_plan(plan):
expert = self.experts[step.expert_name]
result = expert.generate(
step.prompt.format(**context)
)
context[step.output_name] = result
# Conductor synthesizes final answer
return self.conductor.generate(f"""
Based on the following expert analyses, provide the final answer.
{format_context(context)}
""")
A powerful meta-prompting pattern involves having the model critique and improve its own prompts:
def self_refining_prompt(task_description, examples, model, iterations=3):
# Start with a simple prompt
current_prompt = f"Perform the following task: {task_description}"
for i in range(iterations):
# Evaluate current prompt
results = evaluate_on_examples(current_prompt, examples, model)
failures = [r for r in results if not r["correct"]]
if not failures:
break
# Ask model to analyze failures and improve the prompt
improvement = model.generate(f"""
The following prompt was used for a task:
"{current_prompt}"
It failed on these examples:
{format_failures(failures)}
Analyze why it failed and generate an improved prompt
that would handle these cases correctly while maintaining
performance on the cases it already handles well.
""")
current_prompt = extract_improved_prompt(improvement)
return current_prompt
This approach combines the strengths of automated evaluation (systematic coverage) with the model's ability to reason about prompt design (understanding why failures occur).
While APE and OPRO treat prompt optimization as a black-box search problem, TextGrad (Yuksekgonul et al., 2024, "TextGrad: Automatic 'Differentiation' via Text") introduces a fundamentally different paradigm: using LLMs themselves as differentiable operators to perform backpropagation through natural language. The core insight is that if a language model can evaluate how well a piece of text achieves an objective, it can also produce "textual gradients" -- natural language feedback describing how the text should change to better satisfy that objective.
In standard neural network training, backpropagation computes how each parameter contributes to the loss, then adjusts parameters in the direction that reduces the loss. TextGrad mirrors this structure but operates entirely in text space:
# Conceptual TextGrad-style optimization loop
def textgrad_optimize(prompt, examples, llm, evaluator, steps=5):
current_prompt = prompt
for step in range(steps):
# Forward pass: run the prompt on examples
outputs = [llm.generate(current_prompt.format(input=ex.input))
for ex in examples]
# Loss evaluation: get textual feedback
feedback = evaluator.generate(f"""
The following prompt was used:
"{current_prompt}"
Here are the outputs and expected results:
{format_pairs(outputs, examples)}
Provide specific, actionable feedback on how to modify the
prompt text to improve correctness. Focus on what the prompt
should say differently, not what the model should do differently.
""")
# Backward pass + update: revise the prompt using the gradient
current_prompt = llm.generate(f"""
Original prompt: "{current_prompt}"
Feedback (textual gradient): {feedback}
Rewrite the prompt incorporating this feedback.
Preserve any instructions that are working well.
Output only the revised prompt.
""")
return current_prompt
OPRO provides the optimizer with a trajectory of (prompt, score) pairs and asks it to propose something better. TextGrad provides structured feedback about why a prompt failed and how to fix it -- a richer optimization signal, analogous to the difference between zeroth-order optimization (only function values) and first-order optimization (function values plus gradients).
In practice, TextGrad converges faster on complex tasks where the relationship between prompt wording and output quality is nuanced. The original paper demonstrates its application beyond prompts: code optimization, molecular design, and reasoning chain improvement. For prompt engineering specifically, TextGrad is most valuable when the evaluation criteria are complex and multi-dimensional -- precisely the situations where a scalar score provides insufficient signal for improvement.
TextGrad inherits the limitations of its LLM evaluator. If the evaluator cannot accurately assess output quality, the textual gradients will be noisy or misleading.
As context windows have grown, so has the tendency to fill them. Long system prompts, extensive few-shot examples, and verbose retrieval contexts consume tokens that cost money and add latency. Prompt compression techniques aim to reduce token count while preserving the information that matters for task performance.
LLMLingua (Jiang et al., 2023) and its successor LongLLMLingua (Jiang et al., 2024) use a small, fast language model (e.g., GPT-2 or LLaMA-7B) to estimate the "importance" of each token in a prompt by computing perplexity contributions.
Tokens that contribute little information -- filler words, redundant phrasing, predictable continuations -- are dropped, while tokens carrying high information density are retained.
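As a concrete sketch, per-token importance can be estimated with a small causal LM via Hugging Face transformers; here GPT-2 scores each token's negative log-likelihood given its prefix (higher = harder to predict = more informative). This is a simplified stand-in for LLMLingua's actual scoring machinery.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def per_token_nll(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position's logits predict the *next* token, so shift by one
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nll = -log_probs[torch.arange(targets.size(0)), targets]
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()),
                    nll.tolist()))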
The compression process operates at three levels:
Demonstration-level compression. When a prompt contains multiple few-shot examples (see Article 08 for few-shot selection strategies), LLMLingua ranks demonstrations by their relevance to the current query and drops the least informative ones first.
Sentence-level compression. Within each demonstration or instruction block, sentences that are largely redundant or tangential are pruned.
Token-level compression. The remaining text is compressed by removing low-perplexity tokens -- words the model can predict from context without being told explicitly.
# Conceptual illustration of perplexity-based token compression
def compress_prompt(prompt, small_lm, target_ratio=0.5):
tokens = tokenize(prompt)
# Compute per-token perplexity using a small, fast model
perplexities = small_lm.per_token_perplexity(tokens)
# Higher perplexity = more informative = keep
# Lower perplexity = more predictable = safe to drop
n_keep = int(len(tokens) * target_ratio)
importance_indices = sorted(
range(len(tokens)),
key=lambda i: perplexities[i],
reverse=True
)[:n_keep]
# Preserve original token order
kept_indices = sorted(importance_indices)
return detokenize([tokens[i] for i in kept_indices])
Empirical results from the LLMLingua papers show that prompts can often be compressed to 50% of their original length with less than 2% degradation in task performance. At higher compression ratios (20-30% of original length), performance drops more sharply but remains surprisingly strong on tasks where the critical information is concentrated in a few key tokens.
For retrieval-augmented generation pipelines, prompt compression is particularly valuable. Retrieved passages often contain significant redundancy -- boilerplate, repeated facts, tangentially relevant sentences. Compressing these passages before injection into the prompt reduces cost and latency while focusing the model's attention on the most relevant content.
Note: This technique pairs naturally with the RAG evaluation methods discussed in Article 18: a faithfulness metric can verify that compression has not removed information the model needs to generate correct answers.
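As a sketch of how compression slots into a RAG pipeline (reusing compress_prompt from above; retriever and build_prompt are hypothetical stand-ins for your own components):

def compressed_rag_answer(question, retriever, small_lm, llm):
    passages = retriever.search(question, k=10)
    # Compress each passage before it consumes context-window tokens
    compressed = [
        compress_prompt(p.text, small_lm, target_ratio=0.5)
        for p in passages
    ]
    prompt = build_prompt(question=question, context="\n\n".join(compressed))
    return llm.generate(prompt)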
Prompt compression introduces a trade-off between token savings and the risk of removing critical information. In practice, teams should start at a conservative compression ratio, measure task performance on an evaluation set at each ratio, and back off as soon as the metrics that matter begin to degrade.
Most prompt optimization research focuses on a single metric -- typically accuracy. Production systems must balance multiple, often competing, objectives: accuracy, cost, latency, safety, and format compliance. Multi-objective prompt optimization addresses this explicitly.
Consider a prompt that achieves 95% accuracy but uses 2,000 tokens (including a 1,500-token system prompt with 10 few-shot examples) versus a prompt that achieves 89% accuracy with only 200 tokens. Neither is strictly "better" -- the right choice depends on the cost and latency budget. Multi-objective optimization maps out the Pareto frontier: the set of prompt configurations where no objective can be improved without degrading another.
class MultiObjectivePromptOptimizer:
def __init__(self, objectives, model):
"""
objectives: list of (name, function, direction) tuples
e.g., [("accuracy", acc_fn, "maximize"),
("cost", cost_fn, "minimize"),
("latency_ms", latency_fn, "minimize")]
"""
self.objectives = objectives
self.model = model
def evaluate_candidate(self, prompt, eval_data):
scores = {}
for name, fn, _ in self.objectives:
scores[name] = fn(prompt, eval_data, self.model)
return scores
    def is_pareto_dominated(self, scores_a, scores_b):
        """Returns True if scores_b dominates scores_a: at least as good
        on every objective and strictly better on at least one."""
        at_least_as_good = True
        strictly_better = False
        for name, _, direction in self.objectives:
            a, b = scores_a[name], scores_b[name]
            if direction == "minimize":
                a, b = -a, -b  # flip so higher is better on every axis
            if b < a:
                at_least_as_good = False
            elif b > a:
                strictly_better = True
        return at_least_as_good and strictly_better
def pareto_frontier(self, candidates, eval_data):
evaluated = [
(c, self.evaluate_candidate(c, eval_data))
for c in candidates
]
frontier = []
for i, (c_i, s_i) in enumerate(evaluated):
is_dominated = any(
self.is_pareto_dominated(s_i, s_j)
for j, (_, s_j) in enumerate(evaluated) if i != j
)
if not is_dominated:
frontier.append((c_i, s_i))
return frontier
Few-shot example budgeting. Each few-shot example improves accuracy but increases cost and latency. Multi-objective optimization finds the smallest set of examples that meets the accuracy threshold. This directly extends the few-shot selection principles from Article 08 -- rather than adding examples until accuracy saturates, add examples until the marginal accuracy gain no longer justifies the marginal cost.
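A minimal sketch of that stopping rule (build_prompt, evaluate, the per-demo token_count attribute, and the threshold are illustrative assumptions):

def budget_examples(candidates, build_prompt, evaluate,
                    min_gain_per_1k_tokens=0.005):
    selected = []
    best_acc = evaluate(build_prompt(selected))  # zero-shot baseline
    for demo in candidates:
        acc = evaluate(build_prompt(selected + [demo]))
        gain = acc - best_acc
        # Stop when marginal accuracy per 1k added tokens is too small
        if gain / (demo.token_count / 1000) < min_gain_per_1k_tokens:
            break
        selected.append(demo)
        best_acc = acc
    return selected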
Model routing. Different queries have different difficulty levels. A multi-objective system can route easy queries to smaller, cheaper models and hard queries to larger models. The prompt optimizer produces different optimized prompts for each model tier, and a lightweight classifier (itself an optimized prompt) handles routing.
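A sketch of the routing pattern; the router prompt and the per-tier prompts (easy_prompt and hard_prompt, assumed to come from per-tier optimization) are illustrative:

def route_and_answer(query, small_model, large_model, router_model,
                     easy_prompt, hard_prompt):
    # The router is itself a cheap, optimized classifier prompt
    verdict = router_model.generate(
        f"Classify the difficulty of this query as EASY or HARD.\n"
        f"Query: {query}\nDifficulty:"
    ).strip().upper()
    if verdict.startswith("EASY"):
        return small_model.generate(easy_prompt.format(query=query))
    return large_model.generate(hard_prompt.format(query=query))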
Cascaded evaluation. Instead of running every candidate prompt against the full evaluation suite, multi-objective optimization can use a cheap preliminary evaluation (small sample, fast metric) to prune clearly dominated candidates, then evaluate only the Pareto contenders on the full suite. This reduces the optimization cost itself -- an important meta-objective.
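A sketch of the cascade; the survivor fraction is an illustrative assumption:

def cascaded_evaluate(candidates, cheap_eval, full_eval, keep_fraction=0.2):
    # Stage 1: fast, noisy screening prunes clearly dominated candidates
    screened = sorted(candidates, key=cheap_eval, reverse=True)
    survivors = screened[:max(1, int(len(screened) * keep_fraction))]
    # Stage 2: the expensive full suite runs only on the survivors
    return max(survivors, key=full_eval)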
In production systems, prompt optimization objectives should map directly to business metrics: accuracy to answer quality, token count to serving cost, latency to user experience, and safety violations to compliance risk.
A multi-objective optimizer that surfaces the Pareto frontier enables product teams to make informed trade-offs rather than defaulting to "maximize accuracy at any cost" -- which is the implicit objective of most single-metric optimization approaches.
It is important to distinguish prompt optimization (which modifies the text of prompts) from prompt tuning (which optimizes continuous embedding vectors prepended to the input). These are fundamentally different techniques despite similar names.
Prompt tuning, introduced by Lester et al. (2021) in "The Power of Scale for Parameter-Efficient Prompt Tuning," adds learnable continuous vectors (soft prompts) to the model's input embedding space. These vectors are optimized through gradient descent against a task-specific loss function.
import torch
import torch.nn as nn

# Conceptual illustration of prompt tuning
class PromptTunedModel:
    def __init__(self, base_model, n_virtual_tokens=20):
        self.model = base_model
        # Learnable prompt embeddings ("soft prompt")
        self.prompt_embeddings = nn.Parameter(
            torch.randn(n_virtual_tokens, base_model.embed_dim)
        )

    def forward(self, input_ids):
        input_embeds = self.model.embed(input_ids)
        # Prepend the soft prompt to every sequence in the batch
        soft_prompt = self.prompt_embeddings.unsqueeze(0).expand(
            input_embeds.size(0), -1, -1
        )
        combined = torch.cat([soft_prompt, input_embeds], dim=1)
        return self.model(inputs_embeds=combined)

    def train(self, dataset):
        # Only the soft prompt receives gradients; base weights stay frozen
        optimizer = torch.optim.Adam([self.prompt_embeddings])
        for batch in dataset:
            optimizer.zero_grad()
            loss = self.forward(batch).loss
            loss.backward()
            optimizer.step()
Prompt tuning requires access to model gradients, which means it only works with open-weight models. It is more sample-efficient than full fine-tuning but produces prompts that are not human-readable (they exist in embedding space, not token space).
Prompt optimization (DSPy, APE, OPRO) modifies actual text prompts through search and evaluation. It works with any model, including closed APIs, and produces human-readable prompts that can be inspected, understood, and manually adjusted.
| Dimension | Prompt Tuning | Prompt Optimization |
|---|---|---|
| Requires model gradients | Yes | No |
| Works with closed APIs | No | Yes |
| Output is human-readable | No | Yes |
| Sample efficiency | Higher | Lower |
| Optimization quality | Higher (continuous) | Lower (discrete) |
| Portability across models | No (embedding-specific) | Partial (re-optimize) |
For most production applications using closed APIs, prompt optimization is the practical choice. Prompt tuning is relevant for teams deploying open-weight models who need maximum performance on specific tasks.
Production prompt optimization follows a systematic pipeline:
import numpy as np

class PromptOptimizationPipeline:
def __init__(self, task, eval_suite, model, optimizer):
self.task = task
self.eval_suite = eval_suite
self.model = model
self.optimizer = optimizer
def run(self, n_iterations=10):
results_log = []
for i in range(n_iterations):
# Optimize prompt
candidate = self.optimizer.propose(
task=self.task,
history=results_log
)
# Evaluate
scores = self.eval_suite.evaluate(
prompt=candidate,
model=self.model
)
            results_log.append({
                "iteration": i,
                "prompt": candidate,
                "scores": scores,
                # Aggregate: mean of the per-metric means from the suite
                "aggregate": np.mean([s["mean"] for s in scores.values()])
            })
# Check convergence
if self._has_converged(results_log):
break
# Return best prompt
best = max(results_log, key=lambda x: x["aggregate"])
return best
def _has_converged(self, log, window=3, threshold=0.01):
if len(log) < window:
return False
recent = [l["aggregate"] for l in log[-window:]]
return max(recent) - min(recent) < threshold
The quality of prompt optimization is bounded by the quality of the evaluation suite. A good evaluation suite covers the input distribution the system will see in production, includes known edge cases and past failure modes, and scores every objective that matters rather than accuracy alone:
import numpy as np

class EvalSuite:
def __init__(self, test_cases, metrics):
self.tests = test_cases
self.metrics = metrics
def evaluate(self, prompt, model):
scores = {m.name: [] for m in self.metrics}
for test in self.tests:
output = model.generate(prompt.format(input=test.input))
for metric in self.metrics:
score = metric.compute(output, test.expected, test.metadata)
scores[metric.name].append(score)
return {
name: {
"mean": np.mean(values),
"std": np.std(values),
"min": np.min(values),
"failures": sum(1 for v in values if v < 0.5)
}
for name, values in scores.items()
}
Prompt optimization can be integrated into CI/CD pipelines, automatically re-optimizing prompts when models are updated or when new evaluation data is available:
# .github/workflows/prompt-optimization.yml
name: Prompt Optimization
on:
schedule:
- cron: '0 2 * * 1' # Weekly on Monday at 2 AM
workflow_dispatch:
inputs:
model_version:
description: 'Target model version'
required: true
jobs:
optimize:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run prompt optimization
run: python scripts/optimize_prompts.py
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
MODEL: ${{ inputs.model_version || 'gpt-4o' }}
      - name: Compare with current prompts
        id: compare  # exposes the outputs referenced by the next step
        run: python scripts/compare_prompts.py
      - name: Create PR if improved
        if: steps.compare.outputs.improved == 'true'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # required by the gh CLI
        run: |
          gh pr create \
            --title "Prompt optimization: +${{ steps.compare.outputs.improvement }}% accuracy" \
            --body "Automated prompt optimization results..."
Manual optimization is appropriate when the task is new and its requirements are still being discovered, when no reliable evaluation set or metric exists yet, or when the prompt encodes nuanced policy that must remain human-authored and human-reviewed.
Automated optimization is appropriate when a labeled evaluation set and a trustworthy metric are available, when prompts must be re-tuned across model versions or providers, or when the system is a multi-stage pipeline whose interaction effects are impractical to tune by hand.
For teams adopting prompt optimization, DSPy offers the most complete framework. A practical starting point:
- Start with the simplest module that can express the task (dspy.Predict, not ChainOfThought)
- Compile with dspy.MIPROv2(auto="light") for an initial baseline, then increase to auto="medium" or auto="heavy" as compute budget allows

This gives you a reproducible baseline faster than manual iteration.
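Putting the pieces together, a minimal end-to-end starter; the model name and dataset variables are illustrative, and the signature and metric reuse the earlier examples:

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Simplest module that can express the task
classifier = dspy.Predict(SentimentAnalysis)

def accuracy_metric(example, prediction, trace=None):
    return prediction.sentiment == example.sentiment

optimizer = dspy.MIPROv2(metric=accuracy_metric, auto="light")
optimized = optimizer.compile(classifier, trainset=train_examples)
optimized.save("sentiment_v1.json")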