Building robust evaluation datasets is one of the most time-consuming bottlenecks in LLM application development. DeepEval's Synthesizer addresses this by providing a systematic pipeline for generating high-quality synthetic test data -- called "Goldens" -- that can be used to evaluate LLM systems without manually authoring hundreds of test cases. The Synthesizer supports four distinct generation strategies, each suited to different data availability scenarios: generating from documents, from pre-built contexts, from scratch without any knowledge base, and from existing goldens. Each strategy produces both single-turn and multi-turn (conversational) goldens, and the entire pipeline is governed by configurable filtration, evolution, and styling stages that control the quality, complexity, and format of the generated data.
What Are Goldens?
A Golden is a single test case in DeepEval's evaluation framework. Each Golden represents a structured data point with fields that correspond to inputs, expected outputs, and supporting context. The Synthesizer generates these programmatically so you can build evaluation datasets at scale.
For single-turn goldens, a Golden object contains:
- input: The generated question or prompt
- actual_output: Always None after generation -- this field is populated later when you run the input through your LLM application
- expected_output: A reference answer generated by the Synthesizer (when include_expected_output=True)
- context: The knowledge base source text used to ground the golden
- retrieval_context: Retrieved passages (populated during evaluation)
- source_file: The original document the context was extracted from
For multi-turn goldens, a ConversationalGolden object follows the same structure but represents a full conversation with multiple turns, and uses expected_outcome instead of expected_output.
A critical distinction: the Synthesizer does not generate actual_output values. These come from running your LLM application against the generated inputs. Similarly, multi-turn generation does not create individual turns -- use DeepEval's ConversationSimulator for that purpose.
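The division of labour described above can be sketched in plain Python. This is only an illustration under stated assumptions: `SketchGolden` is a hypothetical stand-in for the library's Golden class (it is not the DeepEval type), and `my_llm_app` is a placeholder for your own application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchGolden:
    """Hypothetical stand-in for a single-turn golden (not the DeepEval class)."""
    input: str
    expected_output: Optional[str] = None
    context: List[str] = field(default_factory=list)
    actual_output: Optional[str] = None  # left empty by the generator


def my_llm_app(prompt: str) -> str:
    """Placeholder for the system under test; swap in a call to your real application."""
    return f"stub answer to: {prompt}"


def fill_actual_outputs(goldens: List[SketchGolden]) -> List[SketchGolden]:
    """Populate actual_output by running each generated input through the application."""
    for golden in goldens:
        golden.actual_output = my_llm_app(golden.input)
    return goldens


goldens = [SketchGolden(input="What is the capital of France?", expected_output="Paris")]
filled = fill_actual_outputs(goldens)
```

The generator side only ever writes `input`, `expected_output`, and `context`; the `actual_output` field is filled by a separate step that calls the application being evaluated.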
Mental Model
The mental model for the Synthesizer is a quality-controlled factory for test cases: raw context goes in, structured goldens (an input, its grounding context, and optionally an expected output, but never an actual output) come out, and every station on the line either generates, mutates, or rejects. The point is not volume -- it is producing goldens that are hard in the ways your evaluation cares about, which is why evolution (deliberately complicating inputs) and filtration (regenerating weak ones) matter more than raw count.
This only makes sense anchored to a downstream metric: a golden is good if it would expose a real weakness when scored by your evaluation metrics and an LLM-as-judge. The Synthesizer is the supply side of that loop; the judge is the demand side. Concretely, the Synthesizer never invents actual_output -- it manufactures the probe (input plus the context the answer must be grounded in) and leaves the system under test to produce the answer the judge then grades. That separation is the main reason synthetic goldens can be trusted: the difficulty is engineered at generation time and the correctness signal is decided independently at scoring time, so the generator cannot hand the system under test a passing answer, and a failing golden is more likely to point at the model than at a fabricated actual output.
The Generation Pipeline
Regardless of which generation method you use, the Synthesizer follows a four-stage pipeline:
Stage 1: Input Generation
The Synthesizer creates synthetic input values -- questions, prompts, or queries -- based on the source material you provide. The mechanism varies by method:
- From docs: Documents are parsed, chunked, embedded, and stored in a vector database. Contexts are selected and grouped, then used to generate inputs.
- From contexts: Pre-prepared context lists are used directly, skipping document processing.
- From scratch: Inputs are generated purely from a task description and scenario, with no source documents or contexts.
- From goldens: Existing goldens serve as templates for generating new, varied inputs.
Stage 2: Filtration
Generated inputs are scored on quality (0 to 1) based on two criteria:
- Self-containment: Can the input be understood without external references?
- Clarity: Does the input have clear intent without ambiguity?
Inputs scoring below the synthetic_input_quality_threshold (default 0.5) are regenerated up to max_quality_retries (default 3) times. If all retries fail, the highest-scoring attempt is used. This is controlled via FiltrationConfig.
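The retry behaviour can be sketched as a small loop. This is a minimal illustration, not DeepEval's implementation: `generate` and `score` are hypothetical callables standing in for the generation and critic LLM calls, and the sketch assumes one initial attempt plus up to `max_quality_retries` regenerations.

```python
from typing import Callable, Tuple


def generate_with_filtration(
    generate: Callable[[], str],
    score: Callable[[str], float],
    quality_threshold: float = 0.5,
    max_quality_retries: int = 3,
) -> Tuple[str, float]:
    """Regenerate until a candidate meets the threshold; otherwise return the best attempt."""
    best_input, best_score = "", -1.0
    # Assumption: one initial attempt plus up to max_quality_retries regenerations.
    for _ in range(1 + max_quality_retries):
        candidate = generate()
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_input, best_score = candidate, candidate_score
        if candidate_score >= quality_threshold:
            break  # good enough, stop paying for more attempts
    return best_input, best_score


# Deterministic stand-ins so the sketch runs without an LLM.
drafts = iter([
    "What about it?",
    "Explain the thing.",
    "How many days does Earth take to orbit the Sun?",
])
scores = {
    "What about it?": 0.2,
    "Explain the thing.": 0.4,
    "How many days does Earth take to orbit the Sun?": 0.9,
}

chosen, chosen_score = generate_with_filtration(lambda: next(drafts), scores.__getitem__)
```

Here the first two drafts fall below 0.5, so the loop keeps going and stops at the third. If no draft had passed, the function would have returned the highest-scoring one seen.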
Stage 3: Evolution
Inputs are evolved through complexity-increasing transformations inspired by the Evol-Instruct methodology (WizardLM). Each input can undergo one or more evolution steps, producing more challenging and realistic test cases. Seven evolution types are available:
| Evolution Type | Description | Context-Adherent |
|---|---|---|
| Evolution.REASONING | Adds logical reasoning requirements | No |
| Evolution.MULTICONTEXT | Requires synthesizing information from multiple context sources | Yes |
| Evolution.CONCRETIZING | Makes inputs more specific and concrete | Yes |
| Evolution.CONSTRAINED | Adds constraints to the expected response | Yes |
| Evolution.COMPARATIVE | Requires comparing multiple options or entities | Yes |
| Evolution.HYPOTHETICAL | Introduces hypothetical scenarios | No |
| Evolution.IN_BREADTH | Explores broader related topics | No |
Context-adherent evolutions use the provided context to guide the transformation, ensuring the evolved input remains answerable from the source material.
Stage 4: Styling
The final stage formats inputs and expected outputs according to your specifications. This is controlled via StylingConfig and allows you to ensure generated data matches the format your application expects.
Synthesizer Initialization
The Synthesizer constructor accepts seven optional parameters that control global behavior across all generation methods:
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
)
synthesizer = Synthesizer(
    model="gpt-4.1",                       # LLM for generation
    async_mode=True,                       # concurrent generation
    max_concurrent=100,                    # max parallel operations
    filtration_config=FiltrationConfig(),  # quality filtering
    evolution_config=EvolutionConfig(),    # complexity evolution
    styling_config=StylingConfig(),        # output formatting
    cost_tracking=True,                    # print LLM costs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str or DeepEvalBaseLLM | "gpt-4.1" | The LLM used for all generation steps. Can be an OpenAI model string or a custom model implementing DeepEvalBaseLLM |
| async_mode | bool | True | Enables concurrent golden generation for faster throughput |
| max_concurrent | int | 100 | Maximum number of parallel generation operations when async_mode=True |
| filtration_config | FiltrationConfig | Default values | Controls quality filtering of generated inputs |
| evolution_config | EvolutionConfig | Default values | Controls complexity evolution of inputs |
| styling_config | StylingConfig | Default values | Controls output formatting and style |
| cost_tracking | bool | False | When enabled, prints LLM API costs during generation |
FiltrationConfig
Filtration ensures generated inputs meet a minimum quality bar. Inputs that score below the threshold are regenerated.
from deepeval.synthesizer.config import FiltrationConfig
filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.5,
    max_quality_retries=3,
)
synthesizer = Synthesizer(filtration_config=filtration_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| critic_model | str or DeepEvalBaseLLM | Synthesizer's model, else "gpt-4.1" | The LLM used to evaluate input quality. Can differ from the generation model |
| synthetic_input_quality_threshold | float | 0.5 | Minimum quality score (0-1) an input must achieve. Higher values produce better inputs but may increase generation time and cost |
| max_quality_retries | int | 3 | Maximum regeneration attempts for inputs below the threshold. After exhausting retries, the highest-scoring attempt is used |
Quality scores are computed based on self-containment (the input is understandable without external references) and clarity (the input has unambiguous intent). These scores are available in the output DataFrame via the synthetic_input_quality column.
EvolutionConfig
Evolution transforms simple inputs into more complex, realistic test cases. You control which evolution types to apply and their relative sampling probabilities.
from deepeval.synthesizer.config import EvolutionConfig
from deepeval.synthesizer import Evolution
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/4,
        Evolution.MULTICONTEXT: 1/4,
        Evolution.CONCRETIZING: 1/4,
        Evolution.CONSTRAINED: 1/4,
    },
    num_evolutions=4,
)
synthesizer = Synthesizer(evolution_config=evolution_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| evolutions | dict[Evolution, float] | Equal distribution | Maps evolution types to their sampling probabilities. Probabilities should sum to 1.0 |
| num_evolutions | int | 1 | Number of sequential evolution steps applied to each input. Higher values produce more complex inputs but increase cost and may reduce naturalness |
When num_evolutions > 1, each step samples an evolution type independently based on the configured probabilities. An input might first be concretized, then have reasoning added, producing a multi-layered complexity increase.
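The per-step sampling can be pictured with the standard library alone. This is a sketch under stated assumptions, not the library's internals: the string labels are hypothetical stand-ins for members of the Evolution enum, and `sample_evolution_steps` is an illustrative helper that does not exist in DeepEval.

```python
import random
from typing import Dict, List

# Hypothetical string labels standing in for members of the Evolution enum.
evolution_weights: Dict[str, float] = {
    "REASONING": 0.25,
    "MULTICONTEXT": 0.25,
    "CONCRETIZING": 0.25,
    "CONSTRAINED": 0.25,
}


def sample_evolution_steps(
    weights: Dict[str, float], num_evolutions: int, rng: random.Random
) -> List[str]:
    """Draw one evolution type per step, independently, according to the weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("evolution probabilities should sum to 1.0")
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


steps = sample_evolution_steps(evolution_weights, num_evolutions=4, rng=random.Random(0))
print(steps)  # four labels; the same type can appear more than once
```

Because each step is drawn independently, the same evolution type can be applied twice in a row, which is one reason high `num_evolutions` values can produce convoluted inputs.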
Evolution Types in Detail
Evolution.REASONING: Transforms inputs to require logical reasoning, multi-step deduction, or inference chains. The evolved input can't be answered by simple extraction -- the model must reason over the context.
Evolution.MULTICONTEXT: Modifies inputs so that answering requires synthesizing information from multiple context chunks. This tests whether the model can combine disparate pieces of information. Context-adherent: the evolved input stays grounded in the provided contexts.
Evolution.CONCRETIZING: Makes abstract or general inputs more specific. For example, "What are the benefits of exercise?" might become "What cardiovascular benefits do adults over 50 gain from 30 minutes of daily walking?" Context-adherent.
Evolution.CONSTRAINED: Adds constraints to the expected response format or content. For example, "Explain quantum computing" might become "Explain quantum computing in exactly three sentences, using no technical jargon." Context-adherent.
Evolution.COMPARATIVE: Transforms inputs to require comparing multiple entities, approaches, or options. For example, "What is supervised learning?" might become "Compare supervised and unsupervised learning in terms of data requirements and typical use cases." Context-adherent.
Evolution.HYPOTHETICAL: Adds hypothetical scenarios or counterfactual reasoning requirements. For example, "What causes inflation?" might become "If a central bank doubled the money supply overnight, what inflationary effects would follow?"
Evolution.IN_BREADTH: Broadens the input to explore related topics, testing whether the model has wider knowledge beyond the specific context provided.
StylingConfig
Styling controls the format and framing of generated inputs and expected outputs. This is essential when your application expects inputs in a specific format (e.g., SQL queries, customer support tickets, technical questions).
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_format | str | None | Describes the desired format for generated inputs |
| expected_output_format | str | None | Describes the desired format for expected outputs |
| task | str | None | Describes the purpose of the LLM application being evaluated |
| scenario | str | None | Describes the setting or context in which the application is used |
For multi-turn generation from scratch, use ConversationalStylingConfig instead:
from deepeval.synthesizer.config import ConversationalStylingConfig
conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Answering text-to-SQL-related queries through conversation",
    scenario_context="Non-technical users trying to query a database using plain English",
    participant_roles="A non-technical user asking database questions and an AI assistant responding",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| conversational_task | str | None | The overall purpose of the conversation |
| scenario_context | str | None | Environmental details and context for the conversation |
| participant_roles | str | None | Description of the interaction participants |
Method 1: Generate Goldens From Documents
This is the most automated method, designed for RAG systems with existing knowledge bases. It handles document parsing, chunking, embedding, context selection, and golden generation in a single call.
Prerequisites
pip install chromadb langchain-core langchain-community langchain-text-splitters
These dependencies handle document parsing (langchain-text-splitters for chunking, langchain-community for document loaders) and context management (chromadb for embedding storage and retrieval).
Single-Turn Generation
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf", "guide.docx"],
    include_expected_output=True,
    max_goldens_per_context=2,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_paths | list[str] | Required | File paths to source documents. Supported formats: .txt, .docx, .pdf, .md, .markdown, .mdx |
| include_expected_output | bool | True | Generate a reference expected_output for each golden |
| max_goldens_per_context | int | 2 | Maximum goldens generated per constructed context |
| context_construction_config | ContextConstructionConfig | Default values | Controls how contexts are built from documents |
The total maximum number of goldens produced is max_goldens_per_context * max_contexts_per_document * number_of_documents, not simply max_goldens_per_context.
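As a quick check on dataset size, the upper bound can be computed directly. `max_total_goldens` is an illustrative helper, not part of DeepEval, and the result is a ceiling rather than a guaranteed count.

```python
def max_total_goldens(
    max_goldens_per_context: int,
    max_contexts_per_document: int,
    num_documents: int,
) -> int:
    """Upper bound on goldens produced by one document-based generation call."""
    return max_goldens_per_context * max_contexts_per_document * num_documents


# Defaults from this guide (2 goldens per context, 3 contexts per document) with 3 documents.
print(max_total_goldens(2, 3, 3))  # 18
```

With the default settings and three documents, the call can therefore return at most 18 goldens; raising `max_contexts_per_document` to 5 and `max_goldens_per_context` to 3 lifts the ceiling to 45.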
Multi-Turn Generation
synthesizer.generate_conversational_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf"],
    include_expected_outcome=True,
    max_goldens_per_context=2,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_paths | list[str] | Required | File paths to source documents |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
| max_goldens_per_context | int | 2 | Maximum goldens per context |
| context_construction_config | ContextConstructionConfig | Default values | Controls context construction |
ContextConstructionConfig
Unlike other Synthesizer configurations (which are set at initialization), context construction is configured at generation time because it is specific to document-based generation.
from deepeval.synthesizer.config import ContextConstructionConfig
synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt"],
    context_construction_config=ContextConstructionConfig(
        embedder="text-embedding-3-small",
        chunk_size=1024,
        chunk_overlap=0,
        max_contexts_per_document=3,
        min_contexts_per_document=1,
        max_context_length=3,
        min_context_length=1,
        context_quality_threshold=0.5,
        context_similarity_threshold=0.5,
        max_retries=3,
        critic_model="gpt-4.1",
    ),
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | str or DeepEvalBaseEmbeddingModel | "text-embedding-3-small" | Embedding model for document parsing and context grouping |
| chunk_size | int | 1024 | Token size (not character size) of text chunks during document parsing |
| chunk_overlap | int | 0 | Token overlap between consecutive chunks |
| max_contexts_per_document | int | 3 | Maximum number of contexts extracted from each document |
| min_contexts_per_document | int | 1 | Minimum number of contexts extracted from each document |
| max_context_length | int | 3 | Maximum number of text chunks grouped into a single context |
| min_context_length | int | 1 | Minimum number of text chunks in a context |
| context_quality_threshold | float | 0.5 | Minimum quality score (0-1) for a context to be accepted |
| context_similarity_threshold | float | 0.5 | Minimum cosine similarity for context grouping |
| max_retries | int | 3 | Retry attempts for context selection and grouping failures |
| critic_model | str or DeepEvalBaseLLM | Synthesizer's model, else "gpt-4.1" | LLM used to evaluate context quality scores |
| encoding | str | Auto-detected | Text encoding for .txt, .md, .markdown, .mdx files |
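To build intuition for chunk_size and chunk_overlap, the following sketch slides a fixed-size window over a token list. It is an approximation under stated assumptions: whitespace splitting stands in for a real tokenizer, and `chunk_tokens` is an illustrative helper rather than the splitter DeepEval uses.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = "a b c d e f g h i j".split()  # whitespace split as a stand-in for a real tokenizer
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With ten tokens, a chunk size of 4, and an overlap of 1, the sketch yields three chunks, each sharing its first token with the end of the previous one. A short document with a large chunk_size yields very few chunks, which limits how many distinct contexts can be built from it.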
The Context Construction Pipeline
Document-based generation runs three sub-stages before the main generation pipeline:
1. Document Parsing: Documents are split into chunks using TokenTextSplitter at the token level (governed by chunk_size and chunk_overlap). Each chunk is embedded using the configured embedder and stored in a ChromaDB vector database. If chunk_size is too large relative to the document size, an error is raised because there aren't enough unique chunks to build max_contexts_per_document contexts.
2. Context Selection: Random nodes are sampled from the vector database and scored for quality (0-1) by the critic_model. Quality is assessed on four dimensions:
- Clarity: Is the information comprehensible?
- Depth: Does it contain sufficient detail and insight?
- Structure: Is it well-organized and logically coherent?
- Relevance: Is it topically focused?
Nodes scoring below context_quality_threshold are re-sampled up to max_retries times. If all retries fail, the highest-scoring node is used regardless.
3. Context Grouping: Selected nodes are grouped with up to max_context_length similar nodes using cosine similarity. Nodes with similarity below context_similarity_threshold are retried up to max_retries times, falling back to the highest-similarity match. This ensures each context group is thematically coherent, producing more focused and answerable goldens.
After context construction completes, the constructed contexts are passed to the same generation pipeline used by generate_goldens_from_contexts().
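The grouping step can be illustrated with a few toy vectors. This sketch makes several assumptions that the text above does not confirm: the seed chunk counts toward max_context_length, neighbours are taken in order of similarity, and the retry and fallback behaviour is left out. `group_context` is a hypothetical helper, not a DeepEval function.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_context(
    seed: str,
    embeddings: Dict[str, Sequence[float]],
    similarity_threshold: float = 0.5,
    max_context_length: int = 3,
) -> List[str]:
    """Return the seed chunk plus its most similar neighbours above the threshold."""
    scored = [
        (cosine_similarity(embeddings[seed], vec), name)
        for name, vec in embeddings.items()
        if name != seed
    ]
    scored.sort(reverse=True)  # most similar first
    neighbours = [name for sim, name in scored if sim >= similarity_threshold]
    return [seed] + neighbours[: max_context_length - 1]


# Toy 2-D embeddings; real embeddings have hundreds or thousands of dimensions.
toy_embeddings = {
    "orbit": (1.0, 0.0),
    "planets": (0.9, 0.1),
    "seasons": (0.7, 0.7),
    "freezing": (0.0, 1.0),
}
print(group_context("orbit", toy_embeddings))
```

The unrelated "freezing" chunk is orthogonal to the seed and is excluded, while raising the threshold to 0.8 would also drop "seasons". This is the mechanism that keeps a context group on one theme.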
Method 2: Generate Goldens From Contexts
Use this method when you already have prepared contexts -- for example, chunks stored in a vector database or manually curated context sets. This bypasses all document processing and context construction, feeding your contexts directly into the generation pipeline.
Single-Turn Generation
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_output=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| contexts | list[list[str]] | Required | List of contexts, where each context is a list of related text strings. Strings within each inner list should share a common theme |
| include_expected_output | bool | True | Generate a reference expected_output for each golden |
| max_goldens_per_context | int | 2 | Maximum goldens generated per context |
| source_files | list[str] or None | None | Optional source identifiers. If provided, length must match the contexts list length |
Multi-Turn Generation
conversational_goldens = synthesizer.generate_conversational_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_outcome=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| contexts | list[list[str]] | Required | List of contexts, each a list of related strings |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
| max_goldens_per_context | int | 2 | Maximum goldens per context |
| source_files | list[str] or None | None | Source identifiers, must match contexts length if provided |
Relationship to Document-Based Generation
The generate_goldens_from_docs() method calls generate_goldens_from_contexts() under the hood. The only difference is the additional context construction step that parses, chunks, and groups document content into the contexts format. If you have already processed your documents into context groups, using generate_goldens_from_contexts() directly is more efficient.
Method 3: Generate Goldens From Scratch
This method generates goldens without any documents or contexts. It is designed for applications that don't rely on RAG -- for example, chatbots, code generators, text-to-SQL systems, or creative writing assistants. Since there is no source material, the StylingConfig (or ConversationalStylingConfig for multi-turn) becomes essential to guide the generation.
Single-Turn Generation
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=25)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_goldens | int | Required | The number of synthetic goldens to generate |
Without a StylingConfig, the Synthesizer has no guidance on what kind of inputs to generate, making the config effectively mandatory for this method.
Multi-Turn Generation
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ConversationalStylingConfig
conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Helping users write and debug SQL queries through conversation",
    scenario_context="Non-technical users interacting with a database assistant",
    participant_roles="A user asking database questions and an AI SQL assistant",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
conversational_goldens = synthesizer.generate_conversational_goldens_from_scratch(
    num_goldens=25,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_goldens | int | Required | The number of conversational goldens to generate |
When to Use From-Scratch Generation
- No existing knowledge base: Your application generates responses from parametric knowledge or external APIs, not from documents
- Task-specific evaluation: You need test cases that match a specific input/output format (SQL queries, code, structured data)
- Bootstrapping: You are starting evaluation from zero and need an initial dataset to iterate on
- Coverage testing: You want to test edge cases and scenarios that your existing documents don't cover
Method 4: Generate Goldens From Goldens
This method augments an existing set of goldens by generating new variations. It is useful for expanding small evaluation datasets, increasing diversity, or creating more challenging versions of existing test cases.
Single-Turn Generation
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden
existing_goldens = [
    Golden(
        input="What is the capital of France?",
        expected_output="Paris",
        context=["Paris is the capital and most populous city of France."],
    ),
    Golden(
        input="What is photosynthesis?",
        expected_output="The process by which plants convert sunlight into energy.",
        context=["Photosynthesis is a process used by plants to convert light energy into chemical energy."],
    ),
]
synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=existing_goldens,
    max_goldens_per_golden=2,
    include_expected_output=True,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| goldens | list[Golden] | Required | Existing goldens to use as templates for generating new ones |
| max_goldens_per_golden | int | 2 | Maximum number of new goldens generated from each existing golden |
| include_expected_output | bool | True | Generate expected_output for each new golden |
Multi-Turn Generation
# existing_conversational_goldens is assumed to be a list of ConversationalGolden
# objects defined elsewhere; single-turn Golden objects cannot be used here.
new_conversational_goldens = synthesizer.generate_conversational_goldens_from_goldens(
    goldens=existing_conversational_goldens,
    max_goldens_per_golden=2,
    include_expected_outcome=True,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| goldens | list[ConversationalGolden] | Required | Existing conversational goldens to use as generation templates |
| max_goldens_per_golden | int | 2 | Maximum new goldens per existing golden |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
Critical Constraints
Context requirement for expected outputs: Generated goldens will contain expected_output only if your existing goldens contain context. When context is present, the Synthesizer uses it to ground the new goldens in factual content. Without context, the method falls back to from-scratch techniques based on the input patterns alone.
Single/multi-turn symmetry: You can only generate single-turn goldens from existing single-turn goldens, and conversational goldens from existing conversational goldens. You cannot mix the two.
StylingConfig recommendation: While the method can extract styling patterns from existing goldens, explicitly providing a StylingConfig produces more accurate and consistent results.
Saving and Exporting Generated Data
Push to Confident AI
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.push(alias="My Generated Dataset")
This uploads the dataset to Confident AI's platform for versioning, collaboration, and integration with DeepEval's evaluation pipeline.
Save Locally
synthesizer.save_as(
    file_type="json",        # "json" or "csv"
    directory="./synthetic_data",
    file_name="my_dataset",  # optional, without extension
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_type | str | Required | Output format: "json" or "csv" |
| directory | str | Required | Folder path for the saved file |
| file_name | str | None | Custom filename without extension. Auto-generated if omitted |
| quiet | bool | False | Suppress output messages when True |
Inspect as DataFrame
df = synthesizer.to_pandas()
print(df.columns.tolist())
The DataFrame includes these columns:
| Column | Description |
|---|---|
| input | The generated question or prompt |
| actual_output | Always None (populated by your application) |
| expected_output | Reference answer from the Synthesizer |
| context | Source knowledge base text |
| retrieval_context | Retrieved passages (populated during evaluation) |
| n_chunks_per_context | Number of text chunks in the context |
| context_length | Character length of the context |
| context_quality | Context quality score (0-1), from context construction |
| synthetic_input_quality | Input quality score (0-1), from filtration |
| evolutions | Sequence of evolution types applied |
| source_file | Original document source |
End-to-End Example
Here is a complete workflow that generates goldens from documents, configures all pipeline stages, and exports the results:
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
    ContextConstructionConfig,
)
from deepeval.dataset import EvaluationDataset
# Configure pipeline stages
filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5,
)
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 0.3,
        Evolution.MULTICONTEXT: 0.3,
        Evolution.CONCRETIZING: 0.2,
        Evolution.COMPARATIVE: 0.2,
    },
    num_evolutions=2,
)
styling_config = StylingConfig(
    input_format="Technical questions about machine learning concepts",
    expected_output_format="Detailed explanations with examples",
    task="Answering ML-related questions from a knowledge base",
    scenario="ML engineers looking up concepts in internal documentation",
)
# Initialize Synthesizer
synthesizer = Synthesizer(
    model="gpt-4.1",
    async_mode=True,
    max_concurrent=50,
    filtration_config=filtration_config,
    evolution_config=evolution_config,
    styling_config=styling_config,
    cost_tracking=True,
)
# Configure context construction
context_config = ContextConstructionConfig(
    chunk_size=512,
    chunk_overlap=50,
    max_contexts_per_document=5,
    max_context_length=3,
    context_quality_threshold=0.6,
    context_similarity_threshold=0.6,
)
# Generate goldens
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/architecture.md", "docs/api-reference.md", "docs/tutorials.pdf"],
    include_expected_output=True,
    max_goldens_per_context=3,
    context_construction_config=context_config,
)
# Inspect results
df = synthesizer.to_pandas()
print(f"Generated {len(goldens)} goldens")
print(f"Average input quality: {df['synthetic_input_quality'].mean():.3f}")
print(f"Average context quality: {df['context_quality'].mean():.3f}")
# Save locally and to Confident AI
synthesizer.save_as(file_type="json", directory="./eval_data", file_name="ml_docs_goldens")
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="ML Docs Evaluation Set v1")
Runtime Internals
The stage-by-stage description of the pipeline hides the mechanics that determine whether your goldens are usable.
Context construction and chunking
"From documents" first chunks and embeds the corpus, then groups related chunks into contexts -- a golden's quality is capped by the coherence of its context group. Bad chunking (splitting a definition from its example) produces goldens that are unanswerable, not hard. This is the same failure surface as RAG evaluation: garbage context, garbage verdict.
Evolution: controlled difficulty injection
Evolution rewrites a base input to be harder along chosen axes (multi-hop, reasoning, constraints). The runtime risk is over-evolution -- too many passes drift the input off its context so the "expected" answer is no longer derivable. Bounding evolution depth with num_evolutions is the key knob.
Filtration as a quality gate
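Filtration scores each candidate against a threshold and regenerates the ones that fall short, keeping the best attempt once retries run out. Set the threshold too high and you pay for retries that rarely pass; set it too low and weak inputs pollute the dataset. This is a precision/recall dial, not a boolean -- tune it against how the goldens perform on a known-weak model.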
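One way to see the dial is to count how many candidates would clear each threshold. The scores here are made-up numbers for illustration, and `pass_rate_by_threshold` is a hypothetical helper; in practice the scores could come from the synthetic_input_quality column of a previous run.

```python
from typing import Dict, List


def pass_rate_by_threshold(scores: List[float], thresholds: List[float]) -> Dict[float, float]:
    """Fraction of candidate inputs whose quality score meets each threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {t: sum(s >= t for s in scores) / len(scores) for t in thresholds}


# Made-up quality scores for ten candidate inputs.
example_scores = [0.15, 0.35, 0.45, 0.55, 0.6, 0.65, 0.7, 0.8, 0.85, 0.95]
for threshold, rate in pass_rate_by_threshold(example_scores, [0.3, 0.5, 0.7, 0.9]).items():
    print(f"threshold={threshold:.1f} pass_rate={rate:.0%}")
```

On this toy data the pass rate falls from 90% at a threshold of 0.3 to 10% at 0.9, so each step up the dial trades more retries for a cleaner dataset.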
Cost and concurrency
Every generate/evolve/filter step is an LLM call, so a 1k-golden run is thousands of calls. The runtime concerns are the familiar LLM serving ones -- concurrency limits, retry/backoff, and caching context embeddings so a re-run does not re-pay the embedding cost.
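A concurrency cap of the kind max_concurrent describes is commonly implemented with a semaphore. The sketch below shows that general pattern using only asyncio; it is not taken from DeepEval's source, and `fake_llm_call` is a placeholder that sleeps instead of calling a model.

```python
import asyncio
from typing import List, Tuple


async def run_with_limit(prompts: List[str], max_concurrent: int) -> Tuple[List[str], int]:
    """Run one fake LLM call per prompt with at most max_concurrent in flight.

    Returns the results (in input order) and the peak number of simultaneous calls.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_llm_call(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once max_concurrent calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for network latency
            in_flight -= 1
            return prompt.upper()

    results = await asyncio.gather(*(fake_llm_call(p) for p in prompts))
    return list(results), peak


outputs, peak_in_flight = asyncio.run(
    run_with_limit([f"prompt {i}" for i in range(10)], max_concurrent=3)
)
print(len(outputs), peak_in_flight)
```

Lowering the cap trades throughput for fewer rate-limit errors from the model provider, which is usually the first thing to adjust when a large run starts failing.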
Choosing the Right Generation Method
| Scenario | Method | Why |
|---|---|---|
| RAG system with document corpus | generate_goldens_from_docs | Automated end-to-end pipeline from raw documents to goldens |
| Pre-chunked data in vector DB | generate_goldens_from_contexts | Skip document processing and pass your existing text chunks directly |
| Non-RAG application (chatbot, code gen) | generate_goldens_from_scratch | No source material needed, guided by task description |
| Small existing eval dataset | generate_goldens_from_goldens | Augment and diversify existing test cases |
| Mixed: some docs + some manual cases | Combine methods | Use from_docs for document coverage, from_goldens to augment edge cases |
Practical Considerations
Manual inspection is essential: Synthetic data generation is not a fire-and-forget process. Always review a sample of generated goldens before using them for evaluation. Common issues include:
- Inputs that are too vague or too specific to be realistic
- Expected outputs that contradict the source context
- Evolution producing unnaturally complex or convoluted inputs
- Context grouping that combines unrelated topics
Cost management: Each golden requires multiple LLM calls (generation, filtration scoring, evolution, styling). With cost_tracking=True, monitor spend. Reduce costs by lowering max_quality_retries, using fewer evolution steps, or using a cheaper model for the critic_model while keeping a stronger model for generation.
OpenAI API key: The default embedder (text-embedding-3-small) and model (gpt-4.1) require an OPENAI_API_KEY. For non-OpenAI setups, provide a custom DeepEvalBaseLLM for the model and a custom DeepEvalBaseEmbeddingModel for the embedder.
Scaling: For large document corpora, increase max_concurrent and enable async_mode. For very large datasets (thousands of goldens), consider generating in batches and merging the results to avoid timeout issues.
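Batching can be as simple as slicing the list of document paths and merging the results. In this sketch, `make_batches` is a hypothetical helper and the file names are invented; the generation call is replaced by a placeholder so the example runs without DeepEval or an API key.

```python
from typing import Iterator, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


document_paths = [f"docs/part_{i}.md" for i in range(7)]  # hypothetical file names
all_goldens: List[str] = []
for batch in make_batches(document_paths, batch_size=3):
    # In a real run this line would be replaced by a call such as
    # synthesizer.generate_goldens_from_docs(document_paths=batch); a placeholder
    # string is recorded here instead.
    all_goldens.extend(f"golden from {path}" for path in batch)

print(len(all_goldens))  # 7
```

Saving each batch before starting the next also means a failure late in the run does not discard the goldens already generated.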
Reproducibility: The Synthesizer uses LLM-based generation, which is inherently non-deterministic. Running the same configuration twice will produce different goldens. For reproducible evaluation datasets, generate once, inspect, curate, and version the results rather than regenerating each time.