Building robust evaluation datasets is one of the most time-consuming bottlenecks in LLM application development. DeepEval's Synthesizer addresses this by providing a systematic pipeline for generating high-quality synthetic test data -- called "Goldens" -- that can be used to evaluate LLM systems without manually authoring hundreds of test cases. The Synthesizer supports four distinct generation strategies, each suited to different data availability scenarios: generating from documents, from pre-built contexts, from scratch without any knowledge base, and from existing goldens. Each strategy produces both single-turn and multi-turn (conversational) goldens, and the entire pipeline is governed by configurable filtration, evolution, and styling stages that control the quality, complexity, and format of the generated data.
A Golden is a single test case in DeepEval's evaluation framework. Each Golden represents a structured data point with fields that correspond to inputs, expected outputs, and supporting context. The Synthesizer generates these programmatically so you can build evaluation datasets at scale.
For single-turn goldens, a Golden object contains:
- input: The generated question or prompt
- actual_output: Always None after generation -- this field is populated later when you run the input through your LLM application
- expected_output: A reference answer generated by the Synthesizer (when include_expected_output=True)
- context: The knowledge base source text used to ground the golden
- retrieval_context: Retrieved passages (populated during evaluation)
- source_file: The original document the context was extracted from

For multi-turn goldens, a ConversationalGolden object follows the same structure but represents a full conversation with multiple turns, and uses expected_outcome instead of expected_output.
A critical distinction: the Synthesizer does not generate actual_output values. These come from running your LLM application against the generated inputs. Similarly, multi-turn generation does not create individual turns -- use DeepEval's ConversationSimulator for that purpose.
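For instance, a freshly generated single-turn golden might look like this (values are illustrative):

```python
from deepeval.dataset import Golden

golden = Golden(
    input="How long does the Earth take to orbit the Sun?",
    expected_output="Approximately 365.25 days.",
    context=["The Earth revolves around the Sun in approximately 365.25 days."],
    source_file="astronomy.txt",
)
print(golden.actual_output)  # None until you run your LLM application on golden.input
```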
Regardless of which generation method you use, the Synthesizer follows a four-stage pipeline:
The Synthesizer creates synthetic input values -- questions, prompts, or queries -- based on the source material you provide. The mechanism varies by method:
Generated inputs are scored on quality (0 to 1) based on two criteria: self-containment and clarity (both defined under FiltrationConfig below).
Inputs scoring below the synthetic_input_quality_threshold (default 0.5) are regenerated up to max_quality_retries (default 3) times. If all retries fail, the highest-scoring attempt is used. This is controlled via FiltrationConfig.
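Conceptually, the retry loop works like this (an illustrative sketch, not DeepEval's actual internals):

```python
def filter_input(generate, score, threshold=0.5, max_retries=3):
    """Regenerate until an input passes the quality bar, keeping the best attempt."""
    best_input, best_score = None, -1.0
    for _ in range(max_retries + 1):  # initial attempt plus retries
        candidate = generate()        # LLM produces a synthetic input
        quality = score(candidate)    # critic model scores it from 0 to 1
        if quality >= threshold:
            return candidate
        if quality > best_score:
            best_input, best_score = candidate, quality
    return best_input                 # fall back to the highest-scoring attempt
```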
Inputs are evolved through complexity-increasing transformations inspired by the Evol-Instruct methodology (WizardLM). Each input can undergo one or more evolution steps, producing more challenging and realistic test cases. Seven evolution types are available:
| Evolution Type | Description | Context-Adherent |
|---|---|---|
| Evolution.REASONING | Adds logical reasoning requirements | No |
| Evolution.MULTICONTEXT | Requires synthesizing information from multiple context sources | Yes |
| Evolution.CONCRETIZING | Makes inputs more specific and concrete | Yes |
| Evolution.CONSTRAINED | Adds constraints to the expected response | Yes |
| Evolution.COMPARATIVE | Requires comparing multiple options or entities | Yes |
| Evolution.HYPOTHETICAL | Introduces hypothetical scenarios | No |
| Evolution.IN_BREADTH | Explores broader related topics | No |
Context-adherent evolutions use the provided context to guide the transformation, ensuring the evolved input remains answerable from the source material.
The final stage formats inputs and expected outputs according to your specifications. This is controlled via StylingConfig and allows you to ensure generated data matches the format your application expects.
The Synthesizer constructor accepts several optional parameters that control global behavior across all generation methods:
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import (
FiltrationConfig,
EvolutionConfig,
StylingConfig,
)
synthesizer = Synthesizer(
model="gpt-4.1", # LLM for generation
async_mode=True, # concurrent generation
max_concurrent=100, # max parallel operations
filtration_config=FiltrationConfig(), # quality filtering
evolution_config=EvolutionConfig(), # complexity evolution
styling_config=StylingConfig(), # output formatting
cost_tracking=True, # print LLM costs
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str or DeepEvalBaseLLM | "gpt-4.1" | The LLM used for all generation steps. Can be an OpenAI model string or a custom model implementing DeepEvalBaseLLM |
| async_mode | bool | True | Enables concurrent golden generation for faster throughput |
| max_concurrent | int | 100 | Maximum number of parallel generation operations when async_mode=True |
| filtration_config | FiltrationConfig | Default values | Controls quality filtering of generated inputs |
| evolution_config | EvolutionConfig | Default values | Controls complexity evolution of inputs |
| styling_config | StylingConfig | Default values | Controls output formatting and style |
| cost_tracking | bool | False | When enabled, prints LLM API costs during generation |
Filtration ensures generated inputs meet a minimum quality bar. Inputs that score below the threshold are regenerated.
from deepeval.synthesizer.config import FiltrationConfig
filtration_config = FiltrationConfig(
critic_model="gpt-4.1",
synthetic_input_quality_threshold=0.5,
max_quality_retries=3,
)
synthesizer = Synthesizer(filtration_config=filtration_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| critic_model | str or DeepEvalBaseLLM | Synthesizer's model, else "gpt-4.1" | The LLM used to evaluate input quality. Can differ from the generation model |
| synthetic_input_quality_threshold | float | 0.5 | Minimum quality score (0-1) an input must achieve. Higher values produce better inputs but may increase generation time and cost |
| max_quality_retries | int | 3 | Maximum regeneration attempts for inputs below the threshold. After exhausting retries, the highest-scoring attempt is used |
Quality scores are computed based on self-containment (the input is understandable without external references) and clarity (the input has unambiguous intent). These scores are available in the output DataFrame via the synthetic_input_quality column.
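You can use these scores to inspect or further filter the results after generation:

```python
df = synthesizer.to_pandas()

# Keep only the strongest inputs for a high-precision evaluation set
high_quality = df[df["synthetic_input_quality"] >= 0.8]
print(high_quality[["input", "synthetic_input_quality"]].head())
```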
Evolution transforms simple inputs into more complex, realistic test cases. You control which evolution types to apply and their relative sampling probabilities.
from deepeval.synthesizer.config import EvolutionConfig
from deepeval.synthesizer import Evolution
evolution_config = EvolutionConfig(
evolutions={
Evolution.REASONING: 1/4,
Evolution.MULTICONTEXT: 1/4,
Evolution.CONCRETIZING: 1/4,
Evolution.CONSTRAINED: 1/4,
},
num_evolutions=4,
)
synthesizer = Synthesizer(evolution_config=evolution_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| evolutions | dict[Evolution, float] | Equal distribution | Maps evolution types to their sampling probabilities. Probabilities should sum to 1.0 |
| num_evolutions | int | 1 | Number of sequential evolution steps applied to each input. Higher values produce more complex inputs but increase cost and may reduce naturalness |
When num_evolutions > 1, each step samples an evolution type independently based on the configured probabilities. An input might first be concretized, then have reasoning added, producing a multi-layered complexity increase.
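Conceptually, each step is an independent weighted draw over the configured types (an illustrative sketch, not DeepEval's actual internals):

```python
import random

evolutions = {"REASONING": 0.25, "MULTICONTEXT": 0.25, "CONCRETIZING": 0.25, "CONSTRAINED": 0.25}
num_evolutions = 2

# Each evolution step samples a type independently, weighted by its probability
steps = random.choices(list(evolutions), weights=list(evolutions.values()), k=num_evolutions)
print(steps)  # e.g. ['CONCRETIZING', 'REASONING']
```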
Evolution.REASONING: Transforms inputs to require logical reasoning, multi-step deduction, or inference chains. The evolved input can't be answered by simple extraction -- the model must reason over the context.
Evolution.MULTICONTEXT: Modifies inputs so that answering requires synthesizing information from multiple context chunks. This tests whether the model can combine disparate pieces of information. Context-adherent: the evolved input stays grounded in the provided contexts.
Evolution.CONCRETIZING: Makes abstract or general inputs more specific. For example, "What are the benefits of exercise?" might become "What cardiovascular benefits do adults over 50 gain from 30 minutes of daily walking?" Context-adherent.
Evolution.CONSTRAINED: Adds constraints to the expected response format or content. For example, "Explain quantum computing" might become "Explain quantum computing in exactly three sentences, using no technical jargon." Context-adherent.
Evolution.COMPARATIVE: Transforms inputs to require comparing multiple entities, approaches, or options. For example, "What is supervised learning?" might become "Compare supervised and unsupervised learning in terms of data requirements and typical use cases." Context-adherent.
Evolution.HYPOTHETICAL: Adds hypothetical scenarios or counterfactual reasoning requirements. For example, "What causes inflation?" might become "If a central bank doubled the money supply overnight, what inflationary effects would follow?"
Evolution.IN_BREADTH: Broadens the input to explore related topics, testing whether the model has wider knowledge beyond the specific context provided.
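If you want a single kind of transformation, assign it the entire probability mass:

```python
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import EvolutionConfig

# Every evolution step will apply the COMPARATIVE transformation
synthesizer = Synthesizer(
    evolution_config=EvolutionConfig(evolutions={Evolution.COMPARATIVE: 1.0}, num_evolutions=1)
)
```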
Styling controls the format and framing of generated inputs and expected outputs. This is essential when your application expects inputs in a specific format (e.g., SQL queries, customer support tickets, technical questions).
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
input_format="Questions in English that ask for data in a database",
expected_output_format="SQL query based on the given input",
task="Answering text-to-SQL-related queries by querying a database and returning results to users",
scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_format | str | None | Describes the desired format for generated inputs |
| expected_output_format | str | None | Describes the desired format for expected outputs |
| task | str | None | Describes the purpose of the LLM application being evaluated |
| scenario | str | None | Describes the setting or context in which the application is used |
For multi-turn generation from scratch, use ConversationalStylingConfig instead:
from deepeval.synthesizer.config import ConversationalStylingConfig
conversational_styling_config = ConversationalStylingConfig(
conversational_task="Answering text-to-SQL-related queries through conversation",
scenario_context="Non-technical users trying to query a database using plain English",
participant_roles="A non-technical user asking database questions and an AI assistant responding",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
| Parameter | Type | Default | Description |
|---|---|---|---|
| conversational_task | str | None | The overall purpose of the conversation |
| scenario_context | str | None | Environmental details and context for the conversation |
| participant_roles | str | None | Description of the interaction participants |
This is the most automated method, designed for RAG systems with existing knowledge bases. It handles document parsing, chunking, embedding, context selection, and golden generation in a single call.
pip install chromadb langchain-core langchain-community langchain-text-splitters
These dependencies handle document parsing (langchain-text-splitters for chunking, langchain-community for document loaders) and context management (chromadb for embedding storage and retrieval).
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
document_paths=["knowledge_base.txt", "faq.pdf", "guide.docx"],
include_expected_output=True,
max_goldens_per_context=2,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_paths | list[str] | Required | File paths to source documents. Supported formats: .txt, .docx, .pdf, .md, .markdown, .mdx |
| include_expected_output | bool | True | Generate a reference expected_output for each golden |
| max_goldens_per_context | int | 2 | Maximum goldens generated per constructed context |
| context_construction_config | ContextConstructionConfig | Default values | Controls how contexts are built from documents |
The total maximum number of goldens produced is max_goldens_per_context * max_contexts_per_document * number_of_documents, not simply max_goldens_per_context. With the defaults (2 goldens per context, 3 contexts per document), three documents can yield up to 2 * 3 * 3 = 18 goldens.
synthesizer.generate_conversational_goldens_from_docs(
document_paths=["knowledge_base.txt", "faq.pdf"],
include_expected_outcome=True,
max_goldens_per_context=2,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_paths | list[str] | Required | File paths to source documents |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
| max_goldens_per_context | int | 2 | Maximum goldens per context |
| context_construction_config | ContextConstructionConfig | Default values | Controls context construction |
Unlike other Synthesizer configurations (which are set at initialization), context construction is configured at generation time because it is specific to document-based generation.
from deepeval.synthesizer.config import ContextConstructionConfig
synthesizer.generate_goldens_from_docs(
document_paths=["knowledge_base.txt"],
context_construction_config=ContextConstructionConfig(
embedder="text-embedding-3-small",
chunk_size=1024,
chunk_overlap=0,
max_contexts_per_document=3,
min_contexts_per_document=1,
max_context_length=3,
min_context_length=1,
context_quality_threshold=0.5,
context_similarity_threshold=0.5,
max_retries=3,
critic_model="gpt-4.1",
),
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| embedder | str or DeepEvalBaseEmbeddingModel | "text-embedding-3-small" | Embedding model for document parsing and context grouping |
| chunk_size | int | 1024 | Token size (not character size) of text chunks during document parsing |
| chunk_overlap | int | 0 | Token overlap between consecutive chunks |
| max_contexts_per_document | int | 3 | Maximum number of contexts extracted from each document |
| min_contexts_per_document | int | 1 | Minimum number of contexts extracted from each document |
| max_context_length | int | 3 | Maximum number of text chunks grouped into a single context |
| min_context_length | int | 1 | Minimum number of text chunks in a context |
| context_quality_threshold | float | 0.5 | Minimum quality score (0-1) for a context to be accepted |
| context_similarity_threshold | float | 0.5 | Minimum cosine similarity for context grouping |
| max_retries | int | 3 | Retry attempts for context selection and grouping failures |
| critic_model | str or DeepEvalBaseLLM | Synthesizer's model, else "gpt-4.1" | LLM used to evaluate context quality scores |
| encoding | str | Auto-detected | Text encoding for .txt, .md, .markdown, .mdx files |
Document-based generation runs three sub-stages before the main generation pipeline:
1. Document Parsing: Documents are split into chunks using TokenTextSplitter at the token level (governed by chunk_size and chunk_overlap). Each chunk is embedded using the configured embedder and stored in a ChromaDB vector database. If chunk_size is too large relative to the document size, an error is raised because there aren't enough unique chunks to build max_contexts_per_document contexts.
2. Context Selection: Random nodes are sampled from the vector database and scored for quality (0-1) by the critic_model, which assesses each node on four quality dimensions.
Nodes scoring below context_quality_threshold are re-sampled up to max_retries times. If all retries fail, the highest-scoring node is used regardless.
3. Context Grouping: Selected nodes are grouped with up to max_context_length similar nodes using cosine similarity. Nodes with similarity below context_similarity_threshold are retried up to max_retries times, falling back to the highest-similarity match. This ensures each context group is thematically coherent, producing more focused and answerable goldens.
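The grouping criterion itself is standard cosine similarity over chunk embeddings; a minimal illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A chunk only joins a context group if its similarity to the selected node
# meets context_similarity_threshold (default 0.5)
emb_a, emb_b = np.random.rand(1536), np.random.rand(1536)
print(cosine_similarity(emb_a, emb_b) >= 0.5)
```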
After context construction completes, the constructed contexts are passed to the same generation pipeline used by generate_goldens_from_contexts().
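Because parsing raises an error when chunk_size is too large for a document (stage 1 above), a quick pre-flight chunk count can save a failed generation run. A sketch using the same langchain-text-splitters dependency (the file path is a placeholder, and tiktoken must be installed):

```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
with open("knowledge_base.txt") as f:  # placeholder path
    chunks = splitter.split_text(f.read())

# You need enough distinct chunks to build max_contexts_per_document contexts
print(f"{len(chunks)} chunks at chunk_size=1024")
```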
Use this method when you already have prepared contexts -- for example, chunks stored in a vector database or manually curated context sets. This bypasses all document processing and context construction, feeding your contexts directly into the generation pipeline.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
contexts=[
[
"The Earth revolves around the Sun in approximately 365.25 days.",
"Planets are celestial bodies that orbit stars.",
],
[
"Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
"The chemical formula for water is H2O.",
],
],
include_expected_output=True,
max_goldens_per_context=2,
source_files=["astronomy.txt", "chemistry.txt"],
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| contexts | list[list[str]] | Required | List of contexts, where each context is a list of related text strings. Strings within each inner list should share a common theme |
| include_expected_output | bool | True | Generate a reference expected_output for each golden |
| max_goldens_per_context | int | 2 | Maximum goldens generated per context |
| source_files | list[str] or None | None | Optional source identifiers. If provided, length must match the contexts list length |
conversational_goldens = synthesizer.generate_conversational_goldens_from_contexts(
contexts=[
[
"The Earth revolves around the Sun in approximately 365.25 days.",
"Planets are celestial bodies that orbit stars.",
],
[
"Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
"The chemical formula for water is H2O.",
],
],
include_expected_outcome=True,
max_goldens_per_context=2,
source_files=["astronomy.txt", "chemistry.txt"],
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| contexts | list[list[str]] | Required | List of contexts, each a list of related strings |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
| max_goldens_per_context | int | 2 | Maximum goldens per context |
| source_files | list[str] or None | None | Source identifiers, must match contexts length if provided |
The generate_goldens_from_docs() method calls generate_goldens_from_contexts() under the hood. The only difference is the additional context construction step that parses, chunks, and groups document content into the contexts format. If you have already processed your documents into context groups, using generate_goldens_from_contexts() directly is more efficient.
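For example, if your chunks already live in a vector store, you can hand them to the Synthesizer directly (a sketch; the collection name and the naive consecutive grouping are placeholder assumptions, not DeepEval API):

```python
import chromadb
from deepeval.synthesizer import Synthesizer

client = chromadb.PersistentClient(path="./chroma")    # placeholder path
collection = client.get_collection("knowledge_base")   # placeholder collection name
docs = collection.get()["documents"]

# Naive grouping: three consecutive chunks per context; ideally group thematically
contexts = [docs[i:i + 3] for i in range(0, len(docs), 3)]

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(contexts=contexts)
```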
This method generates goldens without any documents or contexts. It is designed for applications that don't rely on RAG -- for example, chatbots, code generators, text-to-SQL systems, or creative writing assistants. Since there is no source material, the StylingConfig (or ConversationalStylingConfig for multi-turn) becomes essential to guide the generation.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
input_format="Questions in English that ask for data in a database",
expected_output_format="SQL query based on the given input",
task="Answering text-to-SQL-related queries by querying a database and returning results to users",
scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=25)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_goldens | int | Required | The number of synthetic goldens to generate |
Without a StylingConfig, the Synthesizer has no guidance on what kind of inputs to generate, making the config effectively mandatory for this method.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ConversationalStylingConfig
conversational_styling_config = ConversationalStylingConfig(
conversational_task="Helping users write and debug SQL queries through conversation",
scenario_context="Non-technical users interacting with a database assistant",
participant_roles="A user asking database questions and an AI SQL assistant",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
conversational_goldens = synthesizer.generate_conversational_goldens_from_scratch(
num_goldens=25,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_goldens | int | Required | The number of conversational goldens to generate |
This method augments an existing set of goldens by generating new variations. It is useful for expanding small evaluation datasets, increasing diversity, or creating more challenging versions of existing test cases.
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden
existing_goldens = [
Golden(
input="What is the capital of France?",
expected_output="Paris",
context=["Paris is the capital and most populous city of France."],
),
Golden(
input="What is photosynthesis?",
expected_output="The process by which plants convert sunlight into energy.",
context=["Photosynthesis is a process used by plants to convert light energy into chemical energy."],
),
]
synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(
goldens=existing_goldens,
max_goldens_per_golden=2,
include_expected_output=True,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| goldens | list[Golden] | Required | Existing goldens to use as templates for generating new ones |
| max_goldens_per_golden | int | 2 | Maximum number of new goldens generated from each existing golden |
| include_expected_output | bool | True | Generate expected_output for each new golden |
new_conversational_goldens = synthesizer.generate_conversational_goldens_from_goldens(
    goldens=existing_conversational_goldens,  # must be ConversationalGolden objects -- see the symmetry note below
    max_goldens_per_golden=2,
    include_expected_outcome=True,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| goldens | list[ConversationalGolden] | Required | Existing conversational goldens as generation templates |
| max_goldens_per_golden | int | 2 | Maximum new goldens per existing golden |
| include_expected_outcome | bool | True | Generate expected_outcome for each ConversationalGolden |
Context requirement for expected outputs: Generated goldens will contain expected_output only if your existing goldens contain context. When context is present, the Synthesizer uses it to ground the new goldens in factual content. Without context, the method falls back to from-scratch techniques based on the input patterns alone.
Single/multi-turn symmetry: You can only generate single-turn goldens from existing single-turn goldens, and conversational goldens from existing conversational goldens. You cannot mix the two.
StylingConfig recommendation: While the method can extract styling patterns from existing goldens, explicitly providing a StylingConfig produces more accurate and consistent results.
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.push(alias="My Generated Dataset")
This uploads the dataset to Confident AI's platform for versioning, collaboration, and integration with DeepEval's evaluation pipeline.
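Anyone on your team can later retrieve the dataset by its alias:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Generated Dataset")
print(len(dataset.goldens))
```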
synthesizer.save_as(
file_type="json", # "json" or "csv"
directory="./synthetic_data",
file_name="my_dataset", # optional, without extension
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_type | str | Required | Output format: "json" or "csv" |
| directory | str | Required | Folder path for the saved file |
| file_name | str | None | Custom filename without extension. Auto-generated if omitted |
| quiet | bool | False | Suppress output messages when True |
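The saved JSON is plain data, so the standard library suffices for a quick inspection (assuming the file name from the example above):

```python
import json

with open("./synthetic_data/my_dataset.json") as f:
    data = json.load(f)
print(f"{len(data)} goldens saved")  # assumes a top-level list of goldens
```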
df = synthesizer.to_pandas()
print(df.columns.tolist())
The DataFrame includes these columns:
| Column | Description |
|---|---|
| input | The generated question or prompt |
| actual_output | Always None (populated by your application) |
| expected_output | Reference answer from the Synthesizer |
| context | Source knowledge base text |
| retrieval_context | Retrieved passages (populated during evaluation) |
| n_chunks_per_context | Number of text chunks in the context |
| context_length | Character length of the context |
| context_quality | Context quality score (0-1), from context construction |
| synthetic_input_quality | Input quality score (0-1), from filtration |
| evolutions | Sequence of evolution types applied |
| source_file | Original document source |
Here is a complete workflow that generates goldens from documents, configures all pipeline stages, and exports the results:
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import (
FiltrationConfig,
EvolutionConfig,
StylingConfig,
ContextConstructionConfig,
)
from deepeval.dataset import EvaluationDataset
# Configure pipeline stages
filtration_config = FiltrationConfig(
critic_model="gpt-4.1",
synthetic_input_quality_threshold=0.7,
max_quality_retries=5,
)
evolution_config = EvolutionConfig(
evolutions={
Evolution.REASONING: 0.3,
Evolution.MULTICONTEXT: 0.3,
Evolution.CONCRETIZING: 0.2,
Evolution.COMPARATIVE: 0.2,
},
num_evolutions=2,
)
styling_config = StylingConfig(
input_format="Technical questions about machine learning concepts",
expected_output_format="Detailed explanations with examples",
task="Answering ML-related questions from a knowledge base",
scenario="ML engineers looking up concepts in internal documentation",
)
# Initialize Synthesizer
synthesizer = Synthesizer(
model="gpt-4.1",
async_mode=True,
max_concurrent=50,
filtration_config=filtration_config,
evolution_config=evolution_config,
styling_config=styling_config,
cost_tracking=True,
)
# Configure context construction
context_config = ContextConstructionConfig(
chunk_size=512,
chunk_overlap=50,
max_contexts_per_document=5,
max_context_length=3,
context_quality_threshold=0.6,
context_similarity_threshold=0.6,
)
# Generate goldens
goldens = synthesizer.generate_goldens_from_docs(
document_paths=["docs/architecture.md", "docs/api-reference.md", "docs/tutorials.pdf"],
include_expected_output=True,
max_goldens_per_context=3,
context_construction_config=context_config,
)
# Inspect results
df = synthesizer.to_pandas()
print(f"Generated {len(goldens)} goldens")
print(f"Average input quality: {df['synthetic_input_quality'].mean():.3f}")
print(f"Average context quality: {df['context_quality'].mean():.3f}")
# Save locally and to Confident AI
synthesizer.save_as(file_type="json", directory="./eval_data", file_name="ml_docs_goldens")
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="ML Docs Evaluation Set v1")
| Scenario | Method | Why |
|---|---|---|
| RAG system with document corpus | generate_goldens_from_docs | Automated end-to-end pipeline from raw documents to goldens |
| Pre-chunked data in vector DB | generate_goldens_from_contexts | Skip document processing, use existing embeddings directly |
| Non-RAG application (chatbot, code gen) | generate_goldens_from_scratch | No source material needed, guided by task description |
| Small existing eval dataset | generate_goldens_from_goldens | Augment and diversify existing test cases |
| Mixed: some docs + some manual cases | Combine methods | Use from_docs for document coverage, from_goldens to augment edge cases |
Manual inspection is essential: Synthetic data generation is not a fire-and-forget process. Always review a sample of generated goldens before using them for evaluation. Common issues include inputs that are not answerable from their context and inputs that read unnaturally after multiple evolution steps.
Cost management: Each golden requires multiple LLM calls (generation, filtration scoring, evolution, styling). With cost_tracking=True, monitor spend. Reduce costs by lowering max_quality_retries, using fewer evolution steps, or using a cheaper model for the critic_model while keeping a stronger model for generation.
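For instance, a cheaper critic with fewer retries (model names here are illustrative):

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig

synthesizer = Synthesizer(
    model="gpt-4.1",  # stronger model for generation
    filtration_config=FiltrationConfig(
        critic_model="gpt-4.1-mini",  # cheaper model for quality scoring
        max_quality_retries=2,        # fewer regeneration attempts
    ),
    cost_tracking=True,
)
```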
OpenAI API key: The default embedder (text-embedding-3-small) and model (gpt-4.1) require an OPENAI_API_KEY. For non-OpenAI setups, provide a custom DeepEvalBaseLLM for the model and a custom DeepEvalBaseEmbeddingModel for the embedder.
Scaling: For large document corpora, increase max_concurrent and enable async_mode. For very large datasets (thousands of goldens), consider generating in batches and merging the results to avoid timeout issues.
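A simple batching pattern (document_paths is a placeholder list of file paths):

```python
all_goldens = []
batch_size = 10
for i in range(0, len(document_paths), batch_size):
    batch = document_paths[i:i + batch_size]
    all_goldens.extend(
        synthesizer.generate_goldens_from_docs(document_paths=batch)
    )
```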
Reproducibility: The Synthesizer uses LLM-based generation, which is inherently non-deterministic. Running the same configuration twice will produce different goldens. For reproducible evaluation datasets, generate once, inspect, curate, and version the results rather than regenerating each time.