DeepEval Synthesizer: Synthetic Golden Generation for LLM Evaluation

Building robust evaluation datasets is one of the most time-consuming bottlenecks in LLM application development. DeepEval's Synthesizer addresses this with a systematic pipeline for generating high-quality synthetic test data -- called "Goldens" -- so you can evaluate LLM systems without manually authoring hundreds of test cases. The Synthesizer supports four generation strategies, each suited to a different data availability scenario: generating from documents, from pre-built contexts, from scratch without any knowledge base, and from existing goldens. Each strategy can produce either single-turn or multi-turn (conversational) goldens, and the pipeline is governed by configurable filtration, evolution, and styling stages that control the quality, complexity, and format of the generated data.

What Are Goldens?

A Golden is a single test case in DeepEval's evaluation framework. Each Golden represents a structured data point with fields that correspond to inputs, expected outputs, and supporting context. The Synthesizer generates these programmatically so you can build evaluation datasets at scale.

For single-turn goldens, a Golden object contains:

input: The generated question or prompt
actual_output: Always None after generation -- this field is populated later when you run the input through your LLM application
expected_output: A reference answer generated by the Synthesizer (when include_expected_output=True)
context: The knowledge base source text used to ground the golden
retrieval_context: Retrieved passages (populated during evaluation)
source_file: The original document the context was extracted from

For multi-turn goldens, a ConversationalGolden object plays the same role but describes a whole conversation rather than a single exchange, and it uses expected_outcome instead of expected_output.

A critical distinction: the Synthesizer does not generate actual_output values. These come from running your LLM application against the generated inputs. Similarly, multi-turn generation does not create individual turns -- use DeepEval's ConversationSimulator for that purpose.
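To make the division of labor concrete, here is a minimal sketch of filling in actual_output after generation. GoldenStub is a stand-in dataclass written for this example (it is not DeepEval's Golden class), and my_llm_app is a hypothetical placeholder for your own application.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GoldenStub:
    """Stand-in for a generated golden; NOT DeepEval's Golden class."""
    input: str
    expected_output: Optional[str] = None
    actual_output: Optional[str] = None  # None right after generation


def fill_actual_outputs(
    goldens: List[GoldenStub], app: Callable[[str], str]
) -> List[GoldenStub]:
    """Run each generated input through the application under test."""
    for golden in goldens:
        golden.actual_output = app(golden.input)
    return goldens


def my_llm_app(prompt: str) -> str:
    """Hypothetical application; a real one would call your LLM pipeline."""
    return f"(answer to: {prompt})"


goldens = [GoldenStub(input="What is the capital of France?", expected_output="Paris")]
fill_actual_outputs(goldens, my_llm_app)
print(goldens[0].actual_output)
```

The same loop shape applies whatever your application looks like: the Synthesizer supplies input and (optionally) expected_output, and your code supplies actual_output.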

The Generation Pipeline

Regardless of which generation method you use, the Synthesizer follows a four-stage pipeline:

Stage 1: Input Generation

The Synthesizer creates synthetic input values -- questions, prompts, or queries -- based on the source material you provide. The mechanism varies by method:

From docs: Documents are parsed, chunked, embedded, and stored in a vector database. Contexts are selected and grouped, then used to generate inputs.
From contexts: Pre-prepared context lists are used directly, skipping document processing.
From scratch: Inputs are generated purely from a task description and scenario, with no source documents or contexts.
From goldens: Existing goldens serve as templates for generating new, varied inputs.

Stage 2: Filtration

Generated inputs are scored on quality (0 to 1) based on two criteria:

Self-containment: Can the input be understood without external references?
Clarity: Does the input have clear intent without ambiguity?

Inputs scoring below the synthetic_input_quality_threshold (default 0.5) are regenerated up to max_quality_retries (default 3) times. If all retries fail, the highest-scoring attempt is used. This is controlled via FiltrationConfig.
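The retry-then-fall-back behavior described above can be sketched in plain Python. This illustrates the logic only and is not DeepEval's internal code; the generate and score callables are hypothetical stand-ins for the generation model and the critic model.

```python
from typing import Callable, Tuple


def filter_with_retries(
    generate: Callable[[], str],
    score: Callable[[str], float],
    threshold: float = 0.5,
    max_retries: int = 3,
) -> Tuple[str, float]:
    """Return the first candidate that meets the threshold, else the best one seen."""
    best_text = generate()
    best_score = score(best_text)
    retries = 0
    while best_score < threshold and retries < max_retries:
        candidate = generate()
        candidate_score = score(candidate)
        retries += 1
        # Keep track of the highest-scoring attempt for the fallback case.
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
    return best_text, best_score


# Canned candidates and scores stand in for real LLM calls.
candidates = iter([
    "It does what?",
    "What does that do?",
    "What does HTTP status code 404 mean?",
])
scores = {
    "It does what?": 0.2,
    "What does that do?": 0.4,
    "What does HTTP status code 404 mean?": 0.9,
}
text, quality = filter_with_retries(lambda: next(candidates), scores.__getitem__)
print(text, quality)
```

In this run the first two candidates fall below 0.5, so the loop regenerates until the third candidate passes.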

Stage 3: Evolution

Inputs are evolved through complexity-increasing transformations inspired by the Evol-Instruct methodology (WizardLM). Each input can undergo one or more evolution steps, producing more challenging and realistic test cases. Seven evolution types are available:

Evolution Type	Description	Context-Adherent
`Evolution.REASONING`	Adds logical reasoning requirements	No
`Evolution.MULTICONTEXT`	Requires synthesizing information from multiple context sources	Yes
`Evolution.CONCRETIZING`	Makes inputs more specific and concrete	Yes
`Evolution.CONSTRAINED`	Adds constraints to the expected response	Yes
`Evolution.COMPARATIVE`	Requires comparing multiple options or entities	Yes
`Evolution.HYPOTHETICAL`	Introduces hypothetical scenarios	No
`Evolution.IN_BREADTH`	Explores broader related topics	No

Context-adherent evolutions use the provided context to guide the transformation, ensuring the evolved input remains answerable from the source material.
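If every evolved input must stay answerable from the source material, one option is to restrict the evolution mix to the context-adherent types. The sketch below mirrors the table above using plain strings rather than DeepEval's Evolution enum; the helper name equal_weights is made up for this example.

```python
from typing import Dict, Iterable

# Mirrors the table above; keys are plain strings, not DeepEval's Evolution enum.
CONTEXT_ADHERENT: Dict[str, bool] = {
    "REASONING": False,
    "MULTICONTEXT": True,
    "CONCRETIZING": True,
    "CONSTRAINED": True,
    "COMPARATIVE": True,
    "HYPOTHETICAL": False,
    "IN_BREADTH": False,
}


def equal_weights(names: Iterable[str]) -> Dict[str, float]:
    """Assign the same sampling probability to every listed evolution type."""
    names = list(names)
    return {name: 1 / len(names) for name in names}


grounded = [name for name, adherent in CONTEXT_ADHERENT.items() if adherent]
weights = equal_weights(grounded)
print(weights)
```

With a real configuration, the same keys would be the corresponding Evolution members passed to EvolutionConfig.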

Stage 4: Styling

The final stage formats inputs and expected outputs according to your specifications. This is controlled via StylingConfig and allows you to ensure generated data matches the format your application expects.

Synthesizer Initialization

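The Synthesizer constructor accepts the following optional parameters, which control global behavior across all generation methods (a further option, conversational_styling_config, applies to multi-turn generation from scratch and is covered under StylingConfig):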

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
)

synthesizer = Synthesizer(
    model="gpt-4.1",                    # LLM for generation
    async_mode=True,                     # concurrent generation
    max_concurrent=100,                  # max parallel operations
    filtration_config=FiltrationConfig(), # quality filtering
    evolution_config=EvolutionConfig(),   # complexity evolution
    styling_config=StylingConfig(),       # output formatting
    cost_tracking=True,                  # print LLM costs
)

Parameter	Type	Default	Description
`model`	`str` or `DeepEvalBaseLLM`	`"gpt-4.1"`	The LLM used for all generation steps. Can be an OpenAI model string or a custom model implementing `DeepEvalBaseLLM`
`async_mode`	`bool`	`True`	Enables concurrent golden generation for faster throughput
`max_concurrent`	`int`	`100`	Maximum number of parallel generation operations when `async_mode=True`
`filtration_config`	`FiltrationConfig`	Default values	Controls quality filtering of generated inputs
`evolution_config`	`EvolutionConfig`	Default values	Controls complexity evolution of inputs
`styling_config`	`StylingConfig`	Default values	Controls output formatting and style
`cost_tracking`	`bool`	`False`	When enabled, prints LLM API costs during generation

FiltrationConfig

Filtration ensures generated inputs meet a minimum quality bar. Inputs that score below the threshold are regenerated.

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig

filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.5,
    max_quality_retries=3,
)
synthesizer = Synthesizer(filtration_config=filtration_config)

Parameter	Type	Default	Description
`critic_model`	`str` or `DeepEvalBaseLLM`	Synthesizer's model, else `"gpt-4.1"`	The LLM used to evaluate input quality. Can differ from the generation model
`synthetic_input_quality_threshold`	`float`	`0.5`	Minimum quality score (0-1) an input must achieve. Higher values produce better inputs but may increase generation time and cost
`max_quality_retries`	`int`	`3`	Maximum regeneration attempts for inputs below the threshold. After exhausting retries, the highest-scoring attempt is used

Quality scores are computed based on self-containment (the input is understandable without external references) and clarity (the input has unambiguous intent). These scores are available in the output DataFrame via the synthetic_input_quality column.

EvolutionConfig

Evolution transforms simple inputs into more complex, realistic test cases. You control which evolution types to apply and their relative sampling probabilities.

python

from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import EvolutionConfig

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/4,
        Evolution.MULTICONTEXT: 1/4,
        Evolution.CONCRETIZING: 1/4,
        Evolution.CONSTRAINED: 1/4,
    },
    num_evolutions=4,
)
synthesizer = Synthesizer(evolution_config=evolution_config)

Parameter	Type	Default	Description
`evolutions`	`dict[Evolution, float]`	Equal distribution	Maps evolution types to their sampling probabilities. Probabilities should sum to 1.0
`num_evolutions`	`int`	`1`	Number of sequential evolution steps applied to each input. Higher values produce more complex inputs but increase cost and may reduce naturalness

When num_evolutions > 1, each step samples an evolution type independently based on the configured probabilities. An input might first be concretized, then have reasoning added, producing a multi-layered complexity increase.
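The per-step sampling can be illustrated with the standard library. This is a sketch under the assumption that each step draws independently (with replacement) from the configured weights; it is not DeepEval's own sampling code, and the evolution types are plain strings here.

```python
import math
import random
from typing import Dict, List


def validate_weights(evolutions: Dict[str, float]) -> None:
    """Raise if the sampling probabilities do not sum to 1.0."""
    total = sum(evolutions.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"evolution probabilities sum to {total}, expected 1.0")


def sample_evolution_sequence(
    evolutions: Dict[str, float], num_evolutions: int, rng: random.Random
) -> List[str]:
    """Draw one evolution type per step, independently, using the weights."""
    validate_weights(evolutions)
    names = list(evolutions)
    weights = [evolutions[name] for name in names]
    return rng.choices(names, weights=weights, k=num_evolutions)


evolutions = {
    "REASONING": 1 / 4,
    "MULTICONTEXT": 1 / 4,
    "CONCRETIZING": 1 / 4,
    "CONSTRAINED": 1 / 4,
}
sequence = sample_evolution_sequence(evolutions, num_evolutions=4, rng=random.Random(0))
print(sequence)
```

Because the draws are independent, the same type can appear more than once in a sequence.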

Evolution Types in Detail

Evolution.REASONING: Transforms inputs to require logical reasoning, multi-step deduction, or inference chains. The evolved input can't be answered by simple extraction -- the model must reason over the context.

Evolution.MULTICONTEXT: Modifies inputs so that answering requires synthesizing information from multiple context chunks. This tests whether the model can combine disparate pieces of information. Context-adherent: the evolved input stays grounded in the provided contexts.

Evolution.CONCRETIZING: Makes abstract or general inputs more specific. For example, "What are the benefits of exercise?" might become "What cardiovascular benefits do adults over 50 gain from 30 minutes of daily walking?" Context-adherent.

Evolution.CONSTRAINED: Adds constraints to the expected response format or content. For example, "Explain quantum computing" might become "Explain quantum computing in exactly three sentences, using no technical jargon." Context-adherent.

Evolution.COMPARATIVE: Transforms inputs to require comparing multiple entities, approaches, or options. For example, "What is supervised learning?" might become "Compare supervised and unsupervised learning in terms of data requirements and typical use cases." Context-adherent.

Evolution.HYPOTHETICAL: Adds hypothetical scenarios or counterfactual reasoning requirements. For example, "What causes inflation?" might become "If a central bank doubled the money supply overnight, what inflationary effects would follow?"

Evolution.IN_BREADTH: Broadens the input to explore related topics, testing whether the model has wider knowledge beyond the specific context provided.

StylingConfig

Styling controls the format and framing of generated inputs and expected outputs. This is essential when your application expects inputs in a specific format (e.g., SQL queries, customer support tickets, technical questions).

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)

Parameter	Type	Default	Description
`input_format`	`str`	`None`	Describes the desired format for generated inputs
`expected_output_format`	`str`	`None`	Describes the desired format for expected outputs
`task`	`str`	`None`	Describes the purpose of the LLM application being evaluated
`scenario`	`str`	`None`	Describes the setting or context in which the application is used

For multi-turn generation from scratch, use ConversationalStylingConfig instead:

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ConversationalStylingConfig

conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Answering text-to-SQL-related queries through conversation",
    scenario_context="Non-technical users trying to query a database using plain English",
    participant_roles="A non-technical user asking database questions and an AI assistant responding",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)

Parameter	Type	Default	Description
`conversational_task`	`str`	`None`	The overall purpose of the conversation
`scenario_context`	`str`	`None`	Environmental details and context for the conversation
`participant_roles`	`str`	`None`	Description of the interaction participants

Method 1: Generate Goldens From Documents

This is the most automated method, designed for RAG systems with existing knowledge bases. It handles document parsing, chunking, embedding, context selection, and golden generation in a single call.

Prerequisites

bash

pip install chromadb langchain-core langchain-community langchain-text-splitters

These dependencies handle document parsing (langchain-text-splitters for chunking, langchain-community for document loaders) and context management (chromadb for embedding storage and retrieval).

Single-Turn Generation

python

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf", "guide.docx"],
    include_expected_output=True,
    max_goldens_per_context=2,
)

Parameter	Type	Default	Description
`document_paths`	`list[str]`	Required	File paths to source documents. Supported formats: `.txt`, `.docx`, `.pdf`, `.md`, `.markdown`, `.mdx`
`include_expected_output`	`bool`	`True`	Generate a reference `expected_output` for each golden
`max_goldens_per_context`	`int`	`2`	Maximum goldens generated per constructed context
`context_construction_config`	`ContextConstructionConfig`	Default values	Controls how contexts are built from documents

The total maximum number of goldens produced is max_goldens_per_context * max_contexts_per_document * number_of_documents, not simply max_goldens_per_context.
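As a quick worked example of that upper bound, the helper below (a name chosen for this sketch) multiplies the three factors. The first call uses the defaults with three documents; the second uses 3 goldens per context, 5 contexts per document, and 3 documents.

```python
def max_total_goldens(
    max_goldens_per_context: int,
    max_contexts_per_document: int,
    number_of_documents: int,
) -> int:
    """Upper bound on goldens produced by one document-based generation call."""
    return max_goldens_per_context * max_contexts_per_document * number_of_documents


print(max_total_goldens(2, 3, 3))  # 18
print(max_total_goldens(3, 5, 3))  # 45
```

This is a ceiling, not a guarantee: the actual count can be lower if fewer contexts are constructed.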

Multi-Turn Generation

python

conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf"],
    include_expected_outcome=True,
    max_goldens_per_context=2,
)

Parameter	Type	Default	Description
`document_paths`	`list[str]`	Required	File paths to source documents
`include_expected_outcome`	`bool`	`True`	Generate `expected_outcome` for each `ConversationalGolden`
`max_goldens_per_context`	`int`	`2`	Maximum goldens per context
`context_construction_config`	`ContextConstructionConfig`	Default values	Controls context construction

ContextConstructionConfig

Unlike other Synthesizer configurations (which are set at initialization), context construction is configured at generation time because it is specific to document-based generation.

python

from deepeval.synthesizer.config import ContextConstructionConfig

goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt"],
    context_construction_config=ContextConstructionConfig(
        embedder="text-embedding-3-small",
        chunk_size=1024,
        chunk_overlap=0,
        max_contexts_per_document=3,
        min_contexts_per_document=1,
        max_context_length=3,
        min_context_length=1,
        context_quality_threshold=0.5,
        context_similarity_threshold=0.5,
        max_retries=3,
        critic_model="gpt-4.1",
    ),
)

Parameter	Type	Default	Description
`embedder`	`str` or `DeepEvalBaseEmbeddingModel`	`"text-embedding-3-small"`	Embedding model for document parsing and context grouping
`chunk_size`	`int`	`1024`	Token size (not character size) of text chunks during document parsing
`chunk_overlap`	`int`	`0`	Token overlap between consecutive chunks
`max_contexts_per_document`	`int`	`3`	Maximum number of contexts extracted from each document
`min_contexts_per_document`	`int`	`1`	Minimum number of contexts extracted from each document
`max_context_length`	`int`	`3`	Maximum number of text chunks grouped into a single context
`min_context_length`	`int`	`1`	Minimum number of text chunks in a context
`context_quality_threshold`	`float`	`0.5`	Minimum quality score (0-1) for a context to be accepted
`context_similarity_threshold`	`float`	`0.5`	Minimum cosine similarity for context grouping
`max_retries`	`int`	`3`	Retry attempts for context selection and grouping failures
`critic_model`	`str` or `DeepEvalBaseLLM`	Synthesizer's model, else `"gpt-4.1"`	LLM used to evaluate context quality scores
`encoding`	`str`	Auto-detected	Text encoding for `.txt`, `.md`, `.markdown`, `.mdx` files

The Context Construction Pipeline

Document-based generation runs three sub-stages before the main generation pipeline:

1. Document Parsing: Documents are split into chunks using TokenTextSplitter at the token level (governed by chunk_size and chunk_overlap). Each chunk is embedded using the configured embedder and stored in a ChromaDB vector database. If chunk_size is too large relative to the document size, an error is raised because there aren't enough unique chunks to build max_contexts_per_document contexts.

2. Context Selection: Random nodes are sampled from the vector database and scored for quality (0-1) by the critic_model. Quality is assessed on four dimensions:

Clarity: Is the information comprehensible?
Depth: Does it contain sufficient detail and insight?
Structure: Is it well-organized and logically coherent?
Relevance: Is it topically focused?

Nodes scoring below context_quality_threshold are re-sampled up to max_retries times. If all retries fail, the highest-scoring node is used regardless.

3. Context Grouping: Selected nodes are grouped with up to max_context_length similar nodes using cosine similarity. Nodes with similarity below context_similarity_threshold are retried up to max_retries times, falling back to the highest-similarity match. This ensures each context group is thematically coherent, producing more focused and answerable goldens.

After context construction completes, the constructed contexts are passed to the same generation pipeline used by generate_goldens_from_contexts().
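The grouping step can be pictured with a small, dependency-free sketch. It is a simplification and not DeepEval's implementation: it omits the retry and fallback behavior, uses toy two-dimensional vectors in place of real embeddings, and the function names are invented for this example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_context(
    seed_index: int,
    embeddings: Sequence[Sequence[float]],
    max_context_length: int = 3,
    similarity_threshold: float = 0.5,
) -> List[int]:
    """Return the seed chunk's index plus its most similar neighbours."""
    seed = embeddings[seed_index]
    scored = [
        (cosine_similarity(seed, emb), i)
        for i, emb in enumerate(embeddings)
        if i != seed_index
    ]
    scored.sort(reverse=True)  # most similar first
    group = [seed_index]
    for similarity, index in scored:
        if len(group) >= max_context_length:
            break
        if similarity >= similarity_threshold:
            group.append(index)
    return group


# Toy embeddings: chunks 0, 1 and 3 point in similar directions; chunk 2 does not.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(group_context(0, embeddings))  # [0, 1, 3]
```

Raising similarity_threshold shrinks the group, which trades context breadth for thematic coherence.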

Method 2: Generate Goldens From Contexts

Use this method when you already have prepared contexts -- for example, chunks stored in a vector database or manually curated context sets. This bypasses all document processing and context construction, feeding your contexts directly into the generation pipeline.

Single-Turn Generation

python

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_output=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)

Parameter	Type	Default	Description
`contexts`	`list[list[str]]`	Required	List of contexts, where each context is a list of related text strings. Strings within each inner list should share a common theme
`include_expected_output`	`bool`	`True`	Generate a reference `expected_output` for each golden
`max_goldens_per_context`	`int`	`2`	Maximum goldens generated per context
`source_files`	`list[str]` or `None`	`None`	Optional source identifiers. If provided, length must match the `contexts` list length
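If your chunks are stored as flat (source, text) records, one way to produce a matching pair of contexts and source_files is to group by source. This sketch assumes that chunks from the same file share a theme, which may not hold for your data; build_contexts is a helper name made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_contexts(chunks: List[Tuple[str, str]]) -> Tuple[List[List[str]], List[str]]:
    """Group (source, text) records into parallel contexts and source_files lists."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for source, text in chunks:
        grouped[source].append(text)
    source_files = list(grouped)  # insertion order: first appearance of each source
    contexts = [grouped[source] for source in source_files]
    return contexts, source_files


chunks = [
    ("astronomy.txt", "The Earth revolves around the Sun in approximately 365.25 days."),
    ("chemistry.txt", "Water freezes at 0 degrees Celsius at standard atmospheric pressure."),
    ("astronomy.txt", "Planets are celestial bodies that orbit stars."),
    ("chemistry.txt", "The chemical formula for water is H2O."),
]
contexts, source_files = build_contexts(chunks)
print(len(contexts) == len(source_files))
```

Building both lists from the same dictionary keeps their lengths equal by construction, which satisfies the length requirement on source_files.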

Multi-Turn Generation

python

conversational_goldens = synthesizer.generate_conversational_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_outcome=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)

Parameter	Type	Default	Description
`contexts`	`list[list[str]]`	Required	List of contexts, each a list of related strings
`include_expected_outcome`	`bool`	`True`	Generate `expected_outcome` for each `ConversationalGolden`
`max_goldens_per_context`	`int`	`2`	Maximum goldens per context
`source_files`	`list[str]` or `None`	`None`	Source identifiers, must match `contexts` length if provided

Relationship to Document-Based Generation

The generate_goldens_from_docs() method calls generate_goldens_from_contexts() under the hood. The only difference is the additional context construction step that parses, chunks, and groups document content into the contexts format. If you have already processed your documents into context groups, using generate_goldens_from_contexts() directly is more efficient.

Method 3: Generate Goldens From Scratch

This method generates goldens without any documents or contexts. It is designed for applications that don't rely on RAG -- for example, chatbots, code generators, text-to-SQL systems, or creative writing assistants. Since there is no source material, the StylingConfig (or ConversationalStylingConfig for multi-turn) becomes essential to guide the generation.

Single-Turn Generation

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)

synthesizer = Synthesizer(styling_config=styling_config)
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=25)

Parameter	Type	Default	Description
`num_goldens`	`int`	Required	The number of synthetic goldens to generate

Without a StylingConfig, the Synthesizer has no guidance on what kind of inputs to generate, making the config effectively mandatory for this method.

Multi-Turn Generation

python

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ConversationalStylingConfig

conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Helping users write and debug SQL queries through conversation",
    scenario_context="Non-technical users interacting with a database assistant",
    participant_roles="A user asking database questions and an AI SQL assistant",
)

synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
conversational_goldens = synthesizer.generate_conversational_goldens_from_scratch(
    num_goldens=25,
)

Parameter	Type	Default	Description
`num_goldens`	`int`	Required	The number of conversational goldens to generate

When to Use From-Scratch Generation

No existing knowledge base: Your application generates responses from parametric knowledge or external APIs, not from documents
Task-specific evaluation: You need test cases that match a specific input/output format (SQL queries, code, structured data)
Bootstrapping: You are starting evaluation from zero and need an initial dataset to iterate on
Coverage testing: You want to test edge cases and scenarios that your existing documents don't cover

Method 4: Generate Goldens From Goldens

This method augments an existing set of goldens by generating new variations. It is useful for expanding small evaluation datasets, increasing diversity, or creating more challenging versions of existing test cases.

Single-Turn Generation

python

from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden

existing_goldens = [
    Golden(
        input="What is the capital of France?",
        expected_output="Paris",
        context=["Paris is the capital and most populous city of France."],
    ),
    Golden(
        input="What is photosynthesis?",
        expected_output="The process by which plants convert sunlight into energy.",
        context=["Photosynthesis is a process used by plants to convert light energy into chemical energy."],
    ),
]

synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=existing_goldens,
    max_goldens_per_golden=2,
    include_expected_output=True,
)

Parameter	Type	Default	Description
`goldens`	`list[Golden]`	Required	Existing goldens to use as templates for generating new ones
`max_goldens_per_golden`	`int`	`2`	Maximum number of new goldens generated from each existing golden
`include_expected_output`	`bool`	`True`	Generate `expected_output` for each new golden

Multi-Turn Generation

python

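# existing_conversational_goldens must be a list of ConversationalGolden objects;
# the single-turn existing_goldens defined above cannot be passed here.
new_conversational_goldens = synthesizer.generate_conversational_goldens_from_goldens(
    goldens=existing_conversational_goldens,
    max_goldens_per_golden=2,
    include_expected_outcome=True,
)

Parameter	Type	Default	Description
`goldens`	`list[ConversationalGolden]`	Required	Existing conversational goldens as generation templates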
`max_goldens_per_golden`	`int`	`2`	Maximum new goldens per existing golden
`include_expected_outcome`	`bool`	`True`	Generate `expected_outcome` for each `ConversationalGolden`

Critical Constraints

Context requirement for expected outputs: Generated goldens will contain expected_output only if your existing goldens contain context. When context is present, the Synthesizer uses it to ground the new goldens in factual content. Without context, the method falls back to from-scratch techniques based on the input patterns alone.

Single/multi-turn symmetry: You can only generate single-turn goldens from existing single-turn goldens, and conversational goldens from existing conversational goldens. You cannot mix the two.

StylingConfig recommendation: While the method can extract styling patterns from existing goldens, explicitly providing a StylingConfig produces more accurate and consistent results.

Saving and Exporting Generated Data

Push to Confident AI

python

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.push(alias="My Generated Dataset")

This uploads the dataset to Confident AI's platform for versioning, collaboration, and integration with DeepEval's evaluation pipeline.

Save Locally

python

synthesizer.save_as(
    file_type="json",              # "json" or "csv"
    directory="./synthetic_data",
    file_name="my_dataset",        # optional, without extension
)

Parameter	Type	Default	Description
`file_type`	`str`	Required	Output format: `"json"` or `"csv"`
`directory`	`str`	Required	Folder path for the saved file
`file_name`	`str`	`None`	Custom filename without extension. Auto-generated if omitted
`quiet`	`bool`	`False`	Suppress output messages when `True`

Inspect as DataFrame

python

df = synthesizer.to_pandas()
print(df.columns.tolist())

The DataFrame includes these columns:

Column	Description
`input`	The generated question or prompt
`actual_output`	Always `None` (populated by your application)
`expected_output`	Reference answer from the Synthesizer
`context`	Source knowledge base text
`retrieval_context`	Retrieved passages (populated during evaluation)
`n_chunks_per_context`	Number of text chunks in the context
`context_length`	Character length of the context
`context_quality`	Context quality score (0-1), from context construction
`synthetic_input_quality`	Input quality score (0-1), from filtration
`evolutions`	Sequence of evolution types applied
`source_file`	Original document source

End-to-End Example

Here is a complete workflow that generates goldens from documents, configures all pipeline stages, and exports the results:

python

from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
    ContextConstructionConfig,
)
from deepeval.dataset import EvaluationDataset

# Configure pipeline stages
filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5,
)

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 0.3,
        Evolution.MULTICONTEXT: 0.3,
        Evolution.CONCRETIZING: 0.2,
        Evolution.COMPARATIVE: 0.2,
    },
    num_evolutions=2,
)

styling_config = StylingConfig(
    input_format="Technical questions about machine learning concepts",
    expected_output_format="Detailed explanations with examples",
    task="Answering ML-related questions from a knowledge base",
    scenario="ML engineers looking up concepts in internal documentation",
)

# Initialize Synthesizer
synthesizer = Synthesizer(
    model="gpt-4.1",
    async_mode=True,
    max_concurrent=50,
    filtration_config=filtration_config,
    evolution_config=evolution_config,
    styling_config=styling_config,
    cost_tracking=True,
)

# Configure context construction
context_config = ContextConstructionConfig(
    chunk_size=512,
    chunk_overlap=50,
    max_contexts_per_document=5,
    max_context_length=3,
    context_quality_threshold=0.6,
    context_similarity_threshold=0.6,
)

# Generate goldens
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/architecture.md", "docs/api-reference.md", "docs/tutorials.pdf"],
    include_expected_output=True,
    max_goldens_per_context=3,
    context_construction_config=context_config,
)

# Inspect results
df = synthesizer.to_pandas()
print(f"Generated {len(goldens)} goldens")
print(f"Average input quality: {df['synthetic_input_quality'].mean():.3f}")
print(f"Average context quality: {df['context_quality'].mean():.3f}")

# Save locally and to Confident AI
synthesizer.save_as(file_type="json", directory="./eval_data", file_name="ml_docs_goldens")
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="ML Docs Evaluation Set v1")

Choosing the Right Generation Method

Scenario	Method	Why
RAG system with document corpus	`generate_goldens_from_docs`	Automated end-to-end pipeline from raw documents to goldens
Pre-chunked data in vector DB	`generate_goldens_from_contexts`	Skip document processing, pass existing text chunks directly as contexts
Non-RAG application (chatbot, code gen)	`generate_goldens_from_scratch`	No source material needed, guided by task description
Small existing eval dataset	`generate_goldens_from_goldens`	Augment and diversify existing test cases
Mixed: some docs + some manual cases	Combine methods	Use `generate_goldens_from_docs` for document coverage, `generate_goldens_from_goldens` to augment edge cases

Practical Considerations

Manual inspection is essential: Synthetic data generation is not a fire-and-forget process. Always review a sample of generated goldens before using them for evaluation. Common issues include:

Inputs that are too vague or too specific to be realistic
Expected outputs that contradict the source context
Evolution producing unnaturally complex or convoluted inputs
Context grouping that combines unrelated topics
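One way to organize that review is to look at every low-scoring golden plus a random sample of the rest. The sketch below works on plain dictionaries whose keys follow the DataFrame column names listed earlier; the function name, the 0.5 floor, and the sample size are arbitrary choices for illustration.

```python
import random
from typing import Any, Dict, List, Optional


def is_low(value: Optional[float], floor: float) -> bool:
    """A missing score is treated as not low (e.g. no context_quality for non-document methods)."""
    return value is not None and value < floor


def select_for_review(
    records: List[Dict[str, Any]],
    quality_floor: float = 0.5,
    sample_size: int = 2,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Return all low-quality records followed by a random sample of the others."""
    flagged, rest = [], []
    for record in records:
        low = is_low(record.get("synthetic_input_quality"), quality_floor) or is_low(
            record.get("context_quality"), quality_floor
        )
        (flagged if low else rest).append(record)
    rng = random.Random(seed)  # fixed seed so the review sample is repeatable
    return flagged + rng.sample(rest, min(sample_size, len(rest)))


records = [
    {"input": "q1", "synthetic_input_quality": 0.9, "context_quality": 0.8},
    {"input": "q2", "synthetic_input_quality": 0.4, "context_quality": 0.9},
    {"input": "q3", "synthetic_input_quality": 0.7, "context_quality": None},
    {"input": "q4", "synthetic_input_quality": 0.8, "context_quality": 0.3},
    {"input": "q5", "synthetic_input_quality": 0.95, "context_quality": 0.7},
]
review = select_for_review(records)
print([record["input"] for record in review])
```

Scores only catch some of the issues listed above, so the random sample matters as much as the flagged records.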

Cost management: Each golden requires multiple LLM calls (generation, filtration scoring, evolution, styling). Set cost_tracking=True to monitor spend. Reduce costs by lowering max_quality_retries, using fewer evolution steps, or using a cheaper critic_model while keeping a stronger model for generation.

OpenAI API key: The default embedder (text-embedding-3-small) and model (gpt-4.1) require an OPENAI_API_KEY. For non-OpenAI setups, provide a custom DeepEvalBaseLLM for the model and a custom DeepEvalBaseEmbeddingModel for the embedder.

Scaling: For large document corpora, keep async_mode enabled and raise max_concurrent as far as your model provider's rate limits allow. For very large datasets (thousands of goldens), consider generating in batches and merging the results to avoid timeout issues.
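A batch-and-merge loop might look like the following sketch. The helpers batched and generate_in_batches are written for this example, and fake_generate stands in for a real generation call so the code runs without an API key.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
G = TypeVar("G")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def generate_in_batches(
    document_paths: Sequence[str],
    generate_batch: Callable[[List[str]], List[G]],
    batch_size: int = 10,
) -> List[G]:
    """Call generate_batch once per batch and merge the results in order."""
    merged: List[G] = []
    for batch in batched(document_paths, batch_size):
        merged.extend(generate_batch(batch))
    return merged


def fake_generate(paths: List[str]) -> List[str]:
    """Placeholder for a real generation call; returns one label per document."""
    return [f"golden for {path}" for path in paths]


paths = [f"doc_{i}.md" for i in range(5)]
all_goldens = generate_in_batches(paths, fake_generate, batch_size=2)
print(len(all_goldens))  # 5
```

With DeepEval, generate_batch could wrap a call such as synthesizer.generate_goldens_from_docs(document_paths=batch); saving each batch's results before starting the next also limits how much work a failure can lose.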

Reproducibility: The Synthesizer uses LLM-based generation, which is inherently non-deterministic. Running the same configuration twice will produce different goldens. For reproducible evaluation datasets, generate once, inspect, curate, and version the results rather than regenerating each time.