
DeepEval Synthesizer: Synthetic Golden Generation for LLM Evaluation

🛡 Phase 5 · Evals, Safety & Observability · Intermediate · ~20 min read
Recommended prerequisite: #49 Eval Frameworks Comparison: DeepEval, Promptfoo, RAGAS, Braintrust, LangSmith & More

Building robust evaluation datasets is one of the most time-consuming bottlenecks in LLM application development. DeepEval's Synthesizer addresses this by providing a systematic pipeline for generating high-quality synthetic test data -- called "Goldens" -- that can be used to evaluate LLM systems without manually authoring hundreds of test cases. The Synthesizer supports four distinct generation strategies, each suited to different data availability scenarios: generating from documents, from pre-built contexts, from scratch without any knowledge base, and from existing goldens. Each strategy produces both single-turn and multi-turn (conversational) goldens, and the entire pipeline is governed by configurable filtration, evolution, and styling stages that control the quality, complexity, and format of the generated data.

What Are Goldens?

A Golden is a single test case in DeepEval's evaluation framework. Each Golden represents a structured data point with fields that correspond to inputs, expected outputs, and supporting context. The Synthesizer generates these programmatically so you can build evaluation datasets at scale.

For single-turn goldens, a Golden object contains:

  • input: The generated question or prompt
  • actual_output: Always None after generation -- this field is populated later when you run the input through your LLM application
  • expected_output: A reference answer generated by the Synthesizer (when include_expected_output=True)
  • context: The knowledge base source text used to ground the golden
  • retrieval_context: Retrieved passages (populated during evaluation)
  • source_file: The original document the context was extracted from

For multi-turn goldens, a ConversationalGolden object follows the same structure but represents a full conversation with multiple turns, and uses expected_outcome instead of expected_output.

A critical distinction: the Synthesizer does not generate actual_output values. These come from running your LLM application against the generated inputs. Similarly, multi-turn generation does not create individual turns -- use DeepEval's ConversationSimulator for that purpose.
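
To make the actual_output flow concrete, here is a minimal post-generation sketch that runs each golden's input through your application and builds test cases. Assumptions: my_llm_app is a hypothetical stand-in for your application, and goldens is a list produced by the Synthesizer.

python
from deepeval.test_case import LLMTestCase

def my_llm_app(prompt: str) -> str:
    # Hypothetical stand-in for your actual LLM application call
    ...

test_cases = [
    LLMTestCase(
        input=golden.input,
        actual_output=my_llm_app(golden.input),  # populated here, not by the Synthesizer
        expected_output=golden.expected_output,
        context=golden.context,
    )
    for golden in goldens  # goldens produced by the Synthesizer
]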

The Generation Pipeline

Regardless of which generation method you use, the Synthesizer follows a four-stage pipeline:

Stage 1: Input Generation

The Synthesizer creates synthetic input values -- questions, prompts, or queries -- based on the source material you provide. The mechanism varies by method:

  • From docs: Documents are parsed, chunked, embedded, and stored in a vector database. Contexts are selected and grouped, then used to generate inputs.
  • From contexts: Pre-prepared context lists are used directly, skipping document processing.
  • From scratch: Inputs are generated purely from a task description and scenario, with no source documents or contexts.
  • From goldens: Existing goldens serve as templates for generating new, varied inputs.

Stage 2: Filtration

Generated inputs are scored on quality (0 to 1) based on two criteria:

  • Self-containment: Can the input be understood without external references?
  • Clarity: Does the input have clear intent without ambiguity?

Inputs scoring below the synthetic_input_quality_threshold (default 0.5) are regenerated up to max_quality_retries (default 3) times. If all retries fail, the highest-scoring attempt is used. This is controlled via FiltrationConfig.

Stage 3: Evolution

Inputs are evolved through complexity-increasing transformations inspired by the Evol-Instruct methodology (WizardLM). Each input can undergo one or more evolution steps, producing more challenging and realistic test cases. Seven evolution types are available:

  • Evolution.REASONING -- Adds logical reasoning requirements (not context-adherent)
  • Evolution.MULTICONTEXT -- Requires synthesizing information from multiple context sources (context-adherent)
  • Evolution.CONCRETIZING -- Makes inputs more specific and concrete (context-adherent)
  • Evolution.CONSTRAINED -- Adds constraints to the expected response (context-adherent)
  • Evolution.COMPARATIVE -- Requires comparing multiple options or entities (context-adherent)
  • Evolution.HYPOTHETICAL -- Introduces hypothetical scenarios (not context-adherent)
  • Evolution.IN_BREADTH -- Explores broader related topics (not context-adherent)

Context-adherent evolutions use the provided context to guide the transformation, ensuring the evolved input remains answerable from the source material.

Stage 4: Styling

The final stage formats inputs and expected outputs according to your specifications. This is controlled via StylingConfig and allows you to ensure generated data matches the format your application expects.

Synthesizer Initialization

The Synthesizer constructor accepts several optional parameters that control global behavior across all generation methods:

python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
)

synthesizer = Synthesizer(
    model="gpt-4.1",                    # LLM for generation
    async_mode=True,                     # concurrent generation
    max_concurrent=100,                  # max parallel operations
    filtration_config=FiltrationConfig(), # quality filtering
    evolution_config=EvolutionConfig(),   # complexity evolution
    styling_config=StylingConfig(),       # output formatting
    cost_tracking=True,                  # print LLM costs
)

  • model (str or DeepEvalBaseLLM; default "gpt-4.1") -- The LLM used for all generation steps. Can be an OpenAI model string or a custom model implementing DeepEvalBaseLLM
  • async_mode (bool; default True) -- Enables concurrent golden generation for faster throughput
  • max_concurrent (int; default 100) -- Maximum number of parallel generation operations when async_mode=True
  • filtration_config (FiltrationConfig; default values) -- Controls quality filtering of generated inputs
  • evolution_config (EvolutionConfig; default values) -- Controls complexity evolution of inputs
  • styling_config (StylingConfig; default values) -- Controls output formatting and style
  • cost_tracking (bool; default False) -- When enabled, prints LLM API costs during generation

FiltrationConfig

Filtration ensures generated inputs meet a minimum quality bar. Inputs that score below the threshold are regenerated.

python
from deepeval.synthesizer.config import FiltrationConfig

filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.5,
    max_quality_retries=3,
)
synthesizer = Synthesizer(filtration_config=filtration_config)

  • critic_model (str or DeepEvalBaseLLM; defaults to the Synthesizer's model, else "gpt-4.1") -- The LLM used to evaluate input quality. Can differ from the generation model
  • synthetic_input_quality_threshold (float; default 0.5) -- Minimum quality score (0-1) an input must achieve. Higher values produce better inputs but may increase generation time and cost
  • max_quality_retries (int; default 3) -- Maximum regeneration attempts for inputs below the threshold. After exhausting retries, the highest-scoring attempt is used

Quality scores are computed based on self-containment (the input is understandable without external references) and clarity (the input has unambiguous intent). These scores are available in the output DataFrame via the synthetic_input_quality column.

EvolutionConfig

Evolution transforms simple inputs into more complex, realistic test cases. You control which evolution types to apply and their relative sampling probabilities.

python
from deepeval.synthesizer.config import EvolutionConfig
from deepeval.synthesizer import Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/4,
        Evolution.MULTICONTEXT: 1/4,
        Evolution.CONCRETIZING: 1/4,
        Evolution.CONSTRAINED: 1/4,
    },
    num_evolutions=4,
)
synthesizer = Synthesizer(evolution_config=evolution_config)

  • evolutions (dict[Evolution, float]; default: equal distribution) -- Maps evolution types to their sampling probabilities. Probabilities should sum to 1.0
  • num_evolutions (int; default 1) -- Number of sequential evolution steps applied to each input. Higher values produce more complex inputs but increase cost and may reduce naturalness

When num_evolutions > 1, each step samples an evolution type independently based on the configured probabilities. An input might first be concretized, then have reasoning added, producing a multi-layered complexity increase.
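
As an illustration of that independent per-step sampling, here is a small sketch (illustrative only, not DeepEval's internal implementation):

python
import random

from deepeval.synthesizer import Evolution

# Each evolution step draws a type independently,
# weighted by the configured probabilities.
evolutions = {Evolution.CONCRETIZING: 0.5, Evolution.REASONING: 0.5}
num_evolutions = 2

steps = random.choices(
    population=list(evolutions.keys()),
    weights=list(evolutions.values()),
    k=num_evolutions,
)
print(steps)  # e.g. [Evolution.CONCRETIZING, Evolution.REASONING]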

Evolution Types in Detail

Evolution.REASONING: Transforms inputs to require logical reasoning, multi-step deduction, or inference chains. The evolved input can't be answered by simple extraction -- the model must reason over the context.

Evolution.MULTICONTEXT: Modifies inputs so that answering requires synthesizing information from multiple context chunks. This tests whether the model can combine disparate pieces of information. Context-adherent: the evolved input stays grounded in the provided contexts.

Evolution.CONCRETIZING: Makes abstract or general inputs more specific. For example, "What are the benefits of exercise?" might become "What cardiovascular benefits do adults over 50 gain from 30 minutes of daily walking?" Context-adherent.

Evolution.CONSTRAINED: Adds constraints to the expected response format or content. For example, "Explain quantum computing" might become "Explain quantum computing in exactly three sentences, using no technical jargon." Context-adherent.

Evolution.COMPARATIVE: Transforms inputs to require comparing multiple entities, approaches, or options. For example, "What is supervised learning?" might become "Compare supervised and unsupervised learning in terms of data requirements and typical use cases." Context-adherent.

Evolution.HYPOTHETICAL: Adds hypothetical scenarios or counterfactual reasoning requirements. For example, "What causes inflation?" might become "If a central bank doubled the money supply overnight, what inflationary effects would follow?"

Evolution.IN_BREADTH: Broadens the input to explore related topics, testing whether the model has wider knowledge beyond the specific context provided.

StylingConfig

Styling controls the format and framing of generated inputs and expected outputs. This is essential when your application expects inputs in a specific format (e.g., SQL queries, customer support tickets, technical questions).

python
from deepeval.synthesizer.config import StylingConfig

styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)
synthesizer = Synthesizer(styling_config=styling_config)

  • input_format (str; default None) -- Describes the desired format for generated inputs
  • expected_output_format (str; default None) -- Describes the desired format for expected outputs
  • task (str; default None) -- Describes the purpose of the LLM application being evaluated
  • scenario (str; default None) -- Describes the setting or context in which the application is used

For multi-turn generation from scratch, use ConversationalStylingConfig instead:

python
from deepeval.synthesizer.config import ConversationalStylingConfig

conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Answering text-to-SQL-related queries through conversation",
    scenario_context="Non-technical users trying to query a database using plain English",
    participant_roles="A non-technical user asking database questions and an AI assistant responding",
)
synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)

  • conversational_task (str; default None) -- The overall purpose of the conversation
  • scenario_context (str; default None) -- Environmental details and context for the conversation
  • participant_roles (str; default None) -- Description of the interaction participants

Method 1: Generate Goldens From Documents

This is the most automated method, designed for RAG systems with existing knowledge bases. It handles document parsing, chunking, embedding, context selection, and golden generation in a single call.

Prerequisites

bash
pip install chromadb langchain-core langchain-community langchain-text-splitters

These dependencies handle document parsing (langchain-text-splitters for chunking, langchain-community for document loaders) and context management (chromadb for embedding storage and retrieval).

Single-Turn Generation

python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf", "guide.docx"],
    include_expected_output=True,
    max_goldens_per_context=2,
)

  • document_paths (list[str]; required) -- File paths to source documents. Supported formats: .txt, .docx, .pdf, .md, .markdown, .mdx
  • include_expected_output (bool; default True) -- Generate a reference expected_output for each golden
  • max_goldens_per_context (int; default 2) -- Maximum goldens generated per constructed context
  • context_construction_config (ContextConstructionConfig; default values) -- Controls how contexts are built from documents

The total maximum number of goldens produced is max_goldens_per_context * max_contexts_per_document * number_of_documents, not simply max_goldens_per_context. With the defaults (2 goldens per context, 3 contexts per document), three documents yield up to 2 × 3 × 3 = 18 goldens.

Multi-Turn Generation

python
conversational_goldens = synthesizer.generate_conversational_goldens_from_docs(
    document_paths=["knowledge_base.txt", "faq.pdf"],
    include_expected_outcome=True,
    max_goldens_per_context=2,
)

  • document_paths (list[str]; required) -- File paths to source documents
  • include_expected_outcome (bool; default True) -- Generate expected_outcome for each ConversationalGolden
  • max_goldens_per_context (int; default 2) -- Maximum goldens per context
  • context_construction_config (ContextConstructionConfig; default values) -- Controls context construction

ContextConstructionConfig

Unlike other Synthesizer configurations (which are set at initialization), context construction is configured at generation time because it is specific to document-based generation.

python
from deepeval.synthesizer.config import ContextConstructionConfig

synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base.txt"],
    context_construction_config=ContextConstructionConfig(
        embedder="text-embedding-3-small",
        chunk_size=1024,
        chunk_overlap=0,
        max_contexts_per_document=3,
        min_contexts_per_document=1,
        max_context_length=3,
        min_context_length=1,
        context_quality_threshold=0.5,
        context_similarity_threshold=0.5,
        max_retries=3,
        critic_model="gpt-4.1",
    ),
)

  • embedder (str or DeepEvalBaseEmbeddingModel; default "text-embedding-3-small") -- Embedding model for document parsing and context grouping
  • chunk_size (int; default 1024) -- Token size (not character size) of text chunks during document parsing
  • chunk_overlap (int; default 0) -- Token overlap between consecutive chunks
  • max_contexts_per_document (int; default 3) -- Maximum number of contexts extracted from each document
  • min_contexts_per_document (int; default 1) -- Minimum number of contexts extracted from each document
  • max_context_length (int; default 3) -- Maximum number of text chunks grouped into a single context
  • min_context_length (int; default 1) -- Minimum number of text chunks in a context
  • context_quality_threshold (float; default 0.5) -- Minimum quality score (0-1) for a context to be accepted
  • context_similarity_threshold (float; default 0.5) -- Minimum cosine similarity for context grouping
  • max_retries (int; default 3) -- Retry attempts for context selection and grouping failures
  • critic_model (str or DeepEvalBaseLLM; defaults to the Synthesizer's model, else "gpt-4.1") -- LLM used to evaluate context quality scores
  • encoding (str; auto-detected) -- Text encoding for .txt, .md, .markdown, .mdx files

The Context Construction Pipeline

Document-based generation runs three sub-stages before the main generation pipeline:

1. Document Parsing: Documents are split into chunks using TokenTextSplitter at the token level (governed by chunk_size and chunk_overlap). Each chunk is embedded using the configured embedder and stored in a ChromaDB vector database. If chunk_size is too large relative to the document size, an error is raised because there aren't enough unique chunks to build max_contexts_per_document contexts.

2. Context Selection: Random nodes are sampled from the vector database and scored for quality (0-1) by the critic_model. Quality is assessed on four dimensions:

  • Clarity: Is the information comprehensible?
  • Depth: Does it contain sufficient detail and insight?
  • Structure: Is it well-organized and logically coherent?
  • Relevance: Is it topically focused?

Nodes scoring below context_quality_threshold are re-sampled up to max_retries times. If all retries fail, the highest-scoring node is used regardless.

3. Context Grouping: Selected nodes are grouped with up to max_context_length similar nodes using cosine similarity. Nodes with similarity below context_similarity_threshold are retried up to max_retries times, falling back to the highest-similarity match. This ensures each context group is thematically coherent, producing more focused and answerable goldens.
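
The similarity check in context grouping is ordinary cosine similarity over the chunk embeddings. A minimal sketch of the underlying computation (illustrative, not DeepEval's internal code):

python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A candidate chunk joins a context group only if its similarity to the
# selected node meets context_similarity_threshold (default 0.5).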

After context construction completes, the constructed contexts are passed to the same generation pipeline used by generate_goldens_from_contexts().

Method 2: Generate Goldens From Contexts

Use this method when you already have prepared contexts -- for example, chunks stored in a vector database or manually curated context sets. This bypasses all document processing and context construction, feeding your contexts directly into the generation pipeline.

Single-Turn Generation

python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_output=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)

  • contexts (list[list[str]]; required) -- List of contexts, where each context is a list of related text strings. Strings within each inner list should share a common theme
  • include_expected_output (bool; default True) -- Generate a reference expected_output for each golden
  • max_goldens_per_context (int; default 2) -- Maximum goldens generated per context
  • source_files (list[str] or None; default None) -- Optional source identifiers. If provided, length must match the contexts list length

Multi-Turn Generation

python
conversational_goldens = synthesizer.generate_conversational_goldens_from_contexts(
    contexts=[
        [
            "The Earth revolves around the Sun in approximately 365.25 days.",
            "Planets are celestial bodies that orbit stars.",
        ],
        [
            "Water freezes at 0 degrees Celsius at standard atmospheric pressure.",
            "The chemical formula for water is H2O.",
        ],
    ],
    include_expected_outcome=True,
    max_goldens_per_context=2,
    source_files=["astronomy.txt", "chemistry.txt"],
)

  • contexts (list[list[str]]; required) -- List of contexts, each a list of related strings
  • include_expected_outcome (bool; default True) -- Generate expected_outcome for each ConversationalGolden
  • max_goldens_per_context (int; default 2) -- Maximum goldens per context
  • source_files (list[str] or None; default None) -- Source identifiers; must match contexts length if provided

Relationship to Document-Based Generation

The generate_goldens_from_docs() method calls generate_goldens_from_contexts() under the hood. The only difference is the additional context construction step that parses, chunks, and groups document content into the contexts format. If you have already processed your documents into context groups, using generate_goldens_from_contexts() directly is more efficient.

Method 3: Generate Goldens From Scratch

This method generates goldens without any documents or contexts. It is designed for applications that don't rely on RAG -- for example, chatbots, code generators, text-to-SQL systems, or creative writing assistants. Since there is no source material, the StylingConfig (or ConversationalStylingConfig for multi-turn) becomes essential to guide the generation.

Single-Turn Generation

python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

styling_config = StylingConfig(
    input_format="Questions in English that ask for data in a database",
    expected_output_format="SQL query based on the given input",
    task="Answering text-to-SQL-related queries by querying a database and returning results to users",
    scenario="Non-technical users trying to query a database using plain English",
)

synthesizer = Synthesizer(styling_config=styling_config)
goldens = synthesizer.generate_goldens_from_scratch(num_goldens=25)

  • num_goldens (int; required) -- The number of synthetic goldens to generate

Without a StylingConfig, the Synthesizer has no guidance on what kind of inputs to generate, making the config effectively mandatory for this method.

Multi-Turn Generation

python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ConversationalStylingConfig

conversational_styling_config = ConversationalStylingConfig(
    conversational_task="Helping users write and debug SQL queries through conversation",
    scenario_context="Non-technical users interacting with a database assistant",
    participant_roles="A user asking database questions and an AI SQL assistant",
)

synthesizer = Synthesizer(conversational_styling_config=conversational_styling_config)
conversational_goldens = synthesizer.generate_conversational_goldens_from_scratch(
    num_goldens=25,
)

  • num_goldens (int; required) -- The number of conversational goldens to generate

When to Use From-Scratch Generation

  • No existing knowledge base: Your application generates responses from parametric knowledge or external APIs, not from documents
  • Task-specific evaluation: You need test cases that match a specific input/output format (SQL queries, code, structured data)
  • Bootstrapping: You are starting evaluation from zero and need an initial dataset to iterate on
  • Coverage testing: You want to test edge cases and scenarios that your existing documents don't cover

Method 4: Generate Goldens From Goldens

This method augments an existing set of goldens by generating new variations. It is useful for expanding small evaluation datasets, increasing diversity, or creating more challenging versions of existing test cases.

Single-Turn Generation

python
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import Golden

existing_goldens = [
    Golden(
        input="What is the capital of France?",
        expected_output="Paris",
        context=["Paris is the capital and most populous city of France."],
    ),
    Golden(
        input="What is photosynthesis?",
        expected_output="The process by which plants convert sunlight into energy.",
        context=["Photosynthesis is a process used by plants to convert light energy into chemical energy."],
    ),
]

synthesizer = Synthesizer()
new_goldens = synthesizer.generate_goldens_from_goldens(
    goldens=existing_goldens,
    max_goldens_per_golden=2,
    include_expected_output=True,
)

  • goldens (list[Golden]; required) -- Existing goldens to use as templates for generating new ones
  • max_goldens_per_golden (int; default 2) -- Maximum number of new goldens generated from each existing golden
  • include_expected_output (bool; default True) -- Generate expected_output for each new golden

Multi-Turn Generation

python
new_conversational_goldens = synthesizer.generate_conversational_goldens_from_goldens(
    goldens=existing_goldens,
    max_goldens_per_golden=2,
    include_expected_outcome=True,
)

  • goldens (list[Golden]; required) -- Existing goldens as generation templates
  • max_goldens_per_golden (int; default 2) -- Maximum new goldens per existing golden
  • include_expected_outcome (bool; default True) -- Generate expected_outcome for each ConversationalGolden

Critical Constraints

Context requirement for expected outputs: Generated goldens will contain expected_output only if your existing goldens contain context. When context is present, the Synthesizer uses it to ground the new goldens in factual content. Without context, the method falls back to from-scratch techniques based on the input patterns alone.

Single/multi-turn symmetry: You can only generate single-turn goldens from existing single-turn goldens, and conversational goldens from existing conversational goldens. You cannot mix the two.

StylingConfig recommendation: While the method can extract styling patterns from existing goldens, explicitly providing a StylingConfig produces more accurate and consistent results.

Saving and Exporting Generated Data

Push to Confident AI

python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.push(alias="My Generated Dataset")

This uploads the dataset to Confident AI's platform for versioning, collaboration, and integration with DeepEval's evaluation pipeline.
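
Once pushed, the dataset can be pulled back down by alias in later sessions (this assumes you are logged in to Confident AI):

python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Generated Dataset")  # retrieves the versioned goldens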

Save Locally

python
synthesizer.save_as(
    file_type="json",              # "json" or "csv"
    directory="./synthetic_data",
    file_name="my_dataset",        # optional, without extension
)

  • file_type (str; required) -- Output format: "json" or "csv"
  • directory (str; required) -- Folder path for the saved file
  • file_name (str; default None) -- Custom filename without extension; auto-generated if omitted
  • quiet (bool; default False) -- Suppress output messages when True

Inspect as DataFrame

python
df = synthesizer.to_pandas()
print(df.columns.tolist())

The DataFrame includes these columns:

  • input -- The generated question or prompt
  • actual_output -- Always None (populated by your application)
  • expected_output -- Reference answer from the Synthesizer
  • context -- Source knowledge base text
  • retrieval_context -- Retrieved passages (populated during evaluation)
  • n_chunks_per_context -- Number of text chunks in the context
  • context_length -- Character length of the context
  • context_quality -- Context quality score (0-1), from context construction
  • synthetic_input_quality -- Input quality score (0-1), from filtration
  • evolutions -- Sequence of evolution types applied
  • source_file -- Original document source
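
The quality columns make it easy to spot-check and curate generated data before using it. A small sketch, using an arbitrary 0.6 quality bar:

python
df = synthesizer.to_pandas()

# Flag goldens whose inputs scored below the chosen quality bar
low_quality = df[df["synthetic_input_quality"] < 0.6]
print(f"{len(low_quality)} of {len(df)} goldens fall below the 0.6 bar")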

End-to-End Example

Here is a complete workflow that generates goldens from documents, configures all pipeline stages, and exports the results:

python
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import (
    FiltrationConfig,
    EvolutionConfig,
    StylingConfig,
    ContextConstructionConfig,
)
from deepeval.dataset import EvaluationDataset

# Configure pipeline stages
filtration_config = FiltrationConfig(
    critic_model="gpt-4.1",
    synthetic_input_quality_threshold=0.7,
    max_quality_retries=5,
)

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 0.3,
        Evolution.MULTICONTEXT: 0.3,
        Evolution.CONCRETIZING: 0.2,
        Evolution.COMPARATIVE: 0.2,
    },
    num_evolutions=2,
)

styling_config = StylingConfig(
    input_format="Technical questions about machine learning concepts",
    expected_output_format="Detailed explanations with examples",
    task="Answering ML-related questions from a knowledge base",
    scenario="ML engineers looking up concepts in internal documentation",
)

# Initialize Synthesizer
synthesizer = Synthesizer(
    model="gpt-4.1",
    async_mode=True,
    max_concurrent=50,
    filtration_config=filtration_config,
    evolution_config=evolution_config,
    styling_config=styling_config,
    cost_tracking=True,
)

# Configure context construction
context_config = ContextConstructionConfig(
    chunk_size=512,
    chunk_overlap=50,
    max_contexts_per_document=5,
    max_context_length=3,
    context_quality_threshold=0.6,
    context_similarity_threshold=0.6,
)

# Generate goldens
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/architecture.md", "docs/api-reference.md", "docs/tutorials.pdf"],
    include_expected_output=True,
    max_goldens_per_context=3,
    context_construction_config=context_config,
)

# Inspect results
df = synthesizer.to_pandas()
print(f"Generated {len(goldens)} goldens")
print(f"Average input quality: {df['synthetic_input_quality'].mean():.3f}")
print(f"Average context quality: {df['context_quality'].mean():.3f}")

# Save locally and to Confident AI
synthesizer.save_as(file_type="json", directory="./eval_data", file_name="ml_docs_goldens")
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="ML Docs Evaluation Set v1")

Choosing the Right Generation Method

  • RAG system with a document corpus: generate_goldens_from_docs -- automated end-to-end pipeline from raw documents to goldens
  • Pre-chunked data in a vector DB: generate_goldens_from_contexts -- skips document processing and feeds your existing chunks directly into generation
  • Non-RAG application (chatbot, code gen): generate_goldens_from_scratch -- no source material needed; guided by the task description
  • Small existing eval dataset: generate_goldens_from_goldens -- augments and diversifies existing test cases
  • Mixed (some docs plus manual cases): combine methods -- use from_docs for document coverage and from_goldens to augment edge cases

Practical Considerations

Manual inspection is essential: Synthetic data generation is not a fire-and-forget process. Always review a sample of generated goldens before using them for evaluation. Common issues include:

  • Inputs that are too vague or too specific to be realistic
  • Expected outputs that contradict the source context
  • Evolution producing unnaturally complex or convoluted inputs
  • Context grouping that combines unrelated topics

Cost management: Each golden requires multiple LLM calls (generation, filtration scoring, evolution, styling). With cost_tracking=True, monitor spend. Reduce costs by lowering max_quality_retries, using fewer evolution steps, or using a cheaper model for the critic_model while keeping a stronger model for generation.
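
One possible cost-trimming configuration along those lines (the model names here are illustrative assumptions, not recommendations from the library):

python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig

synthesizer = Synthesizer(
    model="gpt-4.1",  # stronger model for generation
    filtration_config=FiltrationConfig(
        critic_model="gpt-4.1-mini",  # cheaper model for quality scoring
        max_quality_retries=2,        # fewer regeneration attempts
    ),
    cost_tracking=True,  # print spend as generation runs
)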

OpenAI API key: The default embedder (text-embedding-3-small) and model (gpt-4.1) require an OPENAI_API_KEY. For non-OpenAI setups, provide a custom DeepEvalBaseLLM for the model and a custom DeepEvalBaseEmbeddingModel for the embedder.
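
A minimal sketch of a custom model wrapper, assuming a hypothetical provider client with a complete() method (adapt to your SDK):

python
from deepeval.models import DeepEvalBaseLLM
from deepeval.synthesizer import Synthesizer

class MyCustomLLM(DeepEvalBaseLLM):
    """Sketch of a DeepEvalBaseLLM wrapper around a hypothetical client."""

    def __init__(self, client):
        self.client = client  # hypothetical provider SDK object

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        # `complete` is a stand-in for your provider's completion call
        return self.client.complete(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "my-custom-llm"

synthesizer = Synthesizer(model=MyCustomLLM(client))  # client: your SDK object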

Scaling: For large document corpora, increase max_concurrent and enable async_mode. For very large datasets (thousands of goldens), consider generating in batches and merging the results to avoid timeout issues.
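
A batching sketch for the contexts-based method (the batch size is an arbitrary choice, and contexts is assumed to be already prepared):

python
all_goldens = []
batch_size = 50  # arbitrary; tune to your rate limits and timeout budget

for i in range(0, len(contexts), batch_size):
    batch = contexts[i : i + batch_size]
    all_goldens.extend(
        synthesizer.generate_goldens_from_contexts(contexts=batch)
    )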

Reproducibility: The Synthesizer uses LLM-based generation, which is inherently non-deterministic. Running the same configuration twice will produce different goldens. For reproducible evaluation datasets, generate once, inspect, curate, and version the results rather than regenerating each time.
