โ† all lessons/๐Ÿš€ Phase 6 ยท Ship on Cloudflare/#73
Lesson 11 of 11 in Phase 6 · Ship on Cloudflare

AI for Code: Copilots, Code Review & Program Synthesis

Intermediate · ~25 min read
Recommended prerequisite: #72 Audio & Speech AI: ASR, TTS & Voice Agents

Code-specialized language models have evolved from academic curiosities to indispensable engineering tools, fundamentally altering how software is written, reviewed, and maintained. From GitHub Copilot's inline completions to SWE-bench-solving autonomous agents, AI for code represents one of the highest-impact applications of large language models. This article examines the training methodologies, architectural decisions, and production patterns that define the current state of code AI.

Training Code LLMs

Data Collection and Processing

The foundation of any code LLM is its training data. Major open datasets include:

  • The Stack v2 (Lozhkov et al., 2024): 67.5TB of deduplicated, permissively-licensed source code from GitHub across 600+ languages, built by the BigCode project
  • StarCoder training data: 783GB from GitHub, filtered for quality with near-deduplication
  • RedPajama-Code: Code subset of the RedPajama dataset for open LLM training

Data processing for code requires domain-specific filtering:

python
class CodeDataFilter:
    """Filters for training data quality"""

    def __init__(self):
        self.min_file_size = 100        # bytes
        self.max_file_size = 1_000_000  # 1MB
        self.max_line_length = 1000
        self.min_alphanum_ratio = 0.25
        self.dedup_index = NearDupIndex()  # MinHash/LSH index (assumed helper)

    def filter(self, code_file):
        # Remove auto-generated files
        if self.is_autogenerated(code_file):
            return False

        # Remove files with too many long lines (likely data/minified)
        long_lines = sum(1 for line in code_file.lines
                        if len(line) > self.max_line_length)
        if long_lines / max(len(code_file.lines), 1) > 0.1:
            return False

        # Remove low-quality files (mostly non-alphanumeric)
        alphanum = sum(c.isalnum() for c in code_file.content)
        if alphanum / max(len(code_file.content), 1) < self.min_alphanum_ratio:
            return False

        # Near-deduplication (MinHash/LSH)
        if self.dedup_index.is_near_duplicate(code_file):
            return False

        return True

    def is_autogenerated(self, code_file):
        markers = [
            "auto-generated", "generated by", "do not edit",
            "machine generated", "this file is generated",
        ]
        header = code_file.content[:500].lower()
        return any(marker in header for marker in markers)

Key Code LLM Families

CodeLlama (Roziere et al., 2023): Meta's code-specialized model family built on Llama 2. Available in 7B, 13B, 34B, and 70B parameter variants. Training involved continued pretraining on 500B tokens of code, followed by long-context fine-tuning (up to 100K tokens) and instruction tuning. At release, the 70B variant was among the strongest open models on HumanEval, though it still trailed GPT-4.

StarCoder / StarCoder2 (Li et al., 2023; Lozhkov et al., 2024): BigCode's open-source models trained on The Stack. StarCoder2-15B uses a 16K context window and was trained on over 4 trillion tokens. Notable for its transparent training process and permissive license. StarCoder2 introduced grouped query attention and sliding window attention.

DeepSeek-Coder (Guo et al., 2024): Available in 1.3B, 6.7B, and 33B sizes. Trained on 2T tokens comprising 87% code and 13% natural language. Achieves strong performance across multiple benchmarks. DeepSeek-Coder-V2 further improved with mixture-of-experts architecture. The later DeepSeek-V3 and DeepSeek-R1 models (late 2024/early 2025) demonstrated that general-purpose reasoning models with strong chain-of-thought capabilities can match or exceed dedicated code models on coding benchmarks, blurring the line between "code LLMs" and "reasoning LLMs" for software engineering tasks.

Qwen2.5-Coder (Yang et al., 2024): Alibaba's code model achieving state-of-the-art performance among open-source models on multiple benchmarks. Trained with careful data mixing strategies between code and natural language.

Fill-in-the-Middle Training

Standard left-to-right language modeling is suboptimal for code completion, where the model needs to generate code that fits between existing context (prefix and suffix). Fill-in-the-Middle (FIM) training (Bavarian et al., 2022) addresses this by restructuring training examples:

Original code:
  def factorial(n):
      if n <= 1:
          return 1
      return n * factorial(n - 1)

FIM transformation (PSM format - Prefix/Suffix/Middle):
  <fim_prefix>def factorial(n):
      if n <= 1:
  <fim_suffix>
      return n * factorial(n - 1)<fim_middle>        return 1

FIM transformation (SPM format - Suffix/Prefix/Middle):
  <fim_suffix>
      return n * factorial(n - 1)<fim_prefix>def factorial(n):
      if n <= 1:
  <fim_middle>        return 1

The model learns to generate the middle portion given the surrounding context. Key training decisions:

  • FIM rate: Typically 50-90% of training examples are FIM-transformed; the rest remain causal (left-to-right). This preserves strong generation capabilities while adding infilling ability.
  • PSM vs. SPM: In both formats the middle is generated last, so the model attends to the full surrounding context either way. SPM (suffix first) tends to perform slightly better in practice, likely because the prefix sits immediately adjacent to the span being generated; PSM preserves the natural reading order.
  • Character-level vs. token-level split points: Random character-level splits produce more diverse training examples but can split tokens, requiring careful handling.
python
import random

def fim_transform(code, fim_rate=0.5, psm_rate=0.5):
    """Apply Fill-in-the-Middle transformation to a code sample"""
    if random.random() > fim_rate:
        return code  # Keep as causal

    # Random split point
    split_point = random.randint(0, len(code))
    prefix = code[:split_point]
    suffix = code[split_point:]

    # Optionally split suffix further for middle extraction
    if len(suffix) > 0:
        middle_end = random.randint(0, len(suffix))
        middle = suffix[:middle_end]
        suffix = suffix[middle_end:]
    else:
        middle = ""

    if random.random() < psm_rate:
        # PSM format
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
    else:
        # SPM format
        return f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>{middle}"

Copilot Architectures

Inline Code Completion

The most widespread code AI application is inline completion (autocomplete on steroids). The architecture involves:

Editor State:
  - Current file content (prefix + cursor position + suffix)
  - Open files in workspace (context)
  - Recent edits (for intent inference)
  - Language/framework detection

              |
              v

  Context Assembly:
    - Prioritize: current file > imported files > recently edited > similar files
    - Budget: fit within model's context window (4K-32K tokens)
    - Strategy: snippets from relevant files, not full files

              |
              v

  Model Inference:
    - FIM query with prefix/suffix from current file
    - Generate multiple candidates (n=3-5)
    - Apply post-processing (bracket matching, indentation fixing)

              |
              v

  Ranking & Filtering:
    - Log probability scoring
    - Syntax validation (parse the completion)
    - Semantic filtering (type checking where possible)
    - De-duplication across candidates

              |
              v

  Display: Ghost text in editor, tab to accept
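The bracket-matching step in the pipeline above can be illustrated with a small truncation pass: cut the completion at the last offset where every bracket it opened has been closed, so accepted ghost text never leaves the buffer unbalanced. A minimal sketch of one plausible rule, not any particular product's implementation:

```python
def truncate_balanced(completion: str) -> str:
    """Truncate a completion at the last offset where every bracket
    opened inside the completion has been closed again."""
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    last_balanced = 0
    for i, ch in enumerate(completion):
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if stack and stack[-1] == pairs[ch]:
                stack.pop()
            else:
                # Closer matches a bracket opened *before* the completion:
                # stop here, the rest belongs to the surrounding file.
                break
        if not stack:
            last_balanced = i + 1
    return completion[:last_balanced]
```

A production system would additionally track string and comment state and reindent the truncated result.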

Context Window Management

Effective copilots must decide what context to include within a limited window. This is a retrieval problem:

python
class CopilotContextAssembler:
    def __init__(self, max_tokens=8192):
        self.max_tokens = max_tokens
        self.tokenizer = load_tokenizer()

    def assemble(self, cursor_position, current_file, workspace):
        budget = self.max_tokens
        context_parts = []

        # 1. Current file context (highest priority)
        prefix, suffix = self.split_at_cursor(current_file, cursor_position)
        # Keep more prefix than suffix (3:1 ratio)
        prefix_budget = int(budget * 0.45)
        suffix_budget = int(budget * 0.15)
        context_parts.append(("prefix", self.truncate(prefix, prefix_budget, keep="end")))
        context_parts.append(("suffix", self.truncate(suffix, suffix_budget, keep="start")))
        budget -= prefix_budget + suffix_budget

        # 2. Imported/referenced files
        imports = self.extract_imports(current_file)
        for imp in imports[:5]:
            file_content = workspace.get_file(imp)
            if file_content:
                # Extract relevant definitions (function signatures, types)
                definitions = self.extract_definitions(file_content)
                tokens_needed = self.count_tokens(definitions)
                if tokens_needed <= budget:
                    context_parts.append(("import", definitions))
                    budget -= tokens_needed

        # 3. Recently edited files (intent inference)
        for recent_file in workspace.recently_edited()[:3]:
            snippet = self.extract_relevant_snippet(recent_file, current_file)
            tokens_needed = self.count_tokens(snippet)
            if tokens_needed <= budget:
                context_parts.append(("recent", snippet))
                budget -= tokens_needed

        return self.format_context(context_parts)

Latency Requirements

Inline completion has strict latency requirements - suggestions appearing after more than 200-300ms feel laggy. Strategies to meet this:

  • Speculative decoding: Use a small draft model to generate candidates quickly, verified by a larger model
  • Caching: Cache model KV states for common prefixes
  • Debouncing: Don't trigger on every keystroke; wait for a brief pause (100-150ms)
  • Precomputation: Start inference speculatively when the user pauses, before they explicitly request a completion
  • Model size selection: Use 1-7B parameter models for completion (larger models for chat/review)
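Debouncing in particular is easy to get wrong (one trigger per keystroke, or a timer that never resets). A minimal asyncio sketch, where `CompletionDebouncer` and the `request_completion` callback are illustrative names rather than any product's API:

```python
import asyncio

class CompletionDebouncer:
    """Fire the completion request only after the user stops typing."""

    def __init__(self, request_completion, delay_s: float = 0.12):
        self.request_completion = request_completion  # async callback
        self.delay_s = delay_s
        self._pending: asyncio.Task | None = None

    def on_keystroke(self, editor_state):
        # Cancel any not-yet-fired request and restart the timer.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._fire(editor_state))

    async def _fire(self, editor_state):
        await asyncio.sleep(self.delay_s)  # wait for a pause in typing
        await self.request_completion(editor_state)
```

Rapid keystrokes keep cancelling the pending task, so only the state after the final pause reaches the model.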

AI Code Review Systems

Architecture

AI-powered code review goes beyond simple linting. Modern systems analyze pull requests holistically:

python
class AICodeReviewer:
    def __init__(self, llm_client, repo_context):
        self.llm = llm_client
        self.repo = repo_context

    async def review_pull_request(self, pr):
        diff = pr.get_diff()
        comments = []

        # 1. File-level analysis
        for file_diff in diff.files:
            # Get full file context (not just the diff)
            full_file = self.repo.get_file(file_diff.path, pr.head_sha)
            related_files = self.repo.get_related_files(file_diff.path)

            review = await self.llm.analyze(
                system=CODE_REVIEW_PROMPT,
                context={
                    "diff": file_diff.patch,
                    "full_file": full_file,
                    "related_files": related_files,
                    "pr_description": pr.description,
                    "file_history": self.repo.get_file_history(file_diff.path),
                },
                instructions="""
                Review this diff for:
                1. Logic errors and bugs
                2. Security vulnerabilities (injection, auth bypass, data exposure)
                3. Performance issues (N+1 queries, unnecessary allocations)
                4. API contract violations
                5. Missing error handling
                6. Test coverage gaps

                For each issue, provide:
                - Severity (critical/warning/suggestion)
                - Line number(s)
                - Explanation
                - Suggested fix
                """,
            )
            comments.extend(review.issues)

        # 2. Cross-file analysis
        architectural_review = await self.llm.analyze(
            system=ARCHITECTURE_REVIEW_PROMPT,
            context={"all_changes": diff, "repo_structure": self.repo.structure},
            instructions="Analyze cross-cutting concerns: consistency, patterns, dependencies",
        )
        comments.extend(architectural_review.issues)

        # 3. Filter and rank comments
        comments = self.deduplicate(comments)
        comments = self.filter_false_positives(comments)
        severity_rank = {"critical": 2, "warning": 1, "suggestion": 0}
        comments.sort(key=lambda c: severity_rank[c.severity], reverse=True)

        return comments

Challenges in AI Code Review

  • False positives: The most common complaint. Models may flag correct code as buggy. Calibration and confidence thresholds are essential.
  • Context limitations: Understanding code changes often requires understanding the entire codebase, not just the diff. Repository-level RAG helps.
  • Style vs. substance: Models should focus on logic bugs and security issues, not style preferences (which should be handled by formatters/linters).
  • Actionability: Every comment should include a concrete suggestion. "This could be improved" is useless without a specific recommendation.

Program Synthesis from Specifications

From Natural Language to Code

Program synthesis - generating programs from specifications - has evolved from formal methods to LLM-based approaches:

Specification types (from most to least formal):

  1. Formal specs (input/output types, pre/post conditions) - most reliable but hardest to write
  2. Test cases - concrete examples of desired behavior
  3. Natural language descriptions - most accessible but most ambiguous
  4. Existing code (refactoring/translation) - transformation of existing programs
python
# Test-driven synthesis: generate code that passes given tests
async def synthesize_from_tests(
    function_signature: str,
    test_cases: list[dict],
    llm_client,
    max_attempts: int = 5,
):
    """Generate a function implementation that passes all test cases"""

    for attempt in range(max_attempts):
        # Generate candidate implementation
        candidate = await llm_client.generate(
            prompt=f"""Write a Python function with this signature:
{function_signature}

It must pass these test cases:
{format_tests(test_cases)}

{"Previous attempt failed with: " + error if attempt > 0 else ""}
Return only the function implementation.""",
        )

        # Validate by running tests
        try:
            exec_result = run_tests_sandboxed(candidate, test_cases, timeout=10)
            if exec_result.all_passed:
                return candidate
            else:
                error = exec_result.failure_message
        except Exception as e:
            error = str(e)

    raise SynthesisError(f"Failed to synthesize after {max_attempts} attempts")

AlphaCode and Competition Programming

DeepMind's AlphaCode (Li et al., 2022) demonstrated that LLMs could solve competitive programming problems at a human-competitive level. Key innovations:

  1. Massive sampling: Generate ~1 million candidate solutions per problem
  2. Filtering: Remove solutions that fail example test cases (reduces to ~tens of thousands)
  3. Clustering: Group remaining solutions by behavior on generated test inputs
  4. Selection: Submit one solution per cluster (10 submissions allowed)

This brute-force approach highlighted that code correctness verification (running tests) is far easier than code generation, making generate-and-test a viable strategy when sampling is cheap.
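The filter-and-cluster stage can be sketched concretely: run every surviving candidate on a set of generated inputs, group candidates whose outputs agree, and submit one representative per cluster, largest clusters first. The `run` callable standing in for sandboxed execution is an assumption for illustration:

```python
from collections import defaultdict

def cluster_and_select(candidates, test_inputs, run, max_submissions=10):
    """Group candidates by behavior on test_inputs; return one
    representative per behavioral cluster, largest clusters first."""
    clusters = defaultdict(list)
    for cand in candidates:
        # Behavioral signature: the tuple of outputs on generated inputs.
        signature = tuple(run(cand, x) for x in test_inputs)
        clusters[signature].append(cand)
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:max_submissions]]
```

Two syntactically different programs that behave identically on the probe inputs land in the same cluster, which is exactly the redundancy AlphaCode's selection step exploits.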

Repository-Level Understanding

Beyond Single-File Context

Real-world code AI must understand entire repositories, not just individual files. This requires:

Repository indexing for semantic search:

python
class RepoIndex:
    """Index a repository for semantic code search"""

    def __init__(self, repo_path, embedding_model):
        self.repo_path = repo_path
        self.embedding_model = embedding_model
        self.chunks = []
        self.embeddings = []

    def build_index(self):
        for file_path in self.iter_source_files():
            # Parse into semantic chunks (functions, classes, modules)
            chunks = self.parse_chunks(file_path)
            for chunk in chunks:
                embedding = self.embedding_model.encode(
                    f"{chunk.file_path}:{chunk.name}\n{chunk.docstring}\n{chunk.signature}"
                )
                self.chunks.append(chunk)
                self.embeddings.append(embedding)

        self.index = build_faiss_index(self.embeddings)

    def search(self, query, top_k=10):
        query_embedding = self.embedding_model.encode(query)
        # FAISS expects a 2D float32 array of query vectors
        distances, indices = self.index.search(
            np.asarray([query_embedding], dtype="float32"), top_k)
        return [(self.chunks[i], distances[0][j])
                for j, i in enumerate(indices[0])]

    def parse_chunks(self, file_path):
        """Parse source file into semantic chunks using tree-sitter"""
        tree = self.parser.parse(read_file(file_path))
        chunks = []
        for node in self.walk_definitions(tree.root_node):
            chunks.append(CodeChunk(
                file_path=file_path,
                name=node.name,
                signature=self.extract_signature(node),
                docstring=self.extract_docstring(node),
                body=node.text,
                start_line=node.start_point[0],
                end_line=node.end_point[0],
            ))
        return chunks

Dependency graph construction: Understanding import relationships, call graphs, and type hierarchies across the codebase. Tools like tree-sitter provide language-agnostic AST parsing, while language servers (LSP) provide type information and go-to-definition capabilities.

Agentic repository exploration: SWE-bench-solving agents like SWE-agent (Yang et al., 2024) and Devin interact with repositories through tool use - reading files, searching code, running tests, and editing files iteratively. This mirrors how human developers work.

Evaluation Benchmarks

HumanEval and Variants

HumanEval (Chen et al., 2021) established the standard evaluation framework for code generation. It contains 164 hand-written Python programming problems with function signatures, docstrings, and test cases.

python
# Example HumanEval problem
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers
    closer to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Metrics:

  • pass@k: Probability that at least one of k generated samples passes all tests
  • pass@1: Most commonly reported; measures single-attempt success rate
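Computing pass@k naively (generate exactly k samples, check if any passes) has high variance. The standard approach from Chen et al. (2021) generates n ≥ k samples, counts the c that pass, and applies an unbiased estimator, averaged over problems:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n samples generated per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # 1 - P(all k sampled solutions are drawn from the n-c failures)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=200 samples of which c=20 pass, pass@1 estimates to 20/200 = 0.1.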

Limitations of HumanEval:

  • Only 164 problems (small evaluation set)
  • Only Python
  • Self-contained functions (no dependencies, no I/O)
  • Simple problems that don't test real-world engineering

SWE-bench

SWE-bench (Jimenez et al., 2024) represents a massive step up in evaluation complexity. It contains 2,294 real GitHub issues from 12 popular Python repositories (Django, scikit-learn, sympy, etc.) paired with the actual pull requests that resolved them.

The task: given the issue description and the repository state before the fix, generate a patch that resolves the issue and passes the repository's test suite.

SWE-bench performance (as of early 2025):
  - SWE-bench Lite (300 problems subset):
    - Claude 3.5 Sonnet + SWE-agent: ~49%
    - GPT-4o + SWE-agent: ~33%
    - DeepSeek-V3 + SWE-agent: ~42%
  - SWE-bench Verified (500 human-validated problems):
    - Best agent systems: ~55-65% (with top systems exceeding 50% by early 2025)
    - Best non-agentic: ~25-30%

The rapid progression on SWE-bench Verified is notable: agent systems crossed the 50% threshold in early 2025, meaning they can resolve more than half of real-world GitHub issues autonomously. This milestone suggests that AI coding agents are approaching practical utility for a meaningful fraction of routine software maintenance work. For a detailed treatment of how these agent systems are architecturally composed, see Article 29: Code Generation Agents.

SWE-bench tests capabilities that HumanEval misses entirely:

  • Repository navigation and understanding
  • Reading and understanding existing code
  • Debugging from error messages and stack traces
  • Making minimal, targeted edits
  • Understanding project conventions

Additional Benchmarks

| Benchmark | Focus | Size | Languages |
| --- | --- | --- | --- |
| MBPP | Basic Python problems | 974 | Python |
| HumanEval+ | HumanEval with stronger tests | 164 | Python |
| MultiPL-E | HumanEval translated | 164 | 18 languages |
| DS-1000 | Data science problems | 1000 | Python |
| ClassEval | Class-level generation | 100 | Python |
| CrossCodeEval | Cross-file completion | 9928 | 4 languages |
| LiveCodeBench | Contamination-free (new problems) | Growing | Python |
| Aider polyglot | Multi-language editing | 225 | 8 languages |

LiveCodeBench (Jain et al., 2024) is particularly important because it uses problems released after model training cutoffs, avoiding data contamination - a significant concern with HumanEval and MBPP.

Production Copilot Patterns

Multi-Model Architecture

Production copilot systems typically use multiple models for different tasks:

Fast completion model (1-7B params):
  - Inline autocomplete
  - Single-line suggestions
  - Latency: <200ms

Medium model (7-34B params):
  - Multi-line code generation
  - Function-level completion
  - Docstring generation
  - Latency: <1s

Large model (70B+ or API):
  - Code review comments
  - Bug explanation and fixes
  - Architecture suggestions
  - Complex refactoring
  - Latency: <10s (acceptable for chat interactions)
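Routing between these tiers is usually a simple policy keyed on interaction type and latency budget. A sketch in which the tier names, latency budgets, and task labels are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    max_latency_ms: int

# Hypothetical tier registry for illustration.
TIERS = {
    "fast":   ModelTier("completion-3b", 200),
    "medium": ModelTier("codegen-15b", 1_000),
    "large":  ModelTier("review-70b", 10_000),
}

ROUTES = {
    "inline_completion": "fast",
    "multiline_generation": "medium",
    "docstring": "medium",
    "code_review": "large",
    "refactor": "large",
}

def route(task: str) -> ModelTier:
    """Pick the tier for a task type; default to the medium tier."""
    return TIERS[ROUTES.get(task, "medium")]
```

The point of the indirection is that latency budgets, not raw capability, drive the choice: an inline completion that arrives in 2 seconds is useless even if it is better.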

Telemetry and Improvement

The feedback loop from user interactions is critical for improving copilot systems:

python
class CopilotTelemetry:
    def track_suggestion(self, suggestion_id, event_type, metadata):
        """
        Track suggestion lifecycle:
        - shown: suggestion displayed to user
        - accepted: user pressed tab/enter
        - partially_accepted: user accepted then immediately edited
        - rejected: user continued typing (implicit rejection)
        - undone: user accepted then immediately undid (Ctrl+Z)
        """
        self.emit({
            "suggestion_id": suggestion_id,
            "event": event_type,
            "language": metadata.language,
            "suggestion_length": metadata.token_count,
            "time_to_decision_ms": metadata.time_to_decision,
            "context_type": metadata.context_type,  # comment, function_body, etc.
        })

    def compute_metrics(self, time_range):
        events = self.query(time_range)
        shown = sum(1 for e in events if e["event"] == "shown")
        accepted = sum(1 for e in events if e["event"] == "accepted")
        undone = sum(1 for e in events if e["event"] == "undone")
        latencies = [e["time_to_decision_ms"] for e in events]
        return {
            "acceptance_rate": accepted / max(shown, 1),
            "retention_rate": (accepted - undone) / max(accepted, 1),
            "chars_saved_per_day": self.estimate_chars_saved(events),
            "p50_latency_ms": percentile(latencies, 50),
            "p99_latency_ms": percentile(latencies, 99),
        }

GitHub reports that Copilot suggestions are accepted roughly 30% of the time, with higher rates for boilerplate code and lower rates for complex logic. The acceptance rate varies significantly by language, file type, and position in the file.

Security Considerations

AI-generated code introduces specific security risks:

  • Insecure patterns: Models may generate code with SQL injection, XSS, or other vulnerabilities if trained on vulnerable code
  • Secret leakage: Models might reproduce API keys or secrets from training data
  • License compliance: Generated code may reproduce copyrighted or restrictively-licensed code
  • Supply chain attacks: Suggesting non-existent packages that an attacker could register (package hallucination)

Mitigations include post-generation security scanning (Semgrep, CodeQL), license detection, and package existence verification.

AI IDE Architectures

Inline tab completion was only the first generation of AI-assisted code editing. Modern AI-native IDEs have evolved into multi-modal systems that combine completion, conversational chat, and fully autonomous agent workflows within a single editor surface.

Interaction Modes

Tab completion remains the lowest-latency interaction: the editor sends a FIM prompt on every pause, and the model returns one or more ghost-text suggestions. Cursor, Windsurf, and GitHub Copilot all offer this mode, typically backed by small, fast models (1-7B parameters) optimized for sub-200ms response times.

Chat mode opens a conversational panel where the developer describes a task in natural language, and the model generates or edits code in response. The key architectural difference from tab completion is context assembly: chat interactions can afford higher latency (1-5 seconds), so the system can include more context - open files, project structure, terminal output, and previous conversation turns. Cursor's chat mode, for example, automatically indexes the workspace and retrieves relevant files using embedding-based search before generating a response.

Agent mode is the most recent evolution, where the IDE hands control to an autonomous agent that can plan multi-step changes, edit multiple files, run terminal commands, and iterate on errors - all within the editor. Cursor's Agent, Windsurf's Cascade, and Zed's AI assistant each implement this pattern differently, but the core architecture is similar:

Developer describes task (natural language)
        |
        v
  Agent plans approach (may ask clarifying questions)
        |
        v
  Loop:
    - Read/search relevant files
    - Generate edits across one or more files
    - Apply edits and show diff preview
    - Optionally run tests or build commands
    - Analyze output, fix errors if any
    - Repeat until task is complete or agent yields control
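The loop above reduces to a plain observe-think-act cycle. A minimal sketch in which `plan_next_action` stands in for the model call and `tools` maps tool names to callables (both names are illustrative assumptions, not any product's API):

```python
def run_agent(task, plan_next_action, tools, max_steps=20):
    """Minimal agent loop: the planner picks a tool, the tool's result
    is appended as an observation, until a 'done' action is emitted."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = plan_next_action(history)  # model call in a real system
        if action["tool"] == "done":
            return action.get("summary", ""), history
        observation = tools[action["tool"]](**action.get("args", {}))
        history.append((action["tool"], observation))
    return "max steps reached", history
```

Everything product-specific - diff previews, sandboxing, permission prompts - lives inside the tool implementations; the loop itself stays this simple.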

Context Window Management

The quality of AI-assisted editing depends heavily on what context the model sees. Modern AI IDEs use several strategies beyond the basic CopilotContextAssembler pattern described earlier:

  • AST-aware chunking: Rather than sending raw file content, the IDE parses the abstract syntax tree and sends function signatures, type definitions, and class hierarchies - the structural skeleton that conveys maximum semantic information per token. Tree-sitter provides language-agnostic parsing for this purpose.
  • Symbol resolution: When the cursor is inside a function call, the IDE resolves the called function's definition and includes its signature and docstring. This is similar to how language servers work, and in fact Zed integrates its AI features directly with LSP data.
  • Open-file prioritization: Files currently open in editor tabs receive higher priority than closed files in the workspace. The assumption is that open files are most relevant to the current task - a heuristic that holds well in practice.
  • Git diff context: Some systems include recent uncommitted changes as context, allowing the model to understand what the developer is currently working on and maintain consistency with in-progress modifications.

The fundamental tradeoff is coverage versus noise: including more context gives the model more information but risks diluting the signal with irrelevant code. The best systems use retrieval (embedding search, BM25, or graph-based traversal) to select context rather than dumping everything into the prompt. For a deeper look at the retrieval and planning architectures that underpin these systems, see Article 26: Agent Architectures.
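Lexical retrieval is often the simplest of these options. A self-contained BM25 scorer over candidate chunks, using the standard parameters k1=1.5 and b=0.75, looks roughly like:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[A-Za-z_]\w*", text.lower())

def bm25_rank(query: str, chunks: list[str], k1=1.5, b=0.75) -> list[int]:
    """Return chunk indices ordered by BM25 relevance to the query."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append((score, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

Real systems typically combine a lexical signal like this with embedding similarity, since identifier names often match lexically even when phrasing does not.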

Terminal-Based Code Agents

While AI IDEs embed intelligence into the editor, a parallel category of tools operates directly in the terminal, treating the entire development workflow - git, tests, builds, deployment - as their environment.

Claude Code

Claude Code (Anthropic, 2025) runs as a CLI agent that operates within the developer's terminal and file system. It reads and writes files, executes shell commands, interacts with git, and runs tests - all through an agentic loop where the model decides which tool to invoke next. Its architecture is notable for operating without a predefined plan: the model receives the developer's request along with available tools (file read/write, bash execution, search) and iterates until the task is complete or it needs human input.

Key design decisions include:

  • No sandbox by default: Claude Code operates directly on the developer's file system, enabling real git commits, real test runs, and real build processes. This contrasts with sandboxed approaches but requires a trust model where the developer reviews changes before accepting.
  • Full repository awareness: The agent can search the codebase, read arbitrary files, and understand project structure organically through exploration rather than pre-indexing.
  • Workflow integration: Because it runs in the terminal, it composes naturally with existing developer tools - CI pipelines, package managers, deployment scripts.

Aider

Aider (Gauthier, 2023) takes a different architectural approach: it maintains a "chat" with the LLM where the conversation includes a map of the entire repository structure. The developer adds specific files to the chat context, and Aider generates targeted edits using a structured diff format. Aider's design prioritizes precision over autonomy - it asks the developer to specify which files are relevant rather than searching the codebase itself.

Aider supports multiple LLM backends and has pioneered several practical innovations:

  • Unified diff format: A structured output format that enables reliable multi-file editing across different models
  • Git integration: Every change is automatically committed with a descriptive message, creating a clean history that can be reviewed or reverted
  • Linting and testing loops: After making edits, Aider can run linters and tests, feeding errors back to the model for automatic repair
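The diff-based editing idea can be illustrated with the standard library: render a model-proposed file rewrite as a unified diff so the developer (or a git commit message) can review exactly what changed. This is a sketch only; Aider's actual edit format differs in its details:

```python
import difflib

def propose_edit(path: str, original: str, updated: str) -> str:
    """Render a model-proposed file rewrite as a unified diff for review."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)
```

Emitting diffs rather than whole files keeps edits reviewable and makes it obvious when a model has silently rewritten unrelated code.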

OpenHands (formerly OpenDevin)

OpenHands provides a sandboxed runtime environment where an AI agent can write code, execute commands, browse the web, and interact with a full Linux environment. Its architecture leans toward maximum autonomy: the agent receives a task and works inside a Docker container, making changes and running validation until it considers the task complete.

These terminal-based agents share a common architecture pattern - the observe-think-act loop described in Article 26: Agent Architectures - but differ in their trust model, context management, and degree of autonomy. The trend is toward tighter integration with real development workflows: real git repositories, real test suites, real CI pipelines.

AI-Generated Tests

Test generation is one of the highest-value applications of code AI because tests have a built-in verification mechanism: they either pass or fail. This makes the generate-and-validate loop particularly effective.

Automated Test Generation

LLMs can generate tests from several starting points:

From implementation code: Given a function or class, the model generates unit tests that exercise its behavior. The most effective approach provides the model with the implementation, its type signatures, any existing tests as style examples, and instructions to cover edge cases:

python
async def generate_tests_for_function(
    function_source: str,
    existing_tests: str | None,
    llm_client,
) -> str:
    """Generate unit tests for a given function implementation"""
    style_hint = (
        f"Existing test style to follow:\n{existing_tests}\n" if existing_tests else ""
    )
    prompt = f"""Write comprehensive unit tests for this function:

```python
{function_source}
```

{style_hint}
Requirements:
- Test normal cases, edge cases, and error conditions
- Use descriptive test names that explain the scenario
- Include boundary values and empty/null inputs
- Test both expected outputs and expected exceptions
- Keep tests independent (no shared mutable state)"""
    return await llm_client.generate(prompt)

From specifications or docstrings: When the implementation does not yet exist (test-first development), the model generates tests from the function signature and documentation alone. This supports an AI-assisted TDD workflow: the developer writes a docstring, the model generates tests, and then either the developer or the model writes the implementation to make the tests pass.

**From bug reports**: Given an issue description, the model can generate a regression test that reproduces the bug before any fix is attempted. This is precisely what SWE-bench-solving agents do - and it serves as both a validation mechanism and a guard against future regressions.

### Coverage Improvement Strategies

AI-driven coverage improvement follows a systematic pattern:

1. **Analyze coverage reports**: Parse existing coverage data (from tools like `coverage.py`, `istanbul`, or `llvm-cov`) to identify uncovered lines, branches, and functions
2. **Prioritize by risk**: Focus test generation on uncovered code paths in critical modules - authentication, payment processing, data validation - rather than pursuing raw coverage numbers
3. **Generate targeted tests**: For each uncovered path, generate tests that specifically exercise that path, using the coverage tool to verify the new tests actually cover the intended lines
4. **Validate and deduplicate**: Run the generated tests to confirm they pass, then remove redundant tests that do not increase coverage
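Steps 1 and 2 can be sketched against the JSON report that `coverage.py` emits via `coverage json`; the `files` / `missing_lines` layout below follows that format, but treat the exact keys as an assumption to verify against your coverage tool's output:

```python
def prioritize_uncovered(report: dict, critical_prefixes: tuple) -> list:
    """Rank files for test generation: critical modules first,
    then by how many lines remain uncovered."""
    ranked = []
    for path, data in report["files"].items():
        missing = data.get("missing_lines", [])
        if not missing:
            continue                       # fully covered, nothing to generate
        is_critical = path.startswith(critical_prefixes)
        # sort key: critical files first, then most uncovered lines first
        ranked.append((not is_critical, -len(missing), path, missing))
    ranked.sort()
    return [(path, missing) for _, _, path, missing in ranked]

report = {"files": {
    "app/utils.py":    {"missing_lines": [10, 11, 12, 13]},
    "app/auth.py":     {"missing_lines": [42, 57]},
    "app/payments.py": {"missing_lines": []},
}}
targets = prioritize_uncovered(report, ("app/auth", "app/payments"))
print(targets)   # auth.py ranks first despite fewer missing lines: it is critical
```

Each `(path, missing_lines)` pair then becomes the target of a focused generation prompt, and re-running coverage verifies that the new tests actually hit those lines.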

The important caveat is that AI-generated tests can suffer from "tautological testing" - testing that the code does what it does, rather than testing that it does what it should. A test that merely asserts the current behavior of a buggy function will pass but provides no value. The most effective approaches combine AI generation with human review of test assertions, or use specification documents as the ground truth for expected behavior. For patterns on integrating AI-generated tests into continuous integration pipelines, see [Article 36: CI/CD for AI](/ci-cd-ai).

## Code Understanding and Explanation

Beyond generating and editing code, LLMs excel at reading and explaining existing code - a capability that has significant implications for developer onboarding, documentation, and codebase comprehension.

### Codebase Comprehension

Large codebases are notoriously difficult for new developers to understand. AI-assisted comprehension tools address this by letting developers ask questions about code in natural language:

- **"What does this function do?"**: The model reads the implementation and generates a plain-language explanation, including edge cases and side effects that may not be obvious from the function name alone
- **"How does authentication work in this project?"**: The agent searches for auth-related files, traces the request flow from middleware to handlers, and synthesizes an explanation of the overall architecture
- **"Why was this code written this way?"**: When paired with git history, the model can reference the commit messages and pull request descriptions that introduced a pattern, explaining the historical context behind a design decision

The effectiveness of these interactions depends on how much codebase context the model can access. RAG-based systems that index the repository (as described in the Repository-Level Understanding section above) perform significantly better than systems limited to the current file, because real-world questions often span multiple modules.
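A toy version of that retrieval, with bag-of-words overlap standing in for embedding similarity (production systems use dense embeddings and AST-aware chunks; the file names and contents here are invented):

```python
import re
from collections import Counter

def retrieve_context(question: str, repo_files: dict, k: int = 2) -> list:
    """Return the k files whose text overlaps most with the question.
    Toy lexical scoring; real systems use embedding similarity."""
    def tokens(text: str) -> Counter:
        return Counter(re.findall(r"[a-z_]+", text.lower()))
    q = tokens(question)
    def score(path: str) -> int:
        t = tokens(repo_files[path])
        return sum(min(count, t[word]) for word, count in q.items())
    return sorted(repo_files, key=score, reverse=True)[:k]

repo = {
    "auth/middleware.py": "# authentication middleware: checks session tokens\n",
    "billing/invoice.py": "# builds invoices for completed orders\n",
    "auth/handlers.py":   "# login and logout handlers for authentication\n",
}
print(retrieve_context("How does authentication work in this project?", repo))
```

The two `auth/` files surface first, which is the behavior that matters: the question-answering prompt is then assembled from the retrieved files, not from whatever file happens to be open.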

### Documentation Generation

AI-generated documentation fills a persistent gap in software engineering. Models can produce:

- **Inline docstrings**: Given a function implementation, generate a docstring describing parameters, return values, exceptions, and usage examples. This is most effective when the model has access to call sites, allowing it to document actual usage patterns rather than hypothetical ones.
- **API documentation**: For libraries and services, the model can generate endpoint descriptions, parameter tables, and example requests by analyzing route definitions and handler implementations.
- **Architecture documents**: By analyzing the dependency graph and module structure, the model can generate high-level architecture overviews that describe how components interact.

The reliability of AI-generated documentation varies. Docstrings and API docs tend to be accurate because the model can verify them against the implementation. Architecture documents require more judgment and are more prone to hallucination, particularly when the model lacks full codebase context.

### Onboarding Acceleration

The combination of code comprehension and documentation generation has a direct impact on developer onboarding. Organizations report that AI tools can reduce the time for a new developer to make their first meaningful contribution by providing:

- **Interactive codebase exploration**: Instead of reading documentation that may be outdated, new developers can ask questions about the actual current state of the code
- **Contextual explanations**: When reviewing a pull request or reading unfamiliar code, inline explanations help developers understand patterns and conventions specific to the project
- **Guided task completion**: For well-scoped onboarding tasks, an AI agent can walk the developer through the relevant files, explain the existing patterns, and suggest where to make changes

This capability is closely related to the code agent architectures discussed in [Article 29: Code Generation Agents](/code-agents), where agents must build an understanding of a repository before making changes. The same comprehension mechanisms that enable an agent to solve a GitHub issue also enable a developer to understand unfamiliar code.

## Summary and Key Takeaways

- **Code LLM training** requires careful data curation (deduplication, quality filtering, license compliance) and code-specific training objectives like Fill-in-the-Middle for completion tasks
- **Fill-in-the-Middle** training is essential for practical code completion; without it, models can only generate code left-to-right and cannot complete code within existing context
- **Production copilots** use multi-model architectures: small, fast models for inline completion and larger models for chat, review, and complex generation
- **Context assembly** is as important as model quality; effective retrieval of relevant code from the workspace determines suggestion quality more than model size
- **AI code review** must focus on high-severity issues (bugs, security) rather than style to maintain developer trust; false positive rate is the key metric to optimize
- **SWE-bench** represents the gold standard for evaluating real-world code AI capability, requiring repository navigation, debugging, and targeted editing - skills that HumanEval does not measure
- **Repository-level understanding** through semantic indexing, dependency graphs, and agentic exploration is the frontier of code AI, enabling systems that understand codebases holistically
- **Security and compliance** considerations are non-negotiable for production code AI: scan generated code for vulnerabilities, check for license issues, and verify package existence
- **AI IDE architectures** have evolved beyond tab completion into chat and autonomous agent modes, with context assembly strategies (AST-aware chunking, symbol resolution, git diff context) determining the quality ceiling for each interaction mode
- **Terminal-based code agents** like Claude Code, Aider, and OpenHands integrate with the full development workflow (git, tests, builds), operating with varying degrees of autonomy and sandboxing
- **AI-generated tests** offer high value because they have built-in verification (tests pass or fail), but must guard against tautological testing where assertions merely mirror current behavior rather than intended behavior
- **Code understanding** capabilities - codebase comprehension, documentation generation, onboarding acceleration - represent a force multiplier that complements code generation and may deliver equal or greater practical value in large engineering organizations
โ† PreviousAudio & Speech AI: ASR, TTS & Voice AgentsNext โ†’Fine-tuning Fundamentals: Full, Freeze & Transfer LearningUp next: ๐Ÿ”ง Appendix ยท Fine-tuning & Training