Serving large language models at production scale is fundamentally an inference optimization problem. While training a frontier model may cost hundreds of millions of dollars, the cumulative cost of inference (serving billions of requests across the model's lifetime) typically dwarfs training cost by an order of magnitude (see Article 39: Cost Optimization for the economic analysis). This article examines the core techniques that make LLM inference practical: KV cache management, prefix caching, quantization methods, speculative decoding, disaggregated serving, continuous batching, and attention optimization. Each technique addresses a different bottleneck in the inference pipeline, rooted in the transformer's attention mechanism and autoregressive decode loop covered in Article 01: Transformer Architecture, and understanding their interactions is essential for building efficient serving systems.
LLM inference proceeds in two distinct phases, each with different computational characteristics:
The prefill phase processes the entire input prompt in parallel. For a prompt of $n$ tokens, the model computes attention over all $n$ tokens simultaneously, populating the KV cache. This phase is compute-bound: it performs $O(n^2 d)$ operations for attention and $O(n d^2)$ for FFN layers, and modern GPUs have sufficient memory bandwidth to keep the ALUs busy.
The decode phase generates tokens one at a time, autoregressively. Each new token requires a full forward pass: the model reads all of its weights from HBM, attends over every cached key/value from the preceding tokens, and appends the new token's key/value projections to the KV cache.
This phase is memory-bandwidth-bound: for each generated token, the model must read all its parameters from memory (GPU HBM) but performs very little computation per parameter (just a matrix-vector multiplication, not a matrix-matrix multiplication). The arithmetic intensity (FLOPs per byte loaded) is extremely low, leaving the GPU's compute units mostly idle.
```python
# Arithmetic intensity comparison
#
# Prefill: matrix-matrix multiply (high intensity)
#   B = batch, N = seq_len, D = hidden_dim
#   FLOPs: B * N * D * D,  Bytes: D * D (weight) + B * N * D (input)
#   Intensity: ~N (scales with sequence length)
#
# Decode: matrix-vector multiply (low intensity)
#   FLOPs: B * 1 * D * D,  Bytes: D * D (weight) + B * 1 * D (input)
#   Intensity: ~B (scales with batch size only)
#   For B=1: intensity ~1, GPU utilization ~1%
```
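To make the comparison concrete, the sketch below gives a simplified roofline-style estimate for a single linear layer; the 2-FLOPs-per-multiply-add factor and fp16 byte counts are assumptions, and attention and other memory traffic are ignored.

```python
def linear_layer_intensity(batch: int, seq_len: int, hidden: int,
                           bytes_per_param: int = 2) -> float:
    """Rough arithmetic intensity (FLOPs per byte) of one D x D linear layer.

    Assumes fp16 weights/activations, counts 2 FLOPs per multiply-add,
    and ignores attention and all other memory traffic.
    """
    flops = 2 * batch * seq_len * hidden * hidden
    bytes_moved = (hidden * hidden + 2 * batch * seq_len * hidden) * bytes_per_param
    return flops / bytes_moved

D = 8192
print(f"prefill (B=1, N=4096): {linear_layer_intensity(1, 4096, D):7.1f} FLOPs/byte")
print(f"decode  (B=1,  N=1):   {linear_layer_intensity(1, 1, D):7.1f} FLOPs/byte")
print(f"decode  (B=64, N=1):   {linear_layer_intensity(64, 1, D):7.1f} FLOPs/byte")
```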
This memory-bandwidth bottleneck during decode is the central challenge of LLM inference optimization.
The KV cache stores the key and value projections for all previously processed tokens, avoiding redundant recomputation during autoregressive generation.
For a model with $L$ layers, $n_h$ attention heads, head dimension $d_h$, and sequence length $s$:
$$\text{KV cache size} = 2 \times L \times n_h \times d_h \times s \times \text{bytes per element}$$
For a model with Llama 2 70B's configuration ($L=80$, $n_h=8$ KV heads with GQA, $d_h=128$), extended to a 128K-token context in fp16:
$$2 \times 80 \times 8 \times 128 \times 131072 \times 2 \text{ bytes} \approx 43 \text{ GB}$$
At long contexts, the aggregate KV cache across a batch of concurrent requests can exceed the model weights in memory, making KV cache management the primary memory bottleneck for long-context inference.
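The back-of-the-envelope calculation generalizes to any model configuration; a minimal helper, using the Llama 2 70B figures from the example above:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_element: int = 2) -> int:
    """KV cache size for one sequence: 2 (K and V) x L x n_kv x d_h x s x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Llama 2 70B with GQA (80 layers, 8 KV heads, head_dim 128), fp16
for seq_len in (4_096, 32_768, 131_072):
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"{seq_len:>7} tokens: {gb:5.1f} GB per sequence")
```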
As discussed in Article 04: Model Architectures, GQA reduces the number of KV heads. Llama 2 70B uses 8 KV heads instead of 64, achieving an 8x reduction in KV cache size. This was the primary motivation for adopting GQA: the quality impact is minimal, but the inference memory savings are substantial.
The extreme case of GQA is MQA (Shazeer, 2019), where all query heads share a single key and single value head. While this maximally reduces KV cache size, it can degrade quality, especially for tasks requiring fine-grained attention patterns. GQA provides a tunable middle ground.
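At attention time, each group of query heads attends to its shared KV head. One common implementation simply repeats the cached K/V heads to match the query head count before the standard attention computation; a sketch of that expansion step (function name and shapes are illustrative, not taken from a specific library):

```python
import torch

def expand_kv_for_gqa(kv: torch.Tensor, num_query_heads: int) -> torch.Tensor:
    """Repeat KV heads so each query head has a matching K/V head.

    kv: (batch, num_kv_heads, seq_len, head_dim)
    returns: (batch, num_query_heads, seq_len, head_dim)
    """
    batch, num_kv_heads, seq_len, head_dim = kv.shape
    group_size = num_query_heads // num_kv_heads   # query heads per KV head
    return kv.repeat_interleave(group_size, dim=1)

k_cache = torch.randn(1, 8, 4096, 128)     # 8 KV heads (GQA)
k_full = expand_kv_for_gqa(k_cache, 64)    # matches 64 query heads
print(k_full.shape)                        # torch.Size([1, 64, 4096, 128])
```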
Kwon et al. (2023) introduced PagedAttention in the vLLM system, which revolutionized KV cache management by borrowing ideas from operating system virtual memory.
Naive KV cache management pre-allocates a contiguous block of GPU memory for each request's maximum possible sequence length. This leads to severe memory fragmentation: most requests finish well short of the maximum length, so the reserved slots sit unused (internal fragmentation), and the variable-sized contiguous allocations leave gaps between requests that cannot be reused (external fragmentation).
PagedAttention divides the KV cache into fixed-size pages (blocks of token slots, typically 16 tokens per block). Pages are allocated on demand as the sequence grows, similar to how OS virtual memory maps logical pages to physical frames:
```python
class PagedKVCache:
    """Simplified PagedAttention KV cache manager (block allocation only).

    A real manager also stores the per-block K/V tensors, sized by
    num_layers, num_heads, and head_dim; this sketch tracks only block ownership.
    """

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # token slots per block
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        # block_tables[request_id] = list of physical block indices, in order
        self.block_tables = {}

    def allocate_block(self, request_id):
        """Grab a free physical block and append it to the request's block table."""
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        block_idx = self.free_blocks.pop()
        self.block_tables.setdefault(request_id, []).append(block_idx)
        return block_idx

    def free_request(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        blocks = self.block_tables.pop(request_id, [])
        self.free_blocks.extend(blocks)
```
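A brief usage sketch, assuming the simplified class above: as a request generates tokens, a new block is allocated every `block_size` tokens, and all blocks return to the pool when the request finishes.

```python
cache = PagedKVCache(num_blocks=1024, block_size=16)

request_id = "req-42"
for token_position in range(40):                 # 40 generated tokens
    if token_position % cache.block_size == 0:   # crossed a block boundary
        cache.allocate_block(request_id)

print(cache.block_tables[request_id])   # 3 blocks cover 40 tokens (ceil(40/16))
cache.free_request(request_id)          # blocks return to the free pool
```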
This approach achieves near-100% memory utilization and enables memory sharing across sequences: parallel sampling and beam search candidates can share their prompt's blocks via copy-on-write, and memory is allocated only as a sequence actually grows into it.
vLLM with PagedAttention achieves 2-4x higher throughput than naive serving implementations, primarily by fitting more concurrent requests into GPU memory.
A large fraction of production LLM traffic shares common prompt prefixes: system prompts, few-shot examples, tool definitions, and RAG preambles. Computing the KV cache for these shared prefixes from scratch on every request is wasteful. Prefix caching eliminates this redundancy by reusing previously computed KV cache blocks across requests.
Zheng et al. (2024) introduced RadixAttention in the SGLang serving framework. The core idea is to organize the KV cache as a radix tree (a compressed trie) keyed by token sequences. When a new request arrives, the system performs a longest-prefix match against the radix tree. If the first $k$ tokens of the new request match an existing cached sequence, those $k$ positions of KV cache are reused directly, and the prefill phase only processes the remaining tokens.
```python
# Conceptual radix tree for KV cache reuse.
# Three requests share a common system prompt prefix:
#
# "You are a helpful assistant..."  (system prompt, 500 tokens)
# ├── "Summarize this article: ..."   (Request A)
# ├── "Translate to French: ..."      (Request B)
# └── "Write unit tests for..."       (Request C)
#
# Request A computes KV cache for all tokens (cache miss).
# Request B reuses the 500-token system prompt KV cache; only prefills
# the unique suffix.
# Request C shares the same prefix as A and B and reuses that KV cache.
```
The radix tree supports automatic prefix sharing with no manual annotation required. Unlike explicit caching APIs (discussed below), the serving system transparently identifies and reuses common prefixes across all concurrent requests. In workloads with high prefix overlap, such as LLM-as-judge evaluations, chat applications with fixed system prompts, and batch processing with shared instructions, RadixAttention achieves up to 5x throughput improvement. See Article 37: LLM Serving for SGLang's full serving architecture.
vLLM implements a related approach through automatic prefix caching (APC): each KV cache block is hashed by its token content, and blocks with matching hashes are shared via copy-on-write semantics (the same mechanism PagedAttention uses for beam search). When APC is enabled, requests that share a common prefix, even if they arrive minutes apart, skip prefill for the shared portion.
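A minimal sketch of the block-hashing idea (illustrative only; vLLM's actual implementation also tracks reference counts, eviction, and copy-on-write): each block's hash covers the full token prefix up to and including that block, so a hash hit means the entire prefix matches.

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block, keyed by the entire prefix up to that block."""
    hashes, prefix = [], []
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full_len, BLOCK_SIZE):
        prefix.extend(token_ids[i:i + BLOCK_SIZE])
        hashes.append(hashlib.sha256(repr(prefix).encode()).hexdigest())
    return hashes

def reusable_blocks(new_prompt: list[int], cache_index: dict[str, int]) -> int:
    """Count leading blocks of the new prompt already present in the cache."""
    count = 0
    for h in block_hashes(new_prompt):
        if h not in cache_index:
            break
        count += 1
    return count  # prefill can skip count * BLOCK_SIZE tokens

cached = {h: i for i, h in enumerate(block_hashes(list(range(512))))}  # cached 512-token prompt
print(reusable_blocks(list(range(512)) + [7, 8, 9], cached))           # 32 blocks (512 tokens) reused
```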
The performance impact is substantial for system-prompt-heavy workloads. A 2000-token system prompt at fp16 on a 70B model consumes roughly 0.65 GB of KV cache (per the formula above). Without prefix caching, 100 concurrent requests each allocate their own copy, roughly 65 GB for system prompts alone. With prefix caching, a single copy is shared, freeing memory for additional concurrent requests and directly improving throughput.
Prefix caching also reduces Time to First Token (TTFT): the cache-hit portion of the prompt skips the compute-bound prefill entirely, so a request with a 2000-token cached prefix and a 200-token unique suffix only prefills 200 tokens. For chat applications where multi-turn conversations accumulate context, this means each successive turn benefits from caching the entire conversation history up to that point.
Quantization reduces the numerical precision of model weights and/or activations, decreasing memory usage and often improving throughput. The challenge is maintaining quality while reducing precision.
The most common approach quantizes only the model weights, keeping activations in higher precision (fp16 or bf16). Since decode is memory-bandwidth-bound, reducing weight size directly increases throughput.
Frantar et al. (2023) introduced GPTQ, a one-shot post-training quantization method based on approximate second-order information: weights are quantized column by column, and after each column is rounded, the remaining unquantized weights are updated, using the inverse Hessian of the layer's inputs estimated from a small calibration set, to compensate for the rounding error.
GPTQ achieves 4-bit quantization with minimal quality loss for most models, reducing memory by 4x compared to fp16.
Lin et al. (2023) observed that not all weights are equally important: weights corresponding to large-magnitude activations have outsized impact on quality. AWQ scales weights by the activation magnitude before quantization:
```python
import torch

def awq_quantize(weight, activation_scale, group_size=128):
    """Simplified AWQ-style 4-bit weight quantization (per-group, symmetric)."""
    # Scale each input channel by its activation importance
    scaled_weight = weight * activation_scale.unsqueeze(0)
    quantized_groups, scales = [], []
    # Group quantization: quantize `group_size` input channels at a time
    for i in range(0, weight.shape[1], group_size):
        group = scaled_weight[:, i:i + group_size]
        scale = group.abs().max() / 7                       # 4-bit signed range [-8, 7]
        q = torch.round(group / scale).clamp(-8, 7).to(torch.int8)
        quantized_groups.append(q)
        scales.append(scale)
    # Store per-group scales and int weights; dequantize as q * scale
    return torch.cat(quantized_groups, dim=1), torch.stack(scales)
```
AWQ typically outperforms GPTQ at the same bit width and is faster to apply (no Hessian computation needed).
The GGUF format (used by llama.cpp) provides a range of quantization schemes optimized for CPU inference, spanning roughly 2 to 8 bits per weight (e.g., Q2_K, Q4_K_M, Q5_K_M, Q6_K, Q8_0), each trading file size against perplexity:
The k-quant methods (Gerganov, 2023) use a nested quantization structure where scale factors themselves are quantized, achieving better precision per bit than flat quantization.
More recent methods have narrowed the gap between 4-bit quantized and full-precision models to near-zero, even for smaller models.
QuIP# (Tseng et al., 2024) achieves high-quality 4-bit and even 2-bit quantization through two key ideas: (1) incoherence processing โ applying random orthogonal transformations to the weight matrix before quantization, which spreads information uniformly across all entries and eliminates outlier-sensitive columns; and (2) vector quantization using E8 lattice codebooks, which provides better rate-distortion tradeoffs than scalar quantization. QuIP# at 2 bits per parameter matches or exceeds GPTQ at 3 bits on perplexity benchmarks for 70B-class models.
AQLM (Additive Quantization for Language Models; Egiazarian et al., 2024) extends multi-codebook quantization to LLMs. Instead of quantizing individual scalars, AQLM quantizes groups of weights as vectors, using a sum of entries from multiple learned codebooks to approximate each weight group. This additive structure captures weight correlations that per-element quantization discards. At 2 bits per parameter, AQLM achieves notably better perplexity than GPTQ at the same bit budget, particularly on smaller models (7B-13B) where quantization error has a larger relative impact.
These methods demonstrate that the "4-bit quality wall" is not fundamental โ with sufficiently sophisticated quantization algorithms, even 2-bit weights can preserve model quality at scale.
The quality impact of quantization depends on model size. Larger models are more robust to quantization:
| Model Size | 8-bit Impact | 4-bit Impact | 3-bit Impact |
|---|---|---|---|
| 7B | Negligible | Minor (~1-2% degradation) | Significant |
| 13B | Negligible | Minimal (<1%) | Moderate |
| 70B | Negligible | Negligible | Minor |
This pattern holds because larger models have more redundancy: the same information is distributed across more parameters, providing resilience to individual weight perturbation.
Quantizing activations (in addition to weights) is more challenging because activation distributions have outliers: a small number of channels with very large values that make uniform quantization lossy.
Dettmers et al. (2022) with LLM.int8() showed that mixed-precision decomposition works: perform most matrix multiplications in int8 but identify and handle outlier channels (those with activation magnitudes above a threshold of about 6) in fp16. This adds overhead but enables int8 inference with negligible quality loss.
SmoothQuant (Xiao et al., 2023) takes a different approach: mathematically migrate the quantization difficulty from activations to weights by applying a per-channel scaling factor. Since weights are static, they can tolerate more aggressive quantization than dynamic activations.
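The migration is a per-channel rescaling applied offline: activations are divided by a smoothing factor and the matching weight columns are multiplied by it, leaving the layer's output mathematically unchanged. A sketch using the scaling rule from the SmoothQuant paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with migration strength $\alpha$ typically around 0.5:

```python
import torch

def smoothquant_scales(act_amax: torch.Tensor, weight: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j.

    act_amax: (in_features,) calibration max |activation| per channel
    weight:   (out_features, in_features)
    """
    weight_amax = weight.abs().amax(dim=0)       # max |W| per input channel
    return act_amax.pow(alpha) / weight_amax.pow(1 - alpha)

def apply_smoothing(x, weight, scales):
    """x @ weight.T is preserved: (x / s) @ (weight * s).T == x @ weight.T."""
    x_smoothed = x / scales        # easier to quantize: activation outliers shrink
    w_smoothed = weight * scales   # static weights absorb the difficulty
    return x_smoothed, w_smoothed
```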
While INT4/INT8 quantization requires post-training calibration and can introduce quality degradation, the NVIDIA H100 (and subsequent architectures) introduced native hardware support for FP8 (8-bit floating point), enabling a simpler path to 2x throughput improvement over FP16 with minimal quality loss.
FP8 comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits), which offers more precision over a narrower range, and E5M2 (5 exponent bits, 2 mantissa bits), which trades precision for a wider dynamic range and is mainly used for gradients during training.
For inference, E4M3 is the standard choice. The key advantage over INT8 is that FP8 preserves the floating-point representation: it handles the wide dynamic range of activations naturally, without the outlier problems that plague integer activation quantization. In practice, FP8 inference on H100 achieves nearly 2x the throughput of FP16 inference with perplexity degradation typically below 0.1%, making it the default precision for production serving on H100 hardware.
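A minimal per-tensor FP8 casting sketch, assuming a recent PyTorch build that exposes the `torch.float8_e4m3fn` dtype (production stacks handle scaling per tensor or per block and keep the scale alongside the quantized tensor):

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def to_fp8_e4m3(x: torch.Tensor):
    """Scale a tensor into the E4M3 range and cast; return (fp8 tensor, scale)."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantize back to fp16 for inspection (real kernels consume fp8 directly)."""
    return x_fp8.to(torch.float16) * scale
```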
MXFP formats (Microscaling Floating Point), standardized by the Open Compute Project, take this further by combining block-level scaling with narrow floating-point elements. MXFP4 uses 4-bit floating-point values with a shared 8-bit scale per block of 32 elements, providing FP-style dynamic range at INT4-class memory savings. Hardware support for MXFP formats is expected in next-generation accelerators, which may make FP4 inference practical without the quality penalties of INT4 quantization.
The broader trend is clear: the industry is moving from integer quantization (which requires careful calibration to handle activation outliers) toward narrow floating-point formats (which handle dynamic range natively). For current deployments, FP8 on H100 is the simplest high-impact optimization: it requires no calibration dataset and no per-layer tuning, and it halves memory traffic relative to FP16 with near-lossless quality.
Leviathan et al. (2023) and Chen et al. (2023) independently proposed speculative decoding, which accelerates autoregressive generation by using a small, fast draft model to propose multiple tokens that are then verified in parallel by the large target model.
```python
import random

def sample(probs):
    """Sample a token id from a probability distribution (list of floats)."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

def speculative_decode(target_model, draft_model, prompt, gamma=5, max_tokens=256):
    """Generate tokens using speculative decoding (simplified)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        # Step 1: draft model proposes gamma candidate tokens autoregressively
        draft_tokens, draft_probs = [], []
        for _ in range(gamma):
            p = draft_model.predict(tokens + draft_tokens)   # next-token distribution
            t = sample(p)
            draft_tokens.append(t)
            draft_probs.append(p)

        # Step 2: target model scores ALL gamma+1 positions in one forward pass,
        # returning the next-token distribution after each prefix:
        # tokens, tokens+d1, ..., tokens+d1..d_gamma
        target_probs = target_model.predict_batch(tokens + draft_tokens)

        # Step 3: accept/reject each draft token left to right
        accepted = 0
        for i in range(gamma):
            t = draft_tokens[i]
            # Accept with probability min(1, p_target(t) / p_draft(t))
            ratio = target_probs[i][t] / draft_probs[i][t]
            if random.random() < min(1.0, ratio):
                tokens.append(t)
                accepted += 1
            else:
                # Reject: resample from the adjusted distribution
                # max(0, p_target - p_draft), renormalized
                adjusted = [max(0.0, pt - pd)
                            for pt, pd in zip(target_probs[i], draft_probs[i])]
                total = sum(adjusted)
                tokens.append(sample([a / total for a in adjusted]))
                break
        # If every draft token was accepted, take one bonus token from the target
        if accepted == gamma:
            tokens.append(sample(target_probs[gamma]))
    return tokens
```
The acceptance-rejection scheme guarantees that the output distribution is exactly the same as the target model's distribution: speculative decoding introduces zero quality degradation. It is purely a latency optimization. Note that speculative decoding interacts with constrained decoding techniques (see Article 10: Structured Output): when output must conform to a grammar or JSON schema, the draft model's proposals can be further filtered by the grammar constraints, improving acceptance rates on structured output tasks.
The speedup depends on the acceptance rate, which depends on how well the draft model approximates the target model. In practice, well-matched draft models (a small model from the same family, or a lightweight draft head trained on the target's outputs, as in Medusa or EAGLE) reach acceptance rates high enough to deliver roughly 2-3x lower decode latency, while a poorly matched draft model can erase the gains, since every rejected token wastes both draft and verification work.
Traditional static batching waits until a batch of requests is assembled, processes them all together, and returns results. This is wasteful because different requests have different generation lengths: short requests finish early but must wait for long requests in the same batch.
Yu et al. (2022) introduced continuous batching (also called iteration-level scheduling) in the Orca system: the serving system manages a pool of in-progress requests and, at each iteration, processes all active requests together. When a request finishes, a new request can immediately take its slot.
```python
class ContinuousBatchScheduler:
    """Simplified continuous batching (iteration-level) scheduler."""

    def __init__(self, model, max_batch_size):
        self.model = model
        self.max_batch_size = max_batch_size
        self.active_requests = []
        self.waiting_queue = []

    def prefill(self, req):
        """Run the prompt through the model once to populate the request's KV cache."""
        req.kv_cache = self.model.prefill(req.prompt_tokens)

    def step(self):
        """One scheduler iteration: admit new requests, then decode one token each."""
        # Fill the batch with waiting requests while there is space
        while (len(self.active_requests) < self.max_batch_size
               and self.waiting_queue):
            req = self.waiting_queue.pop(0)
            self.prefill(req)
            self.active_requests.append(req)

        if not self.active_requests:
            return

        # Run one decode step for all active requests in a single batch
        next_tokens = self.model.decode_batch(
            [req.current_tokens for req in self.active_requests],
            [req.kv_cache for req in self.active_requests],
        )

        # Append tokens and retire finished requests so new ones can take their slots
        finished = []
        for req, token in zip(self.active_requests, next_tokens):
            req.append_token(token)
            if req.is_done():
                finished.append(req)
        for req in finished:
            self.active_requests.remove(req)
            req.complete()
```
Continuous batching increases GPU utilization from typically 30-50% to 70-90%+ and is now standard in all production serving systems (vLLM, TGI, TensorRT-LLM).
A refinement of continuous batching is chunked prefill: instead of processing the entire prompt of a new request in one step (which can cause latency spikes for long prompts), the prefill is split into chunks interleaved with decode steps for existing requests. This smooths latency at the cost of slightly slower prefill.
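A sketch of the chunking decision, assuming a fixed per-iteration token budget (the budget value and the request fields are illustrative): each scheduler step first reserves one token per active decode request, then spends whatever budget remains on a slice of the next waiting prompt.

```python
def plan_iteration(decode_requests, waiting_prompts, token_budget=512):
    """Return (num_decode_tokens, prefill_chunks) for one scheduler step."""
    # Every in-flight request gets exactly one decode token this iteration
    budget = token_budget - len(decode_requests)
    prefill_chunks = []
    for prompt in waiting_prompts:
        if budget <= 0:
            break
        remaining = prompt["num_tokens"] - prompt["prefilled"]
        chunk = min(remaining, budget)       # prefill only part of a long prompt
        prefill_chunks.append((prompt["id"], chunk))
        budget -= chunk
    return len(decode_requests), prefill_chunks

# 8 active decode requests plus a 4000-token prompt: the prompt is prefilled
# across several iterations (504 tokens per step here) instead of one big spike.
decodes, chunks = plan_iteration(
    decode_requests=list(range(8)),
    waiting_prompts=[{"id": "r9", "num_tokens": 4000, "prefilled": 0}],
)
print(decodes, chunks)   # 8 [('r9', 504)]
```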
As noted in the opening of this article, prefill and decode have fundamentally different computational profiles: prefill is compute-bound (high arithmetic intensity, benefits from FLOPS), while decode is memory-bandwidth-bound (low arithmetic intensity, benefits from memory bandwidth). Running both phases on the same GPU forces a compromise: the hardware cannot be simultaneously optimized for both workloads, and mixing prefill and decode in the same batch creates interference that degrades inter-token latency for in-flight requests.
Disaggregated serving addresses this by physically separating prefill and decode onto different hardware pools.
Patel et al. (2024) introduced Splitwise, and Zhong et al. (2024) independently proposed DistServe, both built on the same principle: route prefill requests to a prefill cluster and decode requests to a decode cluster, transferring the KV cache between them.
```
                 +-------------------+
New Request ---> |  Prefill Cluster  |  <- High-FLOPS GPUs (H100 SXM),
                 |  (compute-bound)  |     optimized for large matrix-matrix ops
                 +---------+---------+
                           | KV cache transfer (over NVLink/network)
                           v
                 +-------------------+
                 |  Decode Cluster   |  <- High-memory-bandwidth GPUs,
                 |  (memory-bound)   |     optimized for low-latency token gen
                 +---------+---------+
                           |
                           v
                 Token stream to client
```
The benefits are significant. Prefill GPUs can run at near-100% compute utilization without worrying about inter-token latency; they process one prompt after another at maximum throughput. Decode GPUs run without prefill interruptions, delivering consistent inter-token latency. DistServe reports 1.5-2x throughput improvement over co-located serving at the same latency SLOs, with particularly large gains when input prompts are long relative to outputs (common in RAG and summarization workloads).
The separation opens the door to heterogeneous hardware. Prefill benefits from raw FLOPS, so fewer, more powerful GPUs are ideal. Decode benefits from memory bandwidth per dollar, so more GPUs with high HBM bandwidth, even at lower compute capability, may be cost-optimal. In practice, operators might allocate H100 SXM nodes (high NVLink bandwidth, high FLOPS) for prefill and H100 PCIe or even L40S nodes (lower cost per GB/s of memory bandwidth) for decode.
The main challenge is KV cache transfer latency. For a 70B model at fp16 with a 4K-token prompt, the KV cache is roughly 1.3 GB (per the formula above). Over a 400 Gbps (50 GB/s) inter-node network, this transfer takes roughly 25-30 ms, acceptable for TTFT targets above 100 ms, but potentially problematic for ultra-low-latency applications. See Article 37: LLM Serving for the broader serving architecture context in which disaggregated serving operates.
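The same arithmetic as a small helper (defaults match the worked example; the network figure is the assumed effective bandwidth, not a measured number):

```python
def kv_transfer_ms(prompt_tokens: int, num_layers: int = 80, num_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2,
                   net_gbytes_per_s: float = 50.0) -> float:
    """Time to ship one request's prefilled KV cache to the decode cluster."""
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * prompt_tokens * bytes_per_elem
    return kv_bytes / (net_gbytes_per_s * 1e9) * 1e3

print(f"{kv_transfer_ms(4096):.1f} ms")   # ~26.8 ms for a 4K-token prompt
```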
Dao et al. (2022) designed Flash Attention primarily for training, but it is equally important for inference:
```python
# Flash Attention conceptual workflow (actual implementation is a fused CUDA kernel)
# Key insight: tile the attention computation so working tiles fit in on-chip SRAM
#
# Instead of:
#   S = Q @ K.T       # n x n matrix materialized in HBM (expensive)
#   P = softmax(S)    # n x n matrix in HBM
#   O = P @ V         # reads the n x n matrix back from HBM
#
# Flash Attention:
#   For each tile of Q (fits in SRAM):
#       For each tile of K, V (fits in SRAM):
#           Compute local attention scores (in SRAM)
#           Update running softmax statistics (online softmax)
#           Accumulate the output tile (in SRAM)
#       Write the final output tile to HBM
#
# Total HBM reads/writes: O(n * d) instead of O(n^2)
```
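The correctness of the tiling rests on the online softmax trick: running maximum and normalizer statistics let each new tile's contribution be folded into the accumulated output without ever materializing the full n x n score matrix. A small single-query NumPy sketch (illustrative shapes, no scaling factor) that can be checked against the naive computation:

```python
import numpy as np

def naive_attention(q, K, V):
    scores = K @ q                       # (n,)
    p = np.exp(scores - scores.max())
    return (p / p.sum()) @ V             # (d_v,)

def online_attention(q, K, V, tile=64):
    """Process K/V in tiles, keeping only running max (m), normalizer (l), output (o)."""
    m, l, o = -np.inf, 0.0, np.zeros(V.shape[1])
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q            # local scores for this tile
        m_new = max(m, s.max())                  # updated running max
        correction = np.exp(m - m_new)           # rescale previous accumulators
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        o = o * correction + p @ V[start:start + tile]
        m = m_new
    return o / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=128), rng.normal(size=(1024, 128)), rng.normal(size=(1024, 64))
print(np.allclose(naive_attention(q, K, V), online_attention(q, K, V)))  # True
```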
For models too large to fit on a single GPU, tensor parallelism (TP) splits individual operations across multiple GPUs. Unlike pipeline parallelism (which splits by layers), TP splits the weight matrices within each layer:
```python
# Column-parallel: split the weight matrix by columns
#   GPU 0 gets W[:, :d//2], GPU 1 gets W[:, d//2:]
#   Each GPU computes a partial output; the partials are concatenated
#
# Row-parallel: split the weight matrix by rows
#   GPU 0 gets W[:d//2, :], GPU 1 gets W[d//2:, :]
#   Each GPU computes a partial output; the partials are summed (all-reduce)
```
For the attention layer, Q/K/V projections are column-parallel (split heads across GPUs), and the output projection is row-parallel. For FFN, the first linear layer is column-parallel and the second is row-parallel. This arrangement minimizes communication: only two all-reduce operations per transformer layer.
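A NumPy sketch verifying the two sharding patterns on a single machine (two "GPUs" simulated as array slices, the all-reduce as a plain sum, and the nonlinearity omitted): a column-parallel layer followed by a row-parallel layer reproduces the unsharded result with a single reduction at the end, which is exactly the FFN pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))            # activations
W1 = rng.normal(size=(d, 4 * d))       # FFN up-projection (column-parallel)
W2 = rng.normal(size=(4 * d, d))       # FFN down-projection (row-parallel)

# Column-parallel: each "GPU" holds half of W1's columns and produces half of h
h0 = x @ W1[:, : 2 * d]                # GPU 0 partial activations
h1 = x @ W1[:, 2 * d:]                 # GPU 1 partial activations

# Row-parallel: each "GPU" holds the matching half of W2's rows;
# summing the partial outputs is the all-reduce in a real deployment
y0 = h0 @ W2[: 2 * d, :]
y1 = h1 @ W2[2 * d:, :]
y = y0 + y1

print(np.allclose(y, x @ W1 @ W2))     # True: one all-reduce, no intermediate gather
```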
The all-reduce operations add latency proportional to the message size divided by the inter-GPU bandwidth. On NVLink (900 GB/s on H100), the overhead is modest. On PCIe (64 GB/s), it can dominate inference time, making TP across PCIe-connected GPUs inadvisable for latency-sensitive applications.
The prefix caching optimization described above also manifests as a user-facing API feature from major LLM providers. While the underlying mechanism is the same, reusing KV cache computations for repeated prompt prefixes, the API-level implementations expose this as a cost and latency optimization for end users.
Anthropic offers explicit prompt caching with cache_control markers in the message structure. The developer designates which portions of the prompt (typically the system prompt and few-shot examples) should be cached. Cached input tokens receive a 90% price discount, while the initial cache write incurs a 25% surcharge. The cache has a 5-minute TTL, reset on each cache hit. This explicit design gives developers precise control over what is cached.
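A sketch of the request shape using the Anthropic Python SDK; the fields follow Anthropic's prompt caching documentation at the time of writing, so treat this as illustrative and check the current API reference before relying on it.

```python
import anthropic

LONG_SYSTEM_PROMPT = "You are a helpful assistant..."   # the reusable prefix to cache

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},   # marks the cache breakpoint
        }
    ],
    messages=[{"role": "user", "content": "Summarize this article: ..."}],
)

# usage.cache_creation_input_tokens / usage.cache_read_input_tokens report how much
# of the prompt was written to or served from the cache.
print(response.usage)
```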
OpenAI implements automatic prompt caching for prompts longer than 1024 tokens with no code changes required. The system automatically detects repeated prefixes and caches them. Cached tokens are billed at 50% of the standard input rate. The API response includes a cached_tokens field, making cache hits observable.
Google (Gemini) provides context caching through an explicit API where you create a named cache object with a configurable TTL. Cached input tokens are discounted 75%, but there is a per-hour storage cost for maintaining the cache, making it best suited for high-volume workloads that amortize the storage overhead.
From an inference optimization perspective, prompt caching APIs provide two benefits: lower cost, since cached input tokens are billed at a steep discount, and lower TTFT, since the cached prefix skips prefill entirely on the provider's serving infrastructure.
The key architectural insight is that prompt caching is a natural extension of the KV cache reuse described in the prefix caching section above. API providers are effectively running RadixAttention or equivalent systems on their serving infrastructure and exposing the savings through pricing. For a detailed cost analysis and implementation patterns, see Article 39: Cost Optimization.
A production LLM serving system combines all these techniques:
```
Request Queue
      |
      v
+---------------+
|   Scheduler   |  <- Continuous batching + chunked prefill
|  (vLLM/TGI)   |     (Optional) Disaggregated prefill/decode routing
+---------------+
      |
      v
+---------------+
|   KV Cache    |  <- PagedAttention, GQA/MQA
|   Manager     |  <- Prefix caching (RadixAttention / APC)
+---------------+
      |
      v
+---------------+
|    Model      |  <- Quantized weights (AWQ/GPTQ/FP8)
|    Engine     |  <- Flash Attention
|               |  <- Tensor Parallelism
|               |  <- (Optional) Speculative decoding (EAGLE/Medusa)
+---------------+
      |
      v
Response Stream (token-by-token via SSE)
```