The landscape of large language model architectures has diversified significantly since GPT-3 demonstrated that scaling decoder-only transformers yields powerful general-purpose language systems. While the decoder-only transformer remains the dominant paradigm (see Article 01: Transformer Architecture for foundational concepts), each major model family introduces architectural innovations: from Mixture-of-Experts routing in Mixtral to grouped-query attention in Llama 2 to multimodal fusion in Gemini. More recently, reasoning-focused architectures like OpenAI's o1/o3 and DeepSeek R1 have introduced test-time compute scaling as a new dimension of model design, while state-space models like Mamba challenge the transformer's monopoly on sequence modeling. This article provides a detailed comparative analysis of the architectural choices across leading LLM families, examining why specific design decisions were made and their implications for capability, efficiency, and deployment.
Before comparing individual architectures, it is worth noting the remarkable convergence. Every major frontier LLM as of 2025 (GPT-4, Claude, Llama 3, Gemini, Mistral, Qwen, DeepSeek) uses a decoder-only transformer architecture with causal (left-to-right) attention masking. This convergence was not inevitable; T5 (Raffel et al., 2020) showed competitive results with encoder-decoder architectures, and models like UL2 (Tay et al., 2022) explored hybrid approaches.
The decoder-only design won for several reinforcing reasons:
Within this consensus, however, significant architectural variation exists in the details.
Brown et al. (2020) established the GPT-3 architecture as a straightforward scaled-up transformer decoder:
GPT-3's architecture was deliberately conservative; the innovation was scale, not architecture. The alternating sparse attention (using banded patterns every other layer) was one of the few deviations from the vanilla transformer.
While OpenAI has not published full architectural details for GPT-4, substantial evidence from reverse-engineering efforts and leaked information reported by multiple outlets suggests GPT-4 uses a Mixture-of-Experts architecture:
The MoE design allows GPT-4 to have a very large total parameter count (capturing more knowledge) while keeping the per-token compute cost manageable: reportedly, only 2 of 8 experts are active for any given token.
Anthropic has published limited architectural details about Claude models. What is publicly known:
Anthropic's research publications emphasize interpretability (Elhage et al., 2022; Bills et al., 2023) and alignment methodology over architectural novelty, suggesting that Claude's advantages come primarily from training methodology and data rather than novel architecture.
Meta's Llama family is the most thoroughly documented major LLM architecture, with full technical reports and open weights enabling detailed analysis.
Touvron et al. (2023a) introduced several architectural modifications that have since become standard in the open-source ecosystem:
```python
import torch

class LlamaBlock(torch.nn.Module):
    """Simplified Llama transformer block."""

    def __init__(self, config):
        super().__init__()
        self.attention_norm = RMSNorm(config.dim)
        self.ffn_norm = RMSNorm(config.dim)
        self.attention = MultiHeadAttention(config)  # with RoPE
        self.ffn = SwiGLUFFN(config)

    def forward(self, x, freqs_cis, mask=None):
        # Pre-norm + residual for attention
        h = x + self.attention(self.attention_norm(x), freqs_cis, mask)
        # Pre-norm + residual for FFN
        out = h + self.ffn(self.ffn_norm(h))
        return out
```
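For completeness, here is a minimal sketch of the `RMSNorm` and `SwiGLUFFN` modules the block references. The `ffn_hidden_dim` config attribute is an assumed name; Llama sizes the hidden dimension at roughly 8/3 of the model dimension, rounded for hardware efficiency.

```python
import torch
import torch.nn.functional as F

class RMSNorm(torch.nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the activations,
    with a learned gain but (unlike LayerNorm) no mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFFN(torch.nn.Module):
    """Feed-forward network with a SiLU-gated linear unit."""
    def __init__(self, config):
        super().__init__()
        hidden = config.ffn_hidden_dim  # assumed attribute; ~(8/3) * dim in Llama
        self.w1 = torch.nn.Linear(config.dim, hidden, bias=False)  # gate projection
        self.w3 = torch.nn.Linear(config.dim, hidden, bias=False)  # up projection
        self.w2 = torch.nn.Linear(hidden, config.dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```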
The Llama 1 models were deliberately trained beyond Chinchilla-optimal token budgets, prioritizing inference efficiency over training efficiency: the 7B model was trained on 1T tokens (far more than the ~140B tokens a Chinchilla-optimal budget would prescribe) and the 65B on 1.4T tokens.
Touvron et al. (2023b) made targeted improvements:
```python
import torch
import torch.nn.functional as F

class GroupedQueryAttention(torch.nn.Module):
    def __init__(self, d_model, n_q_heads, n_kv_heads):
        super().__init__()
        self.n_q_heads = n_q_heads
        self.n_kv_heads = n_kv_heads
        self.n_groups = n_q_heads // n_kv_heads  # queries per KV head
        self.head_dim = d_model // n_q_heads
        self.W_q = torch.nn.Linear(d_model, n_q_heads * self.head_dim)
        self.W_k = torch.nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.W_v = torch.nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x, freqs_cis, mask=None):
        B, L, _ = x.shape
        Q = self.W_q(x).view(B, L, self.n_q_heads, self.head_dim)
        K = self.W_k(x).view(B, L, self.n_kv_heads, self.head_dim)
        V = self.W_v(x).view(B, L, self.n_kv_heads, self.head_dim)
        # (RoPE would be applied to Q and K here using freqs_cis)
        # Expand KV heads to match query heads
        K = K.repeat_interleave(self.n_groups, dim=2)
        V = V.repeat_interleave(self.n_groups, dim=2)
        # Standard scaled dot-product attention over (B, heads, L, head_dim)
        Q, K, V = (t.transpose(1, 2) for t in (Q, K, V))
        out = F.scaled_dot_product_attention(Q, K, V, attn_mask=mask,
                                             is_causal=mask is None)
        return self.W_o(out.transpose(1, 2).reshape(B, L, -1))
```
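The payoff is in the KV cache, which scales with the number of KV heads rather than query heads. A quick back-of-the-envelope illustration, using Llama 2 70B-like dimensions (64 query heads, 8 KV heads):

```python
# KV-cache size per layer, per sequence, in fp16 (2 bytes per value)
d_model, n_q_heads, n_kv_heads, seq_len = 8192, 64, 8, 4096
head_dim = d_model // n_q_heads
mha_bytes = 2 * seq_len * n_q_heads * head_dim * 2   # full multi-head: K and V
gqa_bytes = 2 * seq_len * n_kv_heads * head_dim * 2  # grouped-query: K and V
print(mha_bytes / gqa_bytes)  # 8.0 -> an 8x smaller cache
```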
Llama 3 made further refinements:
The Llama 3.1 405B model represents the largest open-weights dense model, trained on 15T+ tokens with extensive post-training (instruction tuning, RLHF, tool use training).
Google's Gemini family (Gemini Team, 2024) represents the most ambitious multimodal architecture:
Unlike GPT-4 and Claude, which added vision capability through adapter modules, Gemini was designed from the ground up as a multimodal model:
Gemini 1.5 Pro uses a sparse Mixture-of-Experts architecture, which Google has deep experience with from the Switch Transformer (Fedus et al., 2022) and GLaM (Du et al., 2022) lineage. The MoE approach allows the model to scale total parameters (and thus knowledge capacity) while keeping the active parameter count, and therefore per-token inference cost, manageable.
The 1M+ token context window in Gemini 1.5 requires innovations beyond standard RoPE scaling:
Mistral AI has introduced several architectural innovations focused on inference efficiency.
Jiang et al. (2023) introduced two key innovations:
Instead of attending to all previous tokens, each attention layer attends only to a fixed window of $W$ tokens (4096 in Mistral 7B). By stacking $L$ layers with window size $W$, the effective receptive field grows to $L \times W$, allowing information to propagate across the full sequence.
```python
import torch
import torch.nn.functional as F

def sliding_window_attention(Q, K, V, window_size):
    """Attention restricted to a causal local window of `window_size` tokens."""
    seq_len = Q.size(-2)
    # Boolean mask: True where attention is allowed, i.e. key j lies in the
    # window [i - window_size + 1, i] for query i (this also enforces causality)
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = (j <= i) & (j > i - window_size)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    scores = scores.masked_fill(~mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```
SWA reduces attention's memory and compute from $O(n^2)$ to $O(nW)$, making it linear in sequence length for fixed $W$.
With SWA, the KV cache needs to store only $W$ entries per layer rather than the full sequence, dramatically reducing memory usage during inference. This enables Mistral 7B to handle very long sequences with bounded memory.
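A minimal sketch of the idea, with illustrative names rather than Mistral's actual implementation: each new position is written at slot `pos % window_size`, overwriting the entry that just slid out of the window.

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache for sliding-window attention (hypothetical helper):
    absolute position i is stored at slot i % window_size."""
    def __init__(self, window_size, n_kv_heads, head_dim):
        self.window_size = window_size
        self.k = torch.zeros(window_size, n_kv_heads, head_dim)
        self.v = torch.zeros(window_size, n_kv_heads, head_dim)
        self.pos = 0  # absolute position of the next token

    def append(self, k_t, v_t):
        slot = self.pos % self.window_size
        self.k[slot], self.v[slot] = k_t, v_t
        self.pos += 1

    def window(self):
        """Return cached K/V in chronological order for the current window."""
        n = min(self.pos, self.window_size)
        start = self.pos % self.window_size if self.pos > self.window_size else 0
        idx = (torch.arange(n) + start) % self.window_size
        return self.k[idx], self.v[idx]
```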
Jiang et al. (2024) combined Mistral's architecture with Mixture-of-Experts:
The routing mechanism uses a learned gating network:
```python
import torch
import torch.nn.functional as F

class MoELayer(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.experts = torch.nn.ModuleList([
            SwiGLUFFN(config) for _ in range(config.num_experts)
        ])
        self.gate = torch.nn.Linear(config.dim, config.num_experts, bias=False)
        self.top_k = config.top_k  # typically 2

    def forward(self, x):
        gate_logits = self.gate(x)  # (B, L, num_experts)
        weights, selected = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the top-k only
        output = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Tokens that route to expert i in any of their top-k slots
            mask = (selected == i).any(dim=-1)
            if mask.any():
                expert_out = expert(x[mask])
                # Gating weight assigned to expert i at each selected token
                expert_weight = weights[mask][selected[mask] == i]
                output[mask] += expert_out * expert_weight.unsqueeze(-1)
        return output
```
A critical challenge with MoE is ensuring that tokens are distributed reasonably evenly across experts. If most tokens route to the same expert, the model degenerates to a dense model with wasted parameters. Fedus et al. (2022) introduced an auxiliary load-balancing loss:
$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot P_i$$
where $f_i$ is the fraction of tokens routed to expert $i$, $P_i$ is the average routing probability for expert $i$, and $\alpha$ is a small coefficient (typically 0.01).
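In code, $f_i$ and $P_i$ fall directly out of the router's outputs. A hedged sketch (the helper name is mine, and this generalizes to top-k routing; Fedus et al. define the loss for top-1):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, selected, num_experts, alpha=0.01):
    """Auxiliary load-balancing loss in the style of Fedus et al. (2022).
    gate_logits: (num_tokens, num_experts) router logits
    selected:    (num_tokens, top_k) chosen expert indices per token
    """
    probs = F.softmax(gate_logits, dim=-1)
    # P_i: mean routing probability assigned to expert i
    P = probs.mean(dim=0)
    # f_i: fraction of tokens whose top-k choices include expert i
    routed = (F.one_hot(selected, num_experts).sum(dim=1) > 0).float()
    f = routed.mean(dim=0)
    return alpha * num_experts * torch.sum(f * P)
```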
DeepSeek has introduced some of the most significant architectural innovations in the open-source LLM space, progressing from V2 through V3 and the reasoning-focused R1.
DeepSeek-V2 (DeepSeek-AI, 2024a) introduced MLA, which compresses the KV cache by projecting keys and values into a low-dimensional latent space before storing them. This achieves KV cache compression comparable to MQA while retaining the expressiveness of MHA:
DeepSeek uses a fine-grained MoE architecture with many small experts rather than fewer large ones. DeepSeek-V2 uses 160 small experts with top-6 routing (rather than Mixtral's 8 experts with top-2), achieving better expert specialization and utilization.
DeepSeek-V3 (DeepSeek-AI, 2024b) scaled the architecture to 671 billion total parameters with only 37 billion active per token, making it one of the most parameter-efficient frontier models:
```python
import torch

# Simplified illustration of MLA's KV compression
class MultiHeadLatentAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads, d_latent):
        super().__init__()
        self.head_dim = d_model // n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        # Compress KV into a low-dimensional latent
        self.W_kv_down = torch.nn.Linear(d_model, d_latent)
        # Decompress the latent back to full K and V
        self.W_k_up = torch.nn.Linear(d_latent, d_model)
        self.W_v_up = torch.nn.Linear(d_latent, d_model)

    def forward(self, x):
        Q = self.W_q(x)
        # Compress once; only the small latent goes into the KV cache
        kv_latent = self.W_kv_down(x)  # (B, L, d_latent)
        K = self.W_k_up(kv_latent)
        V = self.W_v_up(kv_latent)
        # ... standard multi-head attention with Q, K, V follows
```
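To make the saving concrete: with an illustrative $d_{model}$ of 4096 and $d_{latent}$ of 512, the cache holds 512 values per token per layer instead of the $2 \times 4096$ required for full K and V, a 16x reduction, at the cost of recomputing the up-projections at attention time. (DeepSeek's production MLA also routes RoPE through a separate decoupled key path, omitted in this sketch.)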
DeepSeek-R1 (DeepSeek-AI, 2025) applies reinforcement learning to elicit reasoning behavior from the V3 base model (see the Reasoning Architectures section below for full treatment).
Alibaba's Qwen family has emerged as a major open-source competitor to Meta's Llama, with particularly strong multilingual and code performance.
Qwen Team (2024a) introduced a range of models from 0.5B to 72B parameters, building on the Llama-style architecture with several refinements:
Qwen Team (2024b) made significant improvements, establishing Qwen2.5 as a direct competitor to Llama 3.1 across all size classes:
The architectural formula is largely convergent with Llama; the differentiation comes from training data quality (particularly for multilingual and code data), vocabulary design, and post-training methodology.
| Feature | GPT-4 | Claude 3.5 | Llama 3.1 | Gemini 1.5 | Mistral Large 2 | Mixtral | DeepSeek-V3 | Qwen2.5 72B |
|---|---|---|---|---|---|---|---|---|
| Architecture | Decoder-only | Decoder-only | Decoder-only | Decoder-only | Decoder-only | Decoder-only | Decoder-only | Decoder-only |
| MoE | Yes (rumored) | Unknown | No (dense) | Yes | No (dense) | Yes (8x) | Yes (256 experts) | No (dense) |
| Pos. Encoding | Unknown | Unknown | RoPE | RoPE variant | RoPE | RoPE | RoPE (YaRN) | RoPE (YaRN) |
| Attention | Unknown | Unknown | GQA | Unknown | GQA | GQA + SWA | MLA | GQA |
| Normalization | Unknown | Unknown | RMSNorm | Unknown | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Activation | Unknown | Unknown | SwiGLU | Unknown | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| Max Context | 128K | 200K | 128K | 1M-2M | 128K | 32K | 128K | 128K |
| Vocab Size | ~100K | ~100K | 128K | Unknown | 32K | 32K | ~100K | 152K |
| Total Params | ~1.8T | Unknown | 405B | Unknown | 123B | ~47B | 671B | 72B |
| Active Params | ~220B | Unknown | 405B | Unknown | 123B | ~13B | 37B | 72B |
The race to longer context windows has driven significant architectural innovation:
Several approaches extend RoPE beyond its training length:
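These include linear position interpolation (compressing positions by a scale factor so they fall within the pretrained range), NTK-aware scaling (adjusting the rotary base frequency), and YaRN (frequency-dependent interpolation combined with attention temperature scaling). A minimal sketch of the linear variant (the helper name is illustrative):

```python
import torch

def rope_freqs(head_dim, max_pos, base=10000.0, scale=1.0):
    """Rotary embedding angles; `scale` > 1 applies linear position
    interpolation, compressing positions by 1/scale so a longer context
    maps onto the position range seen during pretraining."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float() / scale
    return torch.outer(positions, inv_freq)  # (max_pos, head_dim // 2)

# Pretrained for 4K positions; serve 16K by interpolating with scale=4
angles = rope_freqs(head_dim=128, max_pos=16384, scale=4.0)
```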
Architecture alone does not determine model capability. Training methodology varies significantly:
A new class of models has emerged where architecture and training are co-designed to enable explicit multi-step reasoning at inference time. Rather than producing answers in a single forward pass, these models allocate additional test-time compute to "think through" problems, a paradigm shift with significant architectural implications (see Article 02: Scaling Laws for how test-time compute relates to traditional scaling).
OpenAI's o1 (OpenAI, 2024) introduced the concept of internal chain-of-thought at scale. While architectural details are proprietary, the key design principles are understood:
o3 (2025) extends this approach with improved reasoning efficiency and reliability, reportedly achieving expert-level performance on competition mathematics and doctoral-level science benchmarks.
DeepSeek-R1 (DeepSeek-AI, 2025) demonstrated that reasoning capabilities can emerge from pure reinforcement learning without supervised chain-of-thought data:
```python
# Simplified GRPO reward computation
def grpo_reward(prompt, policy_model, num_samples=16):
    """Generate a group of responses and compute group-relative advantages."""
    responses = [policy_model.generate(prompt) for _ in range(num_samples)]
    # Score each response (e.g., a correctness check for math problems);
    # verify_answer stands in for a task-specific verifier
    scores = [verify_answer(r) for r in responses]
    # Normalize rewards relative to the group: advantage = (r - mean) / std
    mean_score = sum(scores) / len(scores)
    std_score = (sum((s - mean_score) ** 2 for s in scores) / len(scores)) ** 0.5
    advantages = [(s - mean_score) / (std_score + 1e-8) for s in scores]
    return advantages
```
The architectural implication is notable: reasoning capability does not require a novel architecture. It can be trained into a standard transformer through RL post-training, though the model must be large enough to support the emergent reasoning behaviors.
Reasoning models change the compute calculus. A smaller model that spends 10x more inference tokens reasoning can outperform a larger model answering in a single pass. This shifts optimization priorities toward inference efficiency (see Article 05: Inference Optimization); fast token generation and efficient KV caching become even more critical when models routinely generate thousands of reasoning tokens per query.
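A rough back-of-the-envelope comparison using the standard estimate of ~2 FLOPs per active parameter per generated token (the model sizes and token counts here are purely illustrative):

```python
# Approximate decode FLOPs: ~2 * active_params per generated token
small = 2 * 7e9 * 10_000   # 7B reasoning model, 10K reasoning tokens
large = 2 * 70e9 * 1_000   # 70B model answering directly in 1K tokens
print(small / large)       # 1.0 -- the same total compute, spent differently
```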
While the transformer dominates, state-space models (SSMs) offer a fundamentally different approach to sequence modeling, one that replaces attention's quadratic complexity with linear scaling.
Gu and Dao (2023) introduced Mamba, building on the Structured State Space Sequence model (S4) lineage:
```python
import torch

# Conceptual selective SSM forward pass (sequential form; real Mamba
# implementations use a parallel scan with fused kernels)
def selective_ssm(x, A, B, C, delta):
    """
    x:     input sequence       (batch, L, D)
    A:     state transition     (batch, L, D, N) -- input-dependent in Mamba
    B, C:  input / output maps  (batch, L, N)
    delta: discretization step  (batch, L, D)
    """
    batch, L, D = x.shape
    N = A.size(-1)
    h = x.new_zeros(batch, D, N)  # hidden state: one N-dim state per channel
    outputs = []
    for t in range(L):
        # Discretize the continuous-time parameters (zero-order hold for A)
        A_bar = torch.exp(delta[:, t].unsqueeze(-1) * A[:, t])    # (batch, D, N)
        B_bar = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)  # (batch, D, N)
        # State update: a linear recurrence, so no quadratic attention
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)
        y = (C[:, t].unsqueeze(1) * h).sum(dim=-1)                # (batch, D)
        outputs.append(y)
    return torch.stack(outputs, dim=1)
```
On language modeling, Mamba-3B outperforms transformers of the same size and is reported to match transformers roughly twice its size, while being significantly faster at long-sequence inference.
AI21's Jamba (Lieber et al., 2024) demonstrated that the most practical approach may be combining architectures:
The hybrid approach addresses SSMs' key weakness: pure Mamba models underperform transformers on tasks requiring precise long-range retrieval (e.g., "find the specific fact mentioned 50K tokens ago"). The periodic attention layers provide this capability while Mamba layers handle the sequential flow efficiently.
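An illustrative sketch of such an interleaving; the ratio and period here are placeholders, and Lieber et al. (2024) describe Jamba's actual attention-to-Mamba ratio and MoE placement:

```python
# Illustrative Jamba-style layer stack: mostly Mamba layers, with a
# periodic attention layer for precise long-range retrieval
def build_hybrid_stack(n_layers, attention_period=8):
    layers = []
    for i in range(n_layers):
        if i % attention_period == attention_period - 1:
            layers.append("attention")  # precise retrieval over the full context
        else:
            layers.append("mamba")      # efficient linear-time sequence mixing
    return layers

print(build_hybrid_stack(16))  # ['mamba', ..., 'attention', 'mamba', ...]
```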
Peng et al. (2023) developed RWKV (Receptance Weighted Key Value), which reformulates attention as a linear recurrence:
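A heavily simplified, unstabilized sketch of the core WKV idea: an exponentially decaying running numerator and denominator replace the attention matrix. Here `w` is a per-channel decay and `u` a bonus for the current token; RWKV's token-shift, receptance gating, and numerical-stability machinery are omitted.

```python
import torch

def wkv_recurrence(k, v, w, u):
    """Attention-like weighted average of past values, computed as a
    linear recurrence over time (per channel, no quadratic cost).
    k, v: (B, L, D) keys and values;  w, u: (D,) decay and bonus."""
    B, L, D = k.shape
    num = torch.zeros(B, D)  # running weighted sum of values
    den = torch.zeros(B, D)  # running sum of weights
    out = []
    for t in range(L):
        kt, vt = k[:, t], v[:, t]
        # Current token receives a "bonus" weight e^{u + k_t}
        out.append((num + torch.exp(u + kt) * vt) / (den + torch.exp(u + kt)))
        # Decay the past state and absorb the current token
        num = torch.exp(-w) * num + torch.exp(kt) * vt
        den = torch.exp(-w) * den + torch.exp(kt)
    return torch.stack(out, dim=1)
```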
State-space models are preferred in specific scenarios:
For most general-purpose language tasks at moderate sequence lengths, transformers remain superior due to their stronger in-context learning and retrieval capabilities. The hybrid approach (Jamba) may represent the practical middle ground.
A parallel trend to frontier scaling is the development of highly capable small models (1B-7B parameters), optimized for on-device deployment, low-latency serving, and cost-efficient inference.
Microsoft's Phi family demonstrates that data quality can partially substitute for model size:
Gemma Team (2024) introduced architectural optimizations specifically targeting the small-model regime:
Meta's Llama 3.2 (Meta AI, 2024) includes 1B and 3B parameter models designed explicitly for edge deployment:
Qwen2.5's 0.5B, 1.5B, and 3B models round out the small-model landscape:
The small model landscape reveals an important insight: below ~7B parameters, training data quality and distillation methodology matter more than architectural innovations. All competitive small models use essentially the same Llama-derived architecture. The differentiation comes from: