Aligning language models with human preferences has become the defining challenge of modern AI engineering, transforming base models that merely predict text into assistants that are helpful, harmless, and honest. This article provides a technical deep-dive into the RLHF pipeline (reward model training and PPO), Direct Preference Optimization (DPO), and emerging alternatives like ORPO and KTO, covering the mathematical foundations, practical implementation details, and the real-world challenges of preference data collection. Understanding these alignment techniques is essential for anyone building production language model systems.
Pre-trained language models are trained to predict the next token, not to be helpful. A model trained on internet text will cheerfully generate toxic content, hallucinate confidently, or produce verbose non-answers, because all of these patterns exist in its training data. Alignment is the process of steering model behavior toward human preferences without destroying the model's underlying capabilities.
The fundamental challenge is that "helpfulness" and "harmlessness" are not easily expressed as loss functions. We cannot write a differentiable objective that captures what makes a good response. Instead, we rely on human judgments: given two responses, which one is better? This preference signal, while noisy and subjective, turns out to be sufficient to dramatically improve model behavior.
The standard RLHF pipeline, as described in Ouyang et al. (2022) "Training language models to follow instructions with human feedback" (the InstructGPT paper), consists of three stages:
Before applying RLHF, the base model is fine-tuned on high-quality demonstration data. Human annotators write ideal responses to prompts, and the model is trained to imitate these demonstrations using standard cross-entropy loss.
This stage serves two purposes: it teaches the model the format and style of instruction following, and it produces the starting point that the later stages build on, both as the policy initialization and as the reference model for the KL constraint.
A reward model (RM) is trained to predict human preferences. The RM takes a prompt and response as input and outputs a scalar reward score.
Data collection: Human annotators are shown a prompt and multiple model responses (typically 4-9), then rank them from best to worst. These rankings are decomposed into pairwise comparisons: (prompt, chosen_response, rejected_response).
The reward model is trained using the Bradley-Terry preference model:
$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim D} [\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))]$$
where $y_w$ is the preferred (winning) response, $y_l$ is the rejected (losing) response, and $r_\phi$ is the reward model with parameters $\phi$.
```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class RewardModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1)  # Scalar reward per sequence

def reward_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    chosen_reward = reward_model(chosen_ids, chosen_mask)
    rejected_reward = reward_model(rejected_ids, rejected_mask)
    loss = -torch.log(torch.sigmoid(chosen_reward - rejected_reward)).mean()
    return loss, chosen_reward.mean(), rejected_reward.mean()
```
Reward model sizing: The RM is typically the same size or slightly smaller than the policy model. Using a much smaller RM risks reward hacking, where the policy exploits weaknesses in the RM's judgment. InstructGPT used a 6B reward model for a 175B policy.
Proximal Policy Optimization (Schulman et al., 2017) is used to optimize the language model policy against the reward model, with a KL divergence penalty to prevent the policy from diverging too far from the SFT model.
The objective is:
$$\max_\pi \; \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)} [r_\phi(x, y)] - \beta \cdot D_{KL}[\pi(\cdot|x) \,\|\, \pi_{ref}(\cdot|x)]$$
The KL penalty is critical: without it, the model learns to produce degenerate outputs that exploit the reward model (reward hacking). The reference policy $\pi_{ref}$ is typically the SFT model.
The PPO loop repeats four steps, as in the TRL-based example below:
1. Sample responses from the current policy for a batch of prompts.
2. Score each (prompt, response) pair with the reward model, applying a per-token KL penalty against the reference model.
3. Estimate advantages using the learned value head.
4. Update the policy for several epochs with the clipped PPO objective.
```python
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

# PPO requires a value head for advantage estimation
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")

ppo_config = PPOConfig(
    model_name="sft-model",
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,   # Initial KL penalty coefficient
    target_kl=6.0,      # Target KL divergence
    cliprange=0.2,      # PPO clipping range
    vf_coef=0.1,        # Value function coefficient
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=prompt_dataset,
)

for batch in ppo_trainer.dataloader:
    # Generate responses from the current policy
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=256)

    # Score with the reward model (schematic: in practice the query and response
    # are concatenated and re-tokenized before being passed to the reward model)
    rewards = [reward_model(q, r) for q, r in zip(query_tensors, response_tensors)]

    # PPO step: clipped policy update, value function update, and KL penalty
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```
PPO-based RLHF is notoriously difficult to implement and tune:
- It keeps four models in play during training (policy, reference, reward model, and the value head/critic), which is memory-intensive.
- Training is highly sensitive to hyperparameters such as the KL coefficient, clipping range, and learning rate.
- A KL penalty that is too weak invites reward hacking; one that is too strong prevents the policy from improving.
- Every optimization step requires fresh generation, making training slow and expensive.
These practical costs are a large part of what motivated simpler alternatives.
Rafailov et al. (2023) in "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" showed that the RLHF objective has a closed-form solution that maps the optimal policy directly to a function of the preference data, eliminating the need for explicit reward modeling and RL.
The key mathematical insight: given the standard RLHF objective, the optimal policy satisfies:
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r^*(x, y)\right)$$
This can be rearranged to express the reward in terms of the policy:
$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$
Substituting this into the Bradley-Terry preference model and noting that the partition function $Z(x)$ cancels, we get the DPO loss:
$$\mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$
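For intuition, the loss can be computed directly from the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The following is a minimal sketch (not the TRL implementation), assuming those per-sequence log-probabilities are already available:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss over per-sequence (summed) log-probabilities."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic (Bradley-Terry) loss on the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```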
DPO is remarkably simple to implement compared to PPO:
```python
from trl import DPOConfig, DPOTrainer

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,          # Very low LR for DPO
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                    # KL penalty strength
    loss_type="sigmoid",         # Standard DPO loss
    bf16=True,
    gradient_checkpointing=True,
    max_length=2048,
    max_prompt_length=1024,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,         # Or use implicit reference with peft
    args=dpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
trainer.train()
```
Hong et al. (2024) proposed ORPO, which eliminates the need for a separate reference model entirely by combining SFT and preference optimization into a single training stage.
ORPO adds an odds ratio-based penalty to the standard SFT loss:
$$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT}(y_w) + \lambda \cdot \mathcal{L}_{OR}$$
where the odds ratio loss penalizes the model for assigning higher odds to rejected responses than chosen responses:
$$\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w|x)}{\text{odds}_\theta(y_l|x)}\right)$$
The odds of a sequence are defined as $\text{odds}(y|x) = \frac{P(y|x)}{1 - P(y|x)}$.
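As a rough sketch of the odds-ratio term, assuming length-normalized sequence log-probabilities and hypothetical `log_p_chosen` / `log_p_rejected` inputs:

```python
import torch
import torch.nn.functional as F

def odds_ratio_loss(log_p_chosen, log_p_rejected):
    """ORPO-style odds-ratio penalty from average per-token log-probabilities."""
    # log odds(y|x) = log P(y|x) - log(1 - P(y|x))
    log_odds_chosen = log_p_chosen - torch.log1p(-torch.exp(log_p_chosen))
    log_odds_rejected = log_p_rejected - torch.log1p(-torch.exp(log_p_rejected))
    return -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

# Full ORPO objective: cross-entropy on the chosen response plus
# lambda * odds_ratio_loss(...), trained in a single stage without a reference model.
```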
ORPO's advantage is efficiency: it requires only one model and one training stage, combining instruction following and preference alignment. The authors showed competitive results with DPO while being simpler to implement.
Ethayarajh et al. (2024) introduced KTO, which aligns models using only binary feedback (good/bad) rather than paired preferences. This is significant because binary feedback is far easier to collect than pairwise comparisons.
KTO is grounded in prospect theory (Kahneman & Tversky, 1979), which describes how humans evaluate gains and losses asymmetrically. The KTO loss:
$$\mathcal{L}_{KTO} = \mathbb{E}_{(x,y) \in D_{desirable}} [\lambda_D \, \sigma(\beta(r_{ref} - r_\theta(x,y)))] + \mathbb{E}_{(x,y) \in D_{undesirable}} [\lambda_U \, \sigma(\beta(r_\theta(x,y) - r_{ref}))]$$
where $r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ is the implicit reward.
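A minimal sketch of the loss as written above, assuming the implicit rewards have already been computed as policy/reference log-ratios and that `ref_reward` stands in for the reference point $r_{ref}$:

```python
import torch

def kto_loss(rewards, is_desirable, ref_reward, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss over binary-labeled examples.

    rewards:      implicit rewards log(pi_theta / pi_ref), one per example
    is_desirable: boolean tensor marking thumbs-up examples
    ref_reward:   scalar reference point
    """
    desirable_term = lambda_d * torch.sigmoid(beta * (ref_reward - rewards))
    undesirable_term = lambda_u * torch.sigmoid(beta * (rewards - ref_reward))
    return torch.where(is_desirable, desirable_term, undesirable_term).mean()
```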
KTO can match DPO performance even without paired data, making it practical when you have thumbs-up/thumbs-down feedback but no side-by-side comparisons.
The quality of alignment depends critically on preference data quality. Several approaches exist:
Human annotation is the gold standard: annotators compare model outputs and select the better one. Key considerations include writing clear comparison guidelines, measuring and maintaining inter-annotator agreement, calibrating annotators on shared examples, and the substantial cost of collecting comparisons at scale.
A cheaper alternative is to use a stronger model (e.g., GPT-4, Claude) as a judge that generates preference data for training a weaker model:
```python
def generate_synthetic_preferences(prompts, model_to_evaluate, judge_model):
    """Generate preference pairs using a strong model as judge."""
    preferences = []
    for prompt in prompts:
        # Generate multiple responses from the model being trained
        responses = [model_to_evaluate.generate(prompt) for _ in range(4)]

        # Use a strong model to rank the responses
        ranking_prompt = f"""Rank these responses to: "{prompt}"

Response A: {responses[0]}
Response B: {responses[1]}
Response C: {responses[2]}
Response D: {responses[3]}

Rank from best to worst with reasoning."""
        ranking = judge_model.generate(ranking_prompt)

        # Parse the ranking and create pairwise comparisons
        pairs = create_pairwise_from_ranking(responses, ranking)
        preferences.extend(pairs)
    return preferences
```
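The helper `create_pairwise_from_ranking` is left unspecified above. A hypothetical implementation, assuming the judge's output has already been parsed into an ordered list of response indices (the parsing itself is the fragile part in practice), might look like:

```python
def create_pairwise_from_ranking(responses, ranked_indices):
    """Turn a best-to-worst ranking into (chosen, rejected) preference pairs."""
    pairs = []
    for better in range(len(ranked_indices)):
        for worse in range(better + 1, len(ranked_indices)):
            pairs.append({
                "chosen": responses[ranked_indices[better]],
                "rejected": responses[ranked_indices[worse]],
            })
    return pairs
```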
This approach, used in papers like "Self-Play Fine-Tuning" (SPIN) and "Constitutional AI" (Bai et al., 2022), is cheaper but introduces the judge model's biases.
Constitutional AI (Anthropic, 2022) replaces human preferences with AI-generated feedback based on a set of principles (a "constitution"). The model critiques its own outputs and selects the response that better adheres to the constitutional principles.
This approach scales more easily than human annotation but depends on the quality of the constitution and the judge model's ability to apply it consistently.
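Schematically, the AI-feedback step pairs two candidate responses with a principle from the constitution and asks a judge model which one follows it better. The sketch below is illustrative only; the prompt wording and the `judge_model.generate` interface are assumptions, not Anthropic's implementation:

```python
def constitutional_preference(prompt, response_a, response_b, principle, judge_model):
    """Ask a judge model which response better follows a constitutional principle."""
    judge_prompt = f"""Principle: {principle}

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response better follows the principle? Answer with "A" or "B"."""
    verdict = judge_model.generate(judge_prompt)
    # Crude parsing of the verdict; a real pipeline would be more robust
    chosen, rejected = (response_a, response_b) if "A" in verdict else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```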
Meng et al. (2024) introduced SimPO, which simplifies DPO further by removing the reference model entirely. SimPO uses the average log-probability of a sequence as an implicit reward, arguing that this length-normalized metric better reflects generation quality than DPO's raw log-ratio.
The SimPO reward is defined as:
$$r_{SimPO}(x, y) = \frac{1}{|y|} \log \pi_\theta(y|x) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t|x, y_{<t})$$
The length normalization is critical: without it, shorter responses are inherently favored because they accumulate fewer (typically negative) log-probability terms. SimPO also introduces a target reward margin $\gamma$ that explicitly separates winning and losing responses:
$$\mathcal{L}_{SimPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w|x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l|x) - \gamma\right)\right]$$
The margin $\gamma$ (typically 0.5-1.5) acts as a minimum reward gap, ensuring the model does not merely make the preferred response slightly better than the rejected one but maintains a meaningful quality difference. SimPO matches or exceeds DPO performance on benchmarks like AlpacaEval 2 and MT-Bench while requiring less memory (no reference model forward passes) and being straightforward to implement.
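A minimal sketch of the SimPO loss, assuming per-sequence summed log-probabilities and response token counts are available (note that no reference model appears anywhere):

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, chosen_lengths, rejected_logps, rejected_lengths,
               beta=2.0, gamma=1.0):
    """SimPO: margin between length-normalized log-probabilities, minus a target margin."""
    chosen_reward = beta * chosen_logps / chosen_lengths        # average log-prob, scaled
    rejected_reward = beta * rejected_logps / rejected_lengths
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```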
A key limitation of standard DPO is that it trains on a static, offline preference dataset -- typically generated by the SFT model before alignment begins. As the policy improves during training, this fixed dataset becomes increasingly off-policy: the preference pairs no longer reflect the kinds of outputs the current model would produce. This distribution mismatch can limit DPO's effectiveness and cause the policy to plateau early.
Online DPO addresses this by generating fresh preference data from the current policy at each training iteration. The procedure repeats a cycle:
1. Sample several responses from the current policy for each prompt.
2. Label the best and worst responses (using a reward model or an LLM judge) to form preference pairs.
3. Run a DPO update on these freshly generated, on-policy pairs.
4. Continue with the updated policy.
This on-policy data generation ensures the training signal remains relevant to the model's current behavior. Empirically, online DPO converges to stronger policies than offline DPO, particularly when the initial preference dataset is small or low-quality (Guo et al., 2024).
Iterative DPO (Yuan et al., 2024) extends this idea across multiple rounds. After each DPO training phase, the improved policy generates new candidate responses, which are then judged (often by an LLM-as-judge -- see Article 33) to produce the next round's preference dataset. Each iteration refines both the policy and the quality of the on-policy data:
$$\pi^{(t+1)} = \text{DPO}(\pi^{(t)}, \mathcal{D}^{(t)}) \quad \text{where} \quad \mathcal{D}^{(t)} \sim \pi^{(t)}$$
The trade-off is computational cost: online and iterative variants require generation at each step (the same bottleneck as PPO), partially negating DPO's efficiency advantage. In practice, performing 2-4 iterations with moderate-sized batches offers a favorable balance between data freshness and compute cost.
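Schematically, one round of the online/iterative loop looks like the sketch below, where `generate_candidates`, `judge`, and `train_dpo` are hypothetical stand-ins for a sampling routine, a reward model or LLM judge, and a DPO training run:

```python
def iterative_dpo(policy, prompts, generate_candidates, judge, train_dpo,
                  num_iterations=3, samples_per_prompt=4):
    """Iterative DPO: regenerate on-policy preference data each round."""
    for _ in range(num_iterations):
        preference_data = []
        for prompt in prompts:
            # 1. Sample candidates from the *current* policy (on-policy data)
            candidates = generate_candidates(policy, prompt, n=samples_per_prompt)
            # 2. Judge the candidates to pick a chosen/rejected pair
            chosen, rejected = judge(prompt, candidates)
            preference_data.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )
        # 3. Run a DPO training phase on the fresh pairs, then repeat
        policy = train_dpo(policy, preference_data)
    return policy
```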
Group Relative Policy Optimization (GRPO), introduced by Shao et al. (2024) and central to the training of DeepSeek-R1, represents a significant departure from preference-based alignment. Rather than learning from human preference comparisons, GRPO uses verifiable, outcome-based rewards to train reasoning capabilities -- making it one of the most impactful alignment developments for reasoning models.
The core idea is elegantly simple. For each prompt $x$, sample a group of $G$ responses $\{y_1, y_2, \ldots, y_G\}$ from the current policy. Compute a reward $r_i$ for each response using a verification function (e.g., checking mathematical correctness, code execution, or format compliance). Then normalize rewards within the group to compute advantages:
$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$
The policy is updated using a clipped objective similar to PPO, but without a learned value function (critic):
$$\mathcal{L}_{GRPO} = -\frac{1}{G} \sum_{i=1}^{G} \min\left(\frac{\pi_\theta(y_i|x)}{\pi_{old}(y_i|x)} \hat{A}_i, \; \text{clip}\left(\frac{\pi_\theta(y_i|x)}{\pi_{old}(y_i|x)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_i\right) + \beta \cdot D_{KL}[\pi_\theta \,\|\, \pi_{ref}]$$
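A minimal sketch of the sampling and advantage computation, assuming a hypothetical `verify` callable that returns a scalar reward for each sampled response (e.g., 1.0 for a correct final answer, 0.0 otherwise):

```python
import torch

def grpo_advantages(prompt, policy, verify, group_size=8):
    """Sample a group of responses and compute group-normalized advantages."""
    # Schematic sampling from the current policy
    responses = [policy.generate(prompt) for _ in range(group_size)]
    rewards = torch.tensor([verify(prompt, y) for y in responses], dtype=torch.float32)
    # Normalize within the group: no learned value function (critic) is needed
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return responses, advantages
```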
GRPO's advantages are substantial for reasoning tasks:
- No learned value function (critic) is required; the group mean serves as the baseline, substantially reducing memory and compute relative to PPO.
- Rewards come from verification rather than a learned reward model, removing reward-model training and the opportunity to hack it.
- Group-relative normalization gives a simple, low-variance advantage estimate from a handful of samples per prompt.
The limitation is clear: GRPO requires tasks with verifiable outcomes. It is not directly applicable to open-ended generation tasks like creative writing or nuanced instruction following, where "correctness" cannot be mechanically verified. In practice, production systems like DeepSeek-R1 combine GRPO for reasoning with preference-based methods (DPO or PPO) for general alignment and helpfulness.
Standard reward models in RLHF are Outcome Reward Models (ORMs): they assign a single scalar score to the entire completed response. This works well for short, straightforward tasks but provides sparse, uninformative signal for multi-step reasoning, where an error in step 3 of a 10-step proof renders all subsequent steps worthless despite potentially correct reasoning within those steps.
Process Reward Models (PRMs) address this by providing step-level supervision. A PRM assigns a reward to each intermediate reasoning step, enabling the model to receive credit for correct partial reasoning and targeted penalties for the specific step where an error occurs.
Lightman et al. (2023) demonstrated that PRMs substantially outperform ORMs for mathematical reasoning, particularly when used for best-of-N selection (generating multiple solutions and selecting the one with the highest process reward). The training procedure requires step-level annotations indicating whether each reasoning step is correct. The table below contrasts the two reward model types:
| Aspect | ORM | PRM |
|---|---|---|
| Supervision granularity | Entire response | Per reasoning step |
| Annotation cost | Low (binary correct/wrong) | High (label each step) |
| Signal density | Sparse | Dense |
| Best for | Short tasks, general alignment | Multi-step reasoning, math, code |
| Credit assignment | Ambiguous | Precise |
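To make the best-of-N usage concrete, the sketch below scores each candidate solution step by step with a PRM and keeps the solution whose weakest step scores highest; `prm_score_step` is a hypothetical per-step scorer, and taking the minimum over steps is one common aggregation choice:

```python
def best_of_n_with_prm(prompt, candidate_solutions, prm_score_step):
    """Select the candidate whose reasoning steps score highest under a PRM.

    candidate_solutions: list of solutions, each a list of reasoning-step strings
    prm_score_step:      hypothetical callable scoring a step given the prior steps
    """
    best_solution, best_score = None, float("-inf")
    for steps in candidate_solutions:
        step_scores = [prm_score_step(prompt, steps[:i + 1]) for i in range(len(steps))]
        solution_score = min(step_scores)  # a solution is only as good as its weakest step
        if solution_score > best_score:
            best_solution, best_score = steps, solution_score
    return best_solution
```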
PRMs are believed to be a key component in training frontier reasoning models like OpenAI's o1 and o3 series. The dense reward signal allows RL training to more efficiently learn correct reasoning patterns, because the model receives immediate feedback on where its reasoning goes wrong rather than only learning that the final answer was incorrect.
The primary challenge with PRMs is annotation cost: labeling individual reasoning steps requires expert annotators who can evaluate mathematical or logical correctness at each stage. Automated approaches to PRM training -- such as using Monte Carlo rollouts to estimate step-level correctness (Wang et al., 2024) -- are an active research direction that promises to make process supervision more practical. For a deeper discussion of dataset construction for alignment, see Article 22.
A compelling direction in alignment research is training models to improve themselves through self-play, reducing dependence on external human or AI feedback.
SPIN (Self-Play Fine-Tuning) by Chen et al. (2024) frames alignment as a two-player game. The current model policy plays against itself from the previous iteration: the "main player" learns to distinguish its own generations from ground-truth human responses, while the "opponent" is the previous iteration's policy. The objective trains the model to generate responses indistinguishable from (and eventually better than) high-quality reference data:
$$\mathcal{L}_{SPIN} = \mathbb{E}_{x \sim D} \left[ f\left(\lambda \log \frac{\pi_\theta(y_{real}|x)}{\pi_{\theta_t}(y_{real}|x)} - \lambda \log \frac{\pi_\theta(y_{synth}|x)}{\pi_{\theta_t}(y_{synth}|x)}\right) \right]$$
where $y_{real}$ is a human-written response and $y_{synth}$ is generated by the previous iteration $\pi_{\theta_t}$. SPIN converges when the model's generations are indistinguishable from human data -- at which point no further improvement is possible through this mechanism.
Self-Rewarding Language Models (Yuan et al., 2024) take a more ambitious approach. The model serves as its own reward model, using LLM-as-judge prompting (see Article 33) to evaluate and score its own outputs. Training proceeds iteratively:
1. The current model generates candidate responses to new prompts.
2. The same model, prompted as a judge, scores its own candidates (a scoring sketch follows below).
3. The highest- and lowest-scoring candidates form preference pairs, on which the model is trained with DPO.
4. The improved model starts the next iteration.
The key finding is that the model's ability to judge quality improves alongside its generation ability, creating a virtuous cycle. Meta's work showed that self-rewarding models can surpass models trained with static human preference data, suggesting that iterative self-improvement may be more effective than collecting ever-larger human annotation datasets.
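Schematically, the self-judging step can be a simple scoring prompt; the rubric wording and the parsing below are illustrative assumptions, not the paper's exact prompt:

```python
def llm_judge_score(model, prompt, response):
    """Hypothetical LLM-as-judge scoring: the model rates a response from 1 to 5."""
    judge_prompt = f"""Rate the response below from 1 to 5 for helpfulness, relevance,
and correctness. Reply with only the number.

Prompt: {prompt}
Response: {response}"""
    verdict = model.generate(judge_prompt)
    digits = [c for c in verdict if c.isdigit()]
    return int(digits[0]) if digits else 0  # crude parsing; a sketch, not production code
```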
Both approaches share a philosophical connection with Constitutional AI (see Article 43): the model's own capabilities are leveraged to drive alignment, with varying degrees of human-specified principles guiding the process. The SFT foundations underlying all these methods are covered in Article 19.
Alignment is not free. The "alignment tax" refers to the performance degradation on raw capability benchmarks that often accompanies alignment training: commonly reported symptoms include regressions on standard few-shot NLP benchmarks, degraded probability calibration, and a drift toward longer, more hedged answers.
Research suggests DPO has a lower alignment tax than PPO, likely because it stays closer to the reference policy. Careful calibration of the $\beta$ parameter (or KL coefficient for PPO) is the primary tool for managing this tradeoff.
| Method | Paired Data? | Reference Model? | Complexity | Best For |
|---|---|---|---|---|
| PPO | No (uses RM) | Yes + RM + Value | Very High | Maximum quality, large budgets |
| DPO | Yes | Yes | Low | Standard alignment, good static data |
| Online DPO | Generated per iteration | Yes + judge/RM | Medium | Overcoming off-policy data limits |
| SimPO | Yes | No | Very Low | Memory-constrained alignment |
| ORPO | Yes | No | Very Low | Single-stage training |
| KTO | No (binary) | Yes | Low | Binary feedback data |
| GRPO | No (verifiable tasks) | Yes (no critic) | Medium | Reasoning, math, code |