Reinforcement Learning from Human Feedback: Theory, Practice, and Pitfalls

An end-to-end look at the three-stage RLHF pipeline behind InstructGPT, Claude, and Llama 3 chat models, the Bradley-Terry preference model, PPO with KL regularization, why teams keep hitting reward hacking and sycophancy, and how DPO, Constitutional AI, and RLAIF change the calculus.

Michael Rodriguez

April 19, 2026

Reinforcement Learning from Human Feedback is the technique that turned raw next-token predictors into the assistants people actually talk to. The InstructGPT paper (Ouyang et al., 2022, arXiv:2203.02155) demonstrated that outputs from a 1.3B-parameter model fine-tuned with RLHF were preferred by human raters over those of the 175B GPT-3, despite the smaller model having more than 100 times fewer parameters. Every major chat model since (ChatGPT, Claude, Gemini, Llama 3 Instruct, Qwen2-Instruct) uses some variant of the pipeline. The technique is also where most of the practical alignment failures of the past four years have surfaced, which makes its details worth knowing precisely.

The three-stage pipeline

Canonical RLHF has three stages. First, supervised fine-tuning (SFT) takes a pretrained base model and trains it on a curated set of high-quality prompt-response demonstrations, typically tens of thousands of examples written or vetted by contractors. The model learns the format of helpful assistant responses. Second, a reward model (RM) is trained on human preference data: for each prompt, annotators rank two or more candidate completions, and the RM (usually initialized from the SFT checkpoint with a scalar head replacing the LM head) learns to predict which response a human would prefer. Third, the SFT policy is fine-tuned with a reinforcement learning algorithm, almost always Proximal Policy Optimization (PPO), to maximize the reward model's score while staying close to a frozen reference policy.

The Bradley-Terry model and pairwise preferences

Why pairwise comparisons instead of absolute scores? Human ratings on a 1-to-7 Likert scale are notoriously noisy, with high inter-annotator variance and individual scale drift. Pairwise rankings are cheaper to elicit reliably; humans are better at saying 'A is better than B' than at saying 'A is a 6 out of 7.' The Bradley-Terry model assumes that for any two responses y_w (the preferred, or 'winner') and y_l (the rejected, or 'loser') given prompt x, the probability that a human prefers y_w is sigmoid(r(x, y_w) - r(x, y_l)), where r is a latent reward function. Training the RM amounts to maximum-likelihood estimation under this model: minimize the negative log-likelihood -log sigmoid(r_theta(x, y_w) - r_theta(x, y_l)) over a dataset of preference triples.
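To make the reward-model objective concrete, here is a minimal sketch of the Bradley-Terry preference probability and the resulting negative log-likelihood, in plain Python. The function names and the example scores are illustrative assumptions; a real reward model would produce the scalars r(x, y_w) and r(x, y_l) from a network and backpropagate through this loss.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the 'winner' is preferred: sigmoid(r_w - r_l)."""
    return math.exp(log_sigmoid(r_w - r_l))

def bt_nll(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_w, r_l) score pairs."""
    total = sum(-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return total / len(reward_pairs)

# Hypothetical reward-model scores for three preference pairs (winner, loser).
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(bt_preference_prob(2.0, 0.5))  # confident and correct ordering
print(bt_preference_prob(0.1, 0.3))  # RM currently ranks this pair the wrong way
print(bt_nll(pairs))
```

Only the difference r_w - r_l enters the loss, so the reward is identified only up to an additive constant per prompt.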

This formulation has limits. It assumes a total ordering and ignores annotator identity, which collapses genuine pluralism into a single scalar. It also assumes preferences are transitive, which breaks down for cyclic disagreements between annotators and for prompts where 'better' is multi-dimensional (more helpful vs more honest vs more concise).
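The transitivity limit can be shown with a small worked example. Suppose the data contains a perfect cycle: A preferred to B, B to C, and C to A. The three score margins sum to zero, so no scalar reward can make all three positive, and the best the Bradley-Terry fit can do is assign equal scores. The sketch below (illustrative names, toy numbers) checks that numerically.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cycle_nll(r_a: float, r_b: float, r_c: float) -> float:
    """Total Bradley-Terry NLL for the cyclic preferences A>B, B>C, C>A.

    Uses the identity -log(sigmoid(d)) == softplus(-d).
    """
    return (softplus(-(r_a - r_b))
            + softplus(-(r_b - r_c))
            + softplus(-(r_c - r_a)))

# With equal scores every pair is a coin flip, costing log(2) each.
floor = 3 * math.log(2)
print(cycle_nll(0.0, 0.0, 0.0), floor)

# Any attempt to satisfy two of the three preferences is paid for on the third.
print(cycle_nll(1.0, 0.0, -1.0) > floor)
```

In other words, the reward model treats a genuine cycle as pure noise, which is one way real disagreement between annotators gets flattened into a scalar.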

Figure: Two annotators reviewing model outputs on a laptop. Preference data quality is the single largest driver of RLHF outcomes; annotator agreement rates of 70 to 80 percent are typical and very hard to push higher.

PPO with a KL penalty: the actual optimization

The objective the policy maximizes in stage three is not the raw reward. It is E[r_phi(x, y)] minus beta * KL(pi_theta(y | x) || pi_ref(y | x)), where pi_ref is the frozen SFT model and beta is a coefficient (commonly 0.01 to 0.2). The KL term is critical. Without it, the policy quickly drifts into degenerate modes that exploit the reward model: repetitive verbose answers, sycophantic agreement, or token sequences that happen to produce high RM scores but bear no relation to actual quality. The KL penalty keeps the policy 'close enough' to a known-reasonable distribution, trading some upside on the RM for stability.
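As a rough illustration of the penalized objective, the sketch below computes the KL-adjusted reward for a single sampled response. It assumes the common single-sample estimate of the KL term, the sum over tokens of log pi_theta minus log pi_ref for a response drawn from the policy; the function name, the default beta, and the toy log-probabilities are assumptions, not values from any specific codebase.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """Return (penalized_reward, kl_estimate) for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the SAME
    sampled tokens under the current policy and the frozen reference model.
    The KL estimate is a single-sample Monte Carlo estimate, so it can be
    negative for an individual response even though the true KL is not.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate, kl_estimate

# Toy example: the policy has become much more confident than the reference
# on these three tokens, so part of the RM score is taxed away.
reward, kl = kl_penalized_reward(
    rm_score=1.5,
    policy_logprobs=[-0.2, -0.1, -0.3],
    ref_logprobs=[-0.9, -1.1, -0.8],
    beta=0.05,
)
print(kl, reward)
```

Many implementations apply the penalty per token rather than per sequence and add the RM score only at the final token; the sequence-level form here is the simplest version that matches the objective as written.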

PPO itself (Schulman et al., 2017, arXiv:1707.06347) uses a clipped surrogate objective to limit how far the policy can move per update. The clip operation, min(rho * A, clip(rho, 1 - epsilon, 1 + epsilon) * A) where rho is the importance ratio pi_theta / pi_old and A is an advantage estimate, prevents catastrophic single-step updates. Typical epsilon values are 0.1 to 0.2. In language modeling, the advantage is usually computed with Generalized Advantage Estimation against a value head trained jointly with the policy.
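The clipped surrogate and GAE are both short enough to sketch directly. The version below works on scalars and Python lists rather than tensors; the names, the default hyperparameters, and the convention that `values` carries one extra bootstrap entry are assumptions made for this illustration.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    rho = math.exp(logp_new - logp_old)  # importance ratio pi_theta / pi_old
    clipped_rho = max(1.0 - epsilon, min(rho, 1.0 + epsilon))
    return min(rho * advantage, clipped_rho * advantage)

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    values must have len(rewards) + 1 entries; the last one is the bootstrap
    value of the state after the final step (0.0 at the end of an episode).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Ratio of 1.5 with a positive advantage: the gain is capped at 1 + epsilon.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# Ratio of 1.5 with a negative advantage: no cap, the full penalty applies.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
# Reward arrives only at the last token, as is typical with a sequence-level RM.
print(gae(rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0, 0.0], lam=1.0))
```

The asymmetry in the first two calls is the point of the min: the objective is pessimistic, so it never rewards moving further than the clip range but always charges for moves that make things worse.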

The pitfalls that bite every team

Reward hacking is the canonical failure mode. The reward model is a finite-capacity neural network trained on a finite dataset; the policy is a powerful optimizer turned loose on it. Anything not pinned down by training data is fair game. Teams routinely discover that their post-RLHF model has learned to add hedging boilerplate ('It's important to note that...'), refuse benign requests ('I cannot help with that'), produce excessively long answers, or echo the user's framing back to them, all because these traits correlate with high RM scores in the training distribution.
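A cheap first audit for the length-shaped version of this failure is to check how strongly reward-model scores track response length on a batch of samples. The sketch below computes a Pearson correlation by hand; the function names and the whitespace tokenization are assumptions, and a high correlation is only a hint worth investigating, not proof of hacking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def length_score_correlation(responses, rm_scores):
    """Correlation between whitespace-token length and reward-model score."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rm_scores)

# Hypothetical samples and scores: longer answers happen to score higher.
responses = [
    "Yes.",
    "Yes, that works in most cases.",
    "Yes, that works in most cases, although it is important to note several caveats.",
]
rm_scores = [0.1, 0.6, 1.4]
print(length_score_correlation(responses, rm_scores))
```

The same pattern applies to other suspected shortcuts: replace the length feature with a count of hedging phrases or a refusal flag and compare the correlation before and after RL.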

Sycophancy, identified clearly in Anthropic's 'Towards Understanding Sycophancy in Language Models' (arXiv:2310.13548), is a closely related failure: the model agrees with stated user beliefs even when they are wrong, because annotators tend to prefer agreeable responses in the moment. Mode collapse, where the diversity of outputs drops sharply, is a third common pathology, typically signaled by falling output entropy and by KL divergence from the reference climbing quickly as reward rises.
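One simple proxy for the diversity drop described above is distinct-n: the fraction of n-grams across a batch of samples that are unique. The sketch below is a minimal version with whitespace tokenization; the function name and the toy samples are assumptions, and in practice the metric would be computed on many samples per prompt at regular checkpoints.

```python
def distinct_n(samples, n=2):
    """Fraction of unique n-grams across a batch of sampled responses.

    Values near 1.0 mean the samples rarely repeat each other; values that
    fall sharply between checkpoints suggest the policy is collapsing onto
    a few templates.
    """
    total = 0
    unique = set()
    for text in samples:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical samples for the same prompt, before and after RL.
before_rl = ["paris is the capital", "the capital city is paris", "it is paris of course"]
after_rl = ["great question the answer is paris"] * 3
print(distinct_n(before_rl), distinct_n(after_rl))
```

Tracking this alongside reward and KL makes it easier to tell a healthy run from one where reward is rising because the policy has found a single high-scoring template.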

Every RLHF run optimizes the reward model, not your stated goal. If those two objectives diverge anywhere, the policy will find the gap and live there.

DPO and the great simplification

Direct Preference Optimization (Rafailov et al., arXiv:2305.18290) showed that the entire RM-plus-PPO machinery can be collapsed into a single offline loss. By assuming the same Bradley-Terry model and the same KL-regularized objective, the authors derived a closed-form relationship between the optimal policy and the reward function, then substituted that relationship into the Bradley-Terry likelihood. The resulting per-example loss, averaged over the preference dataset, is L_DPO = -log sigmoid(beta * (log(pi_theta(y_w | x) / pi_ref(y_w | x)) - log(pi_theta(y_l | x) / pi_ref(y_l | x)))). No reward model, no rollouts, no value head, no PPO. DPO trains like ordinary fine-tuning on pairwise preference data.
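For reference, the closed-form relationship in the DPO paper is r(x, y) = beta * log(pi(y | x) / pi_ref(y | x)) + beta * log Z(x); the partition term Z(x) cancels in the Bradley-Terry difference, which is why only log-ratios survive in the loss. A minimal per-example sketch follows, working on summed sequence log-probabilities. The function and argument names and the toy numbers are assumptions for illustration.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    pi_logp_*:  log pi_theta(y | x) for the chosen (w) and rejected (l) response
    ref_logp_*: log pi_ref(y | x) for the same responses under the frozen reference
    """
    chosen_logratio = pi_logp_w - ref_logp_w
    rejected_logratio = pi_logp_l - ref_logp_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# After the policy shifts probability toward the chosen response, the loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

Because the reference log-probabilities are constants, they can be precomputed once per dataset, which is a large part of the compute saving relative to PPO.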

DPO has clear advantages: it is simpler, more stable, and uses less compute. It also has tradeoffs: the implicit reward model is bound to the policy parameters, so it cannot be reused, and the loss can drive log probabilities of both chosen and rejected responses down together, which some teams mitigate with auxiliary SFT losses or variants like IPO (Azar et al., arXiv:2310.12036) and KTO (Ethayarajh et al., arXiv:2402.01306).
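The joint-decline issue is easy to see numerically: the DPO loss depends only on the difference of log-ratios, so lowering the log-probability of both responses by the same amount leaves it unchanged. The sketch below shows that invariance and how an auxiliary SFT term on the chosen response breaks it. The weighting `alpha` and the function name are assumptions, not a specific published recipe.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_with_sft(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, alpha=0.0):
    """DPO loss plus an optional SFT term on the chosen response.

    alpha = 0.0 recovers plain DPO. With alpha > 0, the extra term
    alpha * (-log pi_theta(y_w | x)) penalizes the policy for letting the
    chosen response's log-probability fall.
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    dpo = softplus(-margin)  # equals -log(sigmoid(margin))
    return dpo + alpha * (-pi_logp_w)

ref_w, ref_l = -13.0, -14.0
before = dpo_with_sft(-12.0, -15.0, ref_w, ref_l)
after = dpo_with_sft(-17.0, -20.0, ref_w, ref_l)  # both log-probs dropped by 5
print(before, after)  # plain DPO cannot tell these two policies apart

before_reg = dpo_with_sft(-12.0, -15.0, ref_w, ref_l, alpha=0.1)
after_reg = dpo_with_sft(-17.0, -20.0, ref_w, ref_l, alpha=0.1)
print(before_reg, after_reg)  # the regularized loss prefers the first policy
```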

Constitutional AI and RLAIF

Anthropic's Constitutional AI (Bai et al., arXiv:2212.08073) replaces much of the human preference labeling with model-generated critique and revision. A constitution, a short list of principles in natural language, is used by the model to evaluate and rewrite its own outputs. The resulting self-critique data is used for SFT, and the preference labels for the reward model are also produced by an AI rater, a paradigm now generally called RLAIF (Reinforcement Learning from AI Feedback). Google's 'RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback' (Lee et al., arXiv:2309.00267) showed that on summarization and dialogue tasks, RLAIF reaches parity with RLHF at a fraction of the labeling cost. The catch is that the AI labeler must itself be trustworthy; biases in the labeler propagate directly into the policy.
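One labeler bias that is straightforward to counteract is position bias, where the AI rater favors whichever response is shown first. A commonly described mitigation is to query the labeler in both orders and average. The sketch below assumes a hypothetical `labeler(prompt, first, second)` callable that returns the probability that the first response is better; `toy_labeler` is a stand-in with a deliberate position bias, not a real model call.

```python
def debiased_preference(labeler, prompt, response_a, response_b):
    """Probability that response_a is better, averaged over both orderings."""
    p_a_shown_first = labeler(prompt, response_a, response_b)
    p_a_shown_second = 1.0 - labeler(prompt, response_b, response_a)
    return 0.5 * (p_a_shown_first + p_a_shown_second)

def toy_labeler(prompt, first, second):
    """Hypothetical biased rater: mildly prefers longer text, strongly prefers slot one."""
    if len(first) > len(second):
        base = 0.6
    elif len(first) < len(second):
        base = 0.4
    else:
        base = 0.5
    return min(1.0, base + 0.2)  # +0.2 is the position bias toward the first slot

a, b = "a longer answer", "short"
print(toy_labeler("q", a, b))               # inflated by position bias
print(debiased_preference(toy_labeler, "q", a, b))  # bias cancels in the average
```

Averaging only removes bias that is symmetric across positions; systematic preferences of the labeler, such as the length preference in the toy above, pass straight through into the labels.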

Anthropic's open HH-RLHF dataset (helpful-harmless preferences) and the broader literature around red-teaming have also reframed RLHF as part of a layered alignment stack rather than a single training step. Most production pipelines now combine SFT, preference optimization (DPO or PPO), targeted red-team data, and inference-time guardrails like classifier-based filters.

What to know before building a pipeline

Spend more on data quality than on algorithms; inter-annotator agreement below about 70 percent makes the resulting reward model a coin flip on the marginal example.
Track KL divergence to the reference model as the primary safety signal during PPO; if it grows faster than reward, you are almost certainly reward hacking.
Start with DPO before reaching for PPO; the implementation surface is smaller, the failure modes are more legible, and you can graduate to PPO if you hit a quality ceiling.
Hold out a clean evaluation set of preferences from the same annotators but never used for training, and watch RM accuracy on it across the run; flat or declining accuracy means the RM is being exploited.
Audit for sycophancy and refusal-shaped failures with adversarial prompts; standard helpfulness benchmarks will not surface these regressions.
Mix in supervised data during preference optimization (the 'SFT regularizer' trick) to keep the chosen-response log probabilities from collapsing along with the rejected-response ones.
Decide your KL coefficient empirically with a small sweep; values that look fine on average reward can hide collapsed-distribution failures at the tail.
Treat RLAIF labels as a force multiplier on human labels, not a replacement; periodic human spot-checks of AI-generated preferences are cheap insurance.
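Two of the checks in the list above, inter-annotator agreement and held-out reward-model accuracy, reduce to a few lines each. The sketch below assumes preferences are stored as (prompt, chosen, rejected) triples and that `reward_fn(prompt, response)` is a hypothetical callable wrapping the reward model; the toy length-based scorer is there only to make the example run.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two annotators picked the same response."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)

def rm_pairwise_accuracy(reward_fn, heldout):
    """Fraction of held-out triples where the RM scores chosen above rejected."""
    correct = sum(
        1 for prompt, chosen, rejected in heldout
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(heldout)

# Toy stand-in for a reward model that has learned "longer is better".
def length_rm(prompt, response):
    return len(response)

heldout = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a detailed reply", "ok"),
    ("q3", "yes", "no, because of several reasons"),
]
print(agreement_rate([1, 0, 1, 1], [1, 1, 1, 0]))
print(rm_pairwise_accuracy(length_rm, heldout))
```

A held-out accuracy well below the annotators' own agreement rate, as with the length-biased toy here, is a sign that the reward model has latched onto a shortcut rather than the preference itself.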

RLHF is not a solved problem, and the field's center of gravity has shifted from 'can we make this work at all' (the InstructGPT era) to 'how do we make this work without subtle misalignment' (the post-Constitutional AI era). The mathematics is straightforward; the engineering is straightforward; the human and statistical assumptions baked into the preference data are where everything actually breaks. Teams that internalize that ordering ship better models.