All Things RLHF: How We Actually Teach AI to Care What Humans Think
- Tushar Prasad

- Apr 19
- 20 min read
"ChatGPT feels different."
I heard that phrase about a hundred times in late 2022. Everyone was groping for why it felt qualitatively better than GPT-3, which by parameter count and pre-training data was the same model underneath. The answer wasn't a bigger model or more tokens. It was a training technique most engineers outside the alignment world had never touched: Reinforcement Learning from Human Feedback.
RLHF is the bridge between "an LLM that predicts the next token" and "an LLM that is actually helpful." It's the step OpenAI used to turn a raw language engine into an assistant. It's what Meta used to ship Llama 3. It's what decides whether your model sounds like a toddler who swallowed an encyclopedia or like a colleague who's read the room.
I was scared of this stack for a while. The papers are dense, the math is chewy, and everyone talks about PPO and DPO like they're obvious. They aren't. They're beautiful, but you have to earn them.
So here's the whole thing. Reward models, PPO, DPO, with the math, the code, the failure modes, and the analogies I wish someone had handed me on day one. If you've been nodding along in meetings while people say "we just DPO'd it," by the end of this post you'll know what that actually means.
The foundation: what RL even is
Before we do anything RLHF-specific, we need to be honest about the paradigm.
Supervised learning has an answer key. You show the model an input, show it the correct output, and train until its predictions match. Clean. Deterministic. Boring in a good way.
Reinforcement learning has no answer key. It has consequences.
Imagine playing "Hot and Cold" blindfolded. You step forward, a friend shouts "Warmer." You step left, "Colder." Over time you learn a direction. Nobody told you where the hidden object was. You figured it out by being graded.
Formalized:
Observation. The agent sees a state (a prompt).
Action. The agent picks something to do (generate a token).
Feedback. The environment returns a reward signal.
Update. The agent tweaks its policy so rewarded actions get more likely.
Inside that loop there's always a tension: exploration (try something new, maybe discover gold) vs exploitation (do the thing that worked last time). Every RL algorithm is a different answer to that tradeoff.
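To make the loop concrete, here's a toy sketch, nothing LLM-specific yet: a two-armed bandit agent that learns which lever pays out, using ε-greedy exploration. All the numbers are made up for illustration.
import random

# A toy two-armed bandit: observe → act → get reward → update.
true_payouts = [0.3, 0.7]   # hidden reward probabilities (the environment)
estimates = [0.0, 0.0]      # the agent's running value estimate per action
counts = [0, 0]
epsilon = 0.1               # exploration rate: 10% of steps try something random

for step in range(10_000):
    # Exploration vs exploitation, decided by a coin flip.
    if random.random() < epsilon:
        action = random.randrange(2)                        # explore
    else:
        action = 0 if estimates[0] >= estimates[1] else 1   # exploit
    # Feedback: a reward, not an answer key.
    reward = 1.0 if random.random() < true_payouts[action] else 0.0
    # Update: nudge the estimate for the chosen action toward the reward.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # ends up near [0.3, 0.7]; the agent figured out arm 1 by being graded
Nobody told the agent the payout table. It reconstructed it from consequences alone, which is the whole paradigm in miniature.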
Why not just use supervised learning for everything?
Because "good writing" doesn't have an answer key.
If I ask you to write a better summary of this paragraph, there isn't one right answer. There are a thousand acceptable ones, a thousand mediocre ones, and a few that are actively bad. Instruction fine-tuning (IFT) is great when you're teaching a model to extract JSON, write SQL, or classify support tickets. Tasks where "correct" is binary.
It falls apart for summarization, chat, safety, creativity. You can't write a training example for "be helpful without being sycophantic." You can only recognize it when you see it.
That's where RLHF lives. And it stacks with IFT, not instead of it. The real pipeline looks like this:
Pretraining → IFT → RLHF
Pretraining teaches grammar and facts. IFT teaches format. RLHF teaches values and taste. Different tools, different stages.
A quick tour of the Transformer, because we need it
You can't modify a language model's behavior if you don't know what's under the hood. Here's the five-minute version.
When text enters a Transformer, three things happen first. Tokenization turns text into integer IDs (the model doesn't read words, it reads numbers). Embeddings map each ID into a high-dimensional vector (768 dims in DistilBERT, 4096 in Llama-7B), and similar words cluster in that space. Positional encoding injects a mathematical signal that tells the model which token came first, because Transformers process all tokens in parallel and would otherwise treat "Dog bites man" and "Man bites dog" as the same sentence.
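If you've never looked at tokenizer output, a few lines make the "integer IDs" point concrete. This assumes the transformers library and the distilbert-base-uncased checkpoint the post reuses later; the exact IDs printed depend on the vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ids = tokenizer("Dog bites man")["input_ids"]
print(ids)                                    # integer IDs, something like [101, 3899, ...]
print(tokenizer.convert_ids_to_tokens(ids))   # roughly ['[CLS]', 'dog', 'bites', 'man', '[SEP]']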
The beating heart is self-attention. For every token, the model asks: "Which other tokens in this sentence matter for understanding me?" It implements this through three learned matrices: Query, Key, and Value.
Think of it as a library database. The Query is what this word is looking for. The Key is the searchable metadata every other word raises. The Value is the actual content pulled in based on Q·K alignment.
Example: in "The bark of the tree was rough," the word "bark" is ambiguous. It generates a Query like "I'm a noun, what's my context?" Every other word raises a Key. The word "tree" raises "I am a plant." The dot product Q_bark · K_tree scores high. The Value of "tree" gets blended into the representation of "bark." Result: the model decides "bark" means tree bark, not a dog sound.
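Here's a minimal single-head version of that Q/K/V dance with random toy weights, just to show the shapes and the blending step. Real models add multiple heads, causal masks, and a projection per layer; this is a sketch, not a faithful implementation.
import torch
import torch.nn.functional as F

d_model = 8
tokens = torch.randn(6, d_model)       # 6 token embeddings, e.g. "The bark of the tree was"

W_q = torch.randn(d_model, d_model)    # learned Query projection
W_k = torch.randn(d_model, d_model)    # learned Key projection
W_v = torch.randn(d_model, d_model)    # learned Value projection

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Q·K alignment: how much each token should attend to every other token.
scores = Q @ K.T / (d_model ** 0.5)    # scaling keeps softmax gradients sane
weights = F.softmax(scores, dim=-1)    # each row sums to 1

# Each token's new representation is a weighted blend of Values.
# This is the step where "tree" gets mixed into "bark".
output = weights @ V
print(output.shape)                    # torch.Size([6, 8])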
Then comes the forward pass, loss, backprop, gradient descent. The update rule is:
θ_new = θ_old − (Learning_Rate × ∇Loss)
Over trillions of token predictions, a Transformer trained on internet text learns to predict "what an internet user would write." Which, and this is the whole point of RLHF, includes a lot of stuff humans don't actually want. Toxic takes, biased reasoning, confidently wrong answers. RLHF is the mechanism that takes this raw language engine and redirects its weights toward what humans actually value.
Part 1: the reward model, building the judge
The classic RLHF pipeline has three phases: supervised fine-tuning (the IFT step above), reward model training, and RL optimization. We'll do the second one first because everything downstream depends on it.
The reward model is a separate AI whose entire job is to read a (prompt, response) pair and output a single number representing quality. That's it. It's the judge.
Why we rank instead of score
Here's something that tripped me up the first time. The training data for a reward model is not people giving responses a score out of 10. It's pairs: "Given this prompt and these two responses, which one is better?"
Why?
Scores are noisy. Human 1 rates a response 7/10, Human 2 rates the same response 9/10. That isn't disagreement, that's calibration drift. Everyone uses the 1-10 scale differently.
Rankings are consistent. Ask two humans "Is A better than B?" and they usually agree. Pairwise ranking is way more reliable than absolute scoring.
A typical training example:
Prompt: "Summarise the history of the internet."
Chosen (A): "The internet started in the 1960s as a US military
project called ARPANET..."
Rejected (B): "Computers are good because they connect people."
Human label: A > B
Bradley-Terry: the math that turns preferences into loss
We need to convert "A > B," a binary human judgment, into a trainable loss. The Bradley-Terry model does this by assuming every response has a hidden latent quality score r, and modeling the probability that a human prefers A over B via the sigmoid of the score difference:
P(A > B) = σ( r(x, y_winner) − r(x, y_loser) )
= 1 / (1 + e^(−(r_A − r_B)))
The sigmoid σ squashes any real number into [0, 1]. When the gap r_A − r_B is large and positive, σ goes to ~1 (the model is confident A is better). When the gap is negative, σ goes to ~0.
Training is just "maximize the probability of each preference pair." Equivalently, minimize the negative log likelihood:
L(θ) = −log( σ( r_θ(x, y_winner) − r_θ(x, y_loser) ) )
The Gap: why this loss shape is so good
Let Gap = r_winner − r_loser. This is the quantity the loss is really sensitive to:
Correct and confident: Gap = +4 → σ(+4) ≈ 0.98 → Loss ≈ 0.02
Wrong and confident: Gap = −4 → σ(−4) ≈ 0.02 → Loss ≈ 4.0
That's the magic of the log. Confidently wrong answers get enormous loss. If the model is 100% sure about the wrong answer, the penalty approaches infinity. The math literally screams at the model to fix itself.
Backprop then does something I find quietly delightful. Because the loss is a function of (r_winner − r_loser), gradients flow in two opposite directions at the same time. The winner pathway gets "increase the weights that produced r_winner" (score goes up). The loser pathway gets "decrease the weights that produced r_loser" (score goes down). One comparison, two gradient signals. After thousands of iterations, the model reliably assigns high scores to quality text and low scores to junk.
And here's the philosophical bit that took me a while to absorb: the reward model invents the scores from scratch. The dataset never contained a single numerical score. Just thousands of "A > B." The model learns to fabricate a coherent numerical scale that's consistent with every human preference it's seen.
Think of a boxing match. The referee only raises the winner's hand. No scorecard, no numbers. The reward model is like the ringside judge who watches every fight and learns to write detailed scorecards that always predict which fighter the referee will pick. It invents the numbers. It only needed to see who won.
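A sketch of that loss in PyTorch, reproducing the Gap numbers above. Note the use of logsigmoid rather than log(sigmoid(...)), for reasons that come up in the debugging tips further down.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_winner: torch.Tensor, r_loser: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(gap)), written with logsigmoid for numerical stability
    return -F.logsigmoid(r_winner - r_loser)

# Reproduce the Gap table above:
print(bradley_terry_loss(torch.tensor(4.0), torch.tensor(0.0)))  # ≈ 0.018 (confident, correct)
print(bradley_terry_loss(torch.tensor(0.0), torch.tensor(4.0)))  # ≈ 4.018 (confident, wrong)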
The surgery: turning an LLM into a scorer
You can't use a language model off the shelf as a reward model. LLMs output probabilities over a 50,000-token vocabulary, not a scalar. So we do surgery.
First, load a pre-trained backbone (BERT, DistilBERT, a small Llama, anything that already understands language). Then rip off the Language Head, the final layer that maps hidden states to vocabulary probabilities. Attach a Score Head in its place: a single nn.Linear(hidden_size, 1) that maps the hidden state to one number.
The hidden state itself is worth pausing on. When DistilBERT processes a sentence, the [CLS] token produces a 768-dimensional vector. Each dimension is a learned feature. You can think of it as an RPG character sheet:
Feature 1 (positivity?): 0.02
Feature 2 (about animals?): 9.5
Feature 3 (past tense?): 5.1
...
Feature 768 (political?): -0.01
The Score Head, nn.Linear(768, 1), is just a weighted sum (plus a learned bias):
Score = (F1 × w1) + (F2 × w2) + ... + (F768 × w768) + b
Training figures out which of the 768 weights should be positive and which should be negative. Features like "polite" and "detailed" get positive weights. Features like "vague" and "rude" get negative weights. That's it. That's the whole model.
Full PyTorch implementation
Here's the complete, annotated reward model. The comments matter; read them.
Part A, Dataset
The key trick here is teacher forcing. We feed [prompt + response] together so the model reads both at once rather than letting it generate freely.
import torch
from torch.utils.data import Dataset

class PreferenceDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data  # List of (prompt, winner, loser) tuples
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # ... (21 more lines truncated for brevity)

Part B, the model
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # STEP 1: Load pre-trained backbone.
        # Already understands English. Does NOT yet know quality.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        # STEP 2: Query the backbone's hidden size.
        # DistilBERT → 768, BERT-large → 1024, Llama-7B → 4096
        # ... (14 more lines truncated for brevity)

Part C, the training loop with Bradley-Terry loss
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
# ─── CONFIG ────────────────────────────────────────────────────
MODEL_NAME = "distilbert-base-uncased" # 66M params, CPU-friendly
BATCH_SIZE = 2
EPOCHS = 10
LR = 1e-5 # Reward models are sensitive, keep LR small
# ... (73 more lines truncated for brevity)
A few things that will save you a day of debugging.
[CLS] vs last token. BERT-style models use [CLS] at index 0. GPT and Llama-style models use the last real token. Because batches are padded, "last" lives at different indices per row, so you'll need gather().
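Here's roughly what that gather() trick looks like. The tensor names are mine, and the mask is assumed to be the usual 0/1 integer attention mask a Hugging Face tokenizer returns.
import torch

def last_token_hidden(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1
    last_idx = attention_mask.sum(dim=1) - 1                 # per-row index of the last REAL token
    idx = last_idx.view(-1, 1, 1).expand(-1, 1, hidden_states.size(-1))
    return hidden_states.gather(1, idx).squeeze(1)           # (batch, hidden)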
Always use logsigmoid(x), never log(sigmoid(x)). The second form explodes to -inf for very negative inputs and you will stare at a nan loss for an hour before noticing.
Initial loss should be around 0.693, which is −log(0.5), a coin-flip guess. A rapid drop from there confirms the model is learning. If it doesn't move, your LR is probably wrong or your data has a bug.
For production, don't build from scratch. ArmoRM, Starling-RM, Skywork-Reward-Model, and DeBERTa-v3-base-reward-model already exist. They were trained on millions of human votes. Use them unless your domain is genuinely weird (medical, internal legal compliance). The "Buy vs Build" decision is almost always "Buy" for v1.
Okay. We have a judge. Now what do we do with it?
Part 2: PPO, the classic RLHF engine
Proximal Policy Optimization is the algorithm that trained ChatGPT. It's the "classic" RLHF engine, the one most papers still reference as the baseline. It also has a reputation for being unstable, VRAM-hungry, and a pain to tune. That reputation is earned.
The core loop is simple. The model generates a response, the reward model scores it, the model updates to get more reward next time. You're training a dog with treats.
The complications come from making that loop actually stable.
The four models
Before you run a single training step, you need to load four neural networks into memory. This is why PPO is expensive.
The Actor is the main LLM being trained. It generates text and its weights change every step.
The Reference is a frozen copy of the Actor at step 0. It never changes. Its whole purpose is to compute the KL penalty that keeps the Actor from wandering off into gibberish.
The Reward Model is the frozen judge from Part 1. It reads generated text and outputs a scalar. It never changes.
The Critic runs alongside the Actor and predicts the expected future reward at every token position. It trains simultaneously with the Actor. I'll explain why it exists in a minute.
Four models in VRAM, three of them the size of the one you're actually training. You see why people moved to DPO.
The four phases of a rollout
Phase 1, rollout (playing the game). The Actor gets a batch of prompts and generates full responses token by token. Critically, during generation, the log probability of each chosen token is saved. These are the "Old LogProbs" and they matter in Phase 4.
Prompt: "Summarise this article about climate change."
Actor: "Climate change is driven by greenhouse gas emissions, primarily..."
LogProbs: [log(0.82), log(0.91), log(0.74), ...]
Phase 2, scoring (Reward plus KL penalty). The completed response gets two evaluations.
The Reward Model returns a scalar, say +5.0. The same text is run through the frozen Reference Model, and the KL divergence between Actor and Reference token distributions measures how far the Actor has drifted from its original self.
The total reward blends them:
Total_Reward = R_judge − β × log( P_actor(token) / P_reference(token) )
β is typically 0.1, the stiffness of the leash.
If the Actor is generating normal English, P_actor ≈ P_reference, the log ratio is near zero, and there's basically no penalty. If the Actor starts generating gibberish to game the reward model, P_actor >> P_reference, the log ratio explodes, and the penalty wipes out the reward.
This is the mechanism that prevents reward hacking, the classic failure mode where the AI discovers that outputting "Amazing! Great! Wonderful!" triggers high sentiment scores, so it just says that forever and forgets English.
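One common way to wire this up (roughly what TRL does internally, though details vary): apply the KL penalty at every token and drop the judge's scalar on the final one. The function below is a sketch under those assumptions, not library code.
import torch

def kl_shaped_rewards(judge_score: float,
                      actor_logprobs: torch.Tensor,  # (seq_len,) log P_actor of generated tokens
                      ref_logprobs: torch.Tensor,    # (seq_len,) log P_reference of the same tokens
                      beta: float = 0.1) -> torch.Tensor:
    kl_per_token = actor_logprobs - ref_logprobs   # log(P_actor / P_ref), the leash
    rewards = -beta * kl_per_token                 # penalty applies at every position
    rewards[-1] += judge_score                     # the judge's scalar lands on the last token
    return rewards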
Phase 3, advantage estimation (the Critic's job). Here's where the fourth model earns its keep. PPO doesn't optimize for raw reward. It optimizes for Advantage: how much better or worse the actual reward was than what was expected.
Advantage = Actual_Total_Reward − Expected_Reward(Critic's prediction)
Two examples:
Example A, Positive Advantage:
Actor: "...absolutely mind-blowing!"
Critic predicted: 0.5
Actual reward: 0.9
Advantage: +0.4
→ Strongly INCREASE probability of "mind-blowing" in similar contexts.
Example B, Negative Advantage:
Actor: "...boring."
Critic predicted: 0.5
Actual reward: 0.2
Advantage: −0.3
# ... (1 more line truncated for brevity)
Here's the key insight that took me a beat to accept: PPO optimizes for surprise, not reward. If the Critic predicted 0.9 and the Actor delivered 0.9, Advantage = 0, and nothing updates. Only genuine surprises, better or worse than expected, drive learning.
Why do we need a Critic?
This is where I stumbled the first time. If the reward model already tells me the score, why do I need a second model predicting scores?
The answer is the credit assignment problem. Picture this:
Actor generates: "Why did the chicken cross the road? To punch you in the face." Reward Model reads the complete sentence and gives it -10 (toxic ending).
The reward model applies its score to the entire sentence. But were the first 8 tokens responsible? Or only the last 5? Without a Critic, the Actor can't tell. It would penalize every token equally, including the perfectly innocent "Why did the chicken."
The Critic fixes this. By predicting expected reward at every token position, it can identify the exact moment the sentence went off the rails. When the Actor picks "To punch" over "To get to the other side," the Advantage at that specific token is sharply negative. The weight update focuses precisely where the sentence went wrong.
That's the credit assignment problem, solved by a learned value function.
Phase 4, the PPO update with clipping. This is the innovation that makes PPO stable.
The Actor re-evaluates the same text it generated and computes new log probabilities. We take the ratio of new to old:
Ratio = π_θ_new(action) / π_θ_old(action)
      = exp(LogProb_new − LogProb_old)
The unclipped objective would be:
L = Ratio × Advantage
But that's dangerous. If the model discovers a massive update gives a huge reward, it might make that massive update in one step and permanently destroy its language ability. So PPO clips:
L_CLIP = min( Ratio × Advantage,
              clip(Ratio, 1−ε, 1+ε) × Advantage )
ε is typically 0.2. Translation: the ratio can't move more than 20% per update. If Ratio > 1.2 or < 0.8, the gain is capped.
This is the "Proximal" in PPO. The model is mathematically forced to take small, safe steps.
Picture a golfer fixing their swing. They can make a 1-inch adjustment to their grip, which gives stable improvement, or they can twist 180 degrees and change their whole stance, which breaks their swing forever. PPO enforces the 1-inch option. Even if the math says "a 500% change would score much higher," PPO clips it to 20%. This prevents policy collapse, where a single catastrophic update destroys the model's language ability.
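The clipped objective itself fits in a few lines. This is a minimal sketch of the surrogate loss, assuming you already have per-token log probs and advantages in hand:
import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    # Ratio = pi_new / pi_old, computed in log space for stability.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (smaller) objective, then negate: we minimize loss.
    return -torch.min(unclipped, clipped).mean()
Taking the min of the clipped and unclipped terms is the whole safety mechanism: a weight update can never be justified by a ratio that moved more than ε away from 1.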
PPO in code with TRL
Writing PPO from scratch is a multi-hundred-line commitment. In practice, Hugging Face's trl library does the heavy lifting.
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline
import torch
# ─── 1. LOAD MODELS ────────────────────────────────────────────
# AutoModelForCausalLMWithValueHead wraps the base model with TWO heads:
# (a) Language Head: predicts next token
# (b) Value Head: predicts expected future reward (the Critic)
# Both share the 768-dim hidden state backbone.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
# Frozen safety net. Weights NEVER update. Used for KL penalty.
# ... (54 more lines truncated for brevity)
The 4-step summary of what step() actually does
Step | Action | Technical detail
1 | Check for madness (KL leash) | Compare Actor probs against frozen Reference. Too much divergence → subtract KL penalty from reward.
2 | Calculate total reward | Total_Reward = Reward_Score − (β × KL_Penalty). β = 0.1 typical.
3 | Calculate surprise (Advantage) | Advantage = Total_Reward − Value_Head_Prediction. Positive → reinforce. Negative → discourage. Zero → no learning.
4 | Update weights (clipped) | Apply clipped objective. clip(Ratio, 0.8, 1.2) guardrail caps behavior change at 20% per step.
The name Proximal Policy Optimization encodes the guarantee: the updated policy must remain proximal (close) to the previous policy. Clipping and the KL penalty work as two independent guardrails. Small, safe, cumulative steps win the alignment race.
A quick terminology detour, parameters vs weights
This comes up constantly. "Parameters" is the umbrella term; weights are a subset. From the algebra you learned in high school: y = mx + b.
The weight (m) is the slope. It multiplies the input and determines strength and direction of influence.
The bias (b) is the y-intercept. A baseline offset. It shifts the activation.
When someone says "a model has 7 billion parameters," they mean ~7 billion numbers, mostly weights inside the Q/K/V attention matrices. Those numbers are what encode everything the model knows.
Part 3: DPO, the beautiful shortcut
In 2023, a team at Stanford published a paper with a mathematical claim that sounded too good to be true: you don't need the reward model. You don't need the Critic. You don't need the whole online RL loop.
You can do RLHF with two models and one loss function.
This is Direct Preference Optimization, and it's the reason Llama 3, Mistral, and most modern models train in days instead of weeks. It's also much less likely to catastrophically destroy your model at 3am.
The trick: DPO proves mathematically that the reward model is redundant. You can compute the exact same alignment signal directly from the log-probability ratio between the policy you're training and a frozen reference copy of itself. No separate judge. No rollouts. No Critic.
Only two models
The Policy Model (π_θ) is the student. Its weights update every step.
The Reference Model (π_ref) is a frozen copy of the policy at step 0. Its weights never change. It acts as the KL anchor.
That's it. Two models. Offline training on a fixed preference dataset. It looks like supervised learning.
The dataset
Same format as reward model training, pairwise preferences:
Prompt (x): "Explain the sky to a toddler."
Chosen (y_c): "The sky is a big blue blanket over the world!"
← human preferred
Rejected (y_r): "The sky is the atmosphere scattering Rayleigh light."
← too complex
Teacher forcing, extracting four log probabilities
In DPO's forward pass, the model does not generate freely. We use teacher forcing: feed the model [prompt + response] together, then extract the log probability the model assigns to that exact text. We do this four times:
log π_θ(y_c | x), active model's logprob for chosen
log π_θ(y_r | x), active model's logprob for rejected
log π_ref(y_c | x), reference model's logprob for chosen
log π_ref(y_r | x), reference model's logprob for rejected
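A sketch of how one of those four numbers gets extracted. The same function serves all four calls; for the frozen reference model you'd wrap the call in torch.no_grad(). The response_start argument (the position where response tokens begin) is my own naming, not a library API.
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_start):
    # input_ids: (1, seq_len) = tokenized [prompt + response]; no generation happens
    logits = model(input_ids).logits                      # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)     # position t predicts token t+1
    targets = input_ids[:, 1:].unsqueeze(-1)              # the tokens that actually came next
    token_lp = log_probs.gather(2, targets).squeeze(-1)   # log-prob of each actual token
    return token_lp[:, response_start - 1:].sum()         # sum over response tokens only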
A detour on why logs
Multiplying hundreds of tiny fractions to score a sentence is a disaster. A 100-token sentence might have a joint probability around 10^-100. A 32-bit float starts to underflow below roughly 10^-38 and soon rounds to exactly 0.0. This is arithmetic underflow. Once a probability hits zero, your model can no longer distinguish anything, because every comparison becomes 0 == 0.
Logs dodge this using log(A × B) = log(A) + log(B):
Multiplying (underflows):
0.40 × 0.80 × 0.50 × 0.90 = 0.144
Adding logs (safe):
log(0.40) + log(0.80) + log(0.50) + log(0.90)
= (−0.916) + (−0.223) + (−0.693) + (−0.105)
= −1.937Two reasons logs are everywhere. The first is safety: multiplication becomes addition, no underflow. The second is order preservation: if P(A) > P(B), then log(P(A)) > log(P(B)). Rankings survive. The model only cares about which answer is better, and logs guarantee that.
And because someone always asks: log(0.40) = −0.916 because 2.718^(−0.916) = 0.40. Math libraries compute this with argument reduction plus polynomial approximations in the spirit of the series log(1+y) = y − y²/2 + y³/3 − y⁴/4 + ..., which converges to the exact log.
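You can watch the underflow happen in a few lines of NumPy:
import numpy as np

probs = np.full(100, 0.01, dtype=np.float32)  # 100 tokens, each with probability 0.01
print(np.prod(probs))                         # 0.0: the true value, 10^-200, underflows float32
print(np.log(probs).sum())                    # ≈ -460.5: same information, no underflow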
The implicit reward, the move that makes DPO work
Now the elegant bit. Instead of training a separate reward model, DPO defines an implicit reward directly from the log ratio between the Active and Reference models:
Implicit_Reward(chosen) = log π_θ(y_c | x) − log π_ref(y_c | x)
Implicit_Reward(rejected) = log π_θ(y_r | x) − log π_ref(y_r | x)
Concrete numbers:
Active: Chosen = −10, Rejected = −12
Reference: Chosen = −15, Rejected = −11
Reward(Chosen) = (−10) − (−15) = +5
→ "Active likes the good answer 5 units MORE than originally."
Reward(Rejected) = (−12) − (−11) = −1
→ "Active likes the bad answer 1 unit LESS than originally."
Scale by β = 0.1:
Reward_C = +0.5, Reward_R = −0.1
Here's the thing that finally made me understand the paper. That log ratio log(π_θ / π_ref) is the per-sample version of the KL divergence term, the same penalty PPO subtracts to prevent reward hacking. DPO bakes it into the implicit reward for free. No separate penalty step. No β-tuned KL subtraction. It's already in there.
The DPO loss, same Bradley-Terry, different inputs
With the two implicit rewards in hand, the rest looks exactly like reward model training:
STEP 2, Gap (Advantage):
Gap = Reward_Chosen − Reward_Rejected
= (+0.5) − (−0.1) = +0.6
STEP 3, Sigmoid:
σ(Gap) = 1 / (1 + e^(−0.6)) ≈ 0.65
→ "Active is 65% confident Chosen > Rejected."
STEP 4, Loss (negative log):
L_DPO = −log(σ(0.6)) ≈ 0.43
The full formula:
L_DPO = −log( σ( β × [log π_θ(y_c|x) − log π_ref(y_c|x)]
− β × [log π_θ(y_r|x) − log π_ref(y_r|x)] ) )
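The whole loss, with the worked numbers from above plugged in. This is a single-example sketch; real implementations batch it and get the log probs via the teacher-forcing trick shown earlier.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    reward_c = beta * (pi_chosen - ref_chosen)      # implicit reward, chosen
    reward_r = beta * (pi_rejected - ref_rejected)  # implicit reward, rejected
    return -F.logsigmoid(reward_c - reward_r)       # Bradley-Terry on the gap

# The worked example from above:
loss = dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
                torch.tensor(-15.0), torch.tensor(-11.0))
print(loss)  # tensor(0.4375), matching the ≈ 0.43 above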
Loss behavior you'll see in practice:
Model perfect → Gap = +5 → σ(+5) ≈ 0.99 → Loss ≈ 0.01 [tiny update]
Model undecided → Gap = 0 → σ(0) = 0.50 → Loss ≈ 0.69 [moderate]
Model wrong → Gap = −3 → σ(−3) ≈ 0.05 → Loss ≈ 3.0 [large]
The Step 0 paradox
Here's the question that stumped me. At step 0, the Active Model and the Reference Model are perfect clones. Their log probs for every sentence are identical. The implicit rewards are both zero. The Gap is zero. How does learning start?
Active = Frozen
→ Reward(Chosen) = 0, Reward(Rejected) = 0
→ Gap = 0
→ σ(0) = 0.50 (perfectly undecided)
→ L = −log(0.50) ≈ 0.693The loss is 0.693, not zero. The algorithm interprets "50% confidence" as error, because we want 100% confidence. That 0.693 triggers the first backprop, which makes a microscopic nudge: slightly increases chosen-token probs, slightly decreases rejected-token probs.
Now the Active Model is no longer a perfect clone:
Active: Chosen = −9.99, Rejected = −12.01
Frozen: Chosen = −10.00, Rejected = −12.00
Gap = 0.02 → Loss drops from 0.693 to 0.683
Backprop in plain English
One loss number has to update millions of weights. This is the Chain Rule, implemented by an engine called Autograd.
Here's an analogy that clicked for me. Picture 160,000 employees (weights). The CEO sees a $10M quarterly loss. Instead of randomly blaming everyone, perfect accounting traces exactly which decisions caused it.
At the department level: "Sales caused 80%, Engineering 20%." At the manager level: "Manager A's project caused 70% of Sales' share." At the employee level: "Employee #5,432, you adjusted your dial too far right. You caused 0.0001% of the total loss."
That's backprop. It flows backward from the loss through every mathematical layer, answering one question at each step: "If I tweak this weight by a tiny amount, does the loss go up or down?"
Then gradient descent:
θ_new = θ_old − (Learning_Rate × ∇Loss)
There are two distinct steps. The calculus step asks "Which direction does this weight move? How steep is the hill?" and produces the gradient. The learning rate step multiplies the gradient by something like 0.0001 to take a safe, small step. That prevents overshooting.
Example:
Weight A: Gradient = +5.0 → adjust by −0.0005 (big step against a steep slope)
Weight B: Gradient = +0.5 → adjust by −0.00005 (tiny)
Weight C: Gradient = −2.0 → adjust by +0.0002 (opposite direction)
A shower analogy works too. Scalding hot water is the loss. You turn the knob in the opposite direction (minus sign) but only a small amount (learning rate). Do it again. And again. Eventually you find the right temperature.
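The whole ritual in miniature, with one weight and PyTorch's autograd doing the chain-rule bookkeeping. A toy example, obviously, not training code.
import torch

theta = torch.tensor(3.0, requires_grad=True)   # one weight
lr = 0.0001                                     # learning rate

loss = (theta - 1.0) ** 2    # toy loss, minimized at theta = 1
loss.backward()              # autograd runs the chain rule for us

print(theta.grad)            # tensor(4.) → loss rises if theta rises
with torch.no_grad():
    theta -= lr * theta.grad # step AGAINST the gradient, scaled by the learning rate
print(theta)                 # 2.9996, a microscopic nudge downhill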
DPO in code with TRL
This is what sold me. Compare it to the PPO code above.
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
# ─── 1. DATASET ────────────────────────────────────────────────
# DPO requires 3 columns: "prompt", "chosen", "rejected"
# Real options:
# load_dataset("HuggingFaceH4/ultrafeedback_binarized")
# Or generate pairs via RLAIF with a teacher model (GPT-4 etc.).
dataset = load_dataset("json", data_files="preference_data.json")
# ─── 2. MODELS ─────────────────────────────────────────────────
# ... (50 more lines truncated for brevity)That's it. That's RLHF in ~40 lines of config. No rollouts, no clipping, no Critic, no manual reward loop. It behaves like supervised training. It's stable. It's cheap.
DPO config parameters, the ones that matter
Parameter | Typical | When to change
beta (β) | 0.1 | Raise to 0.5 for more conservative updates. Drop to 0.01 for faster alignment.
learning_rate | 5e-7 | Very low. Raise to 1e-6 only if loss isn't decreasing.
per_device_train_batch_size | 2 to 8 | VRAM-limited. Bigger batches produce more stable gradients.
gradient_accumulation_steps | 4 to 8 | Simulate bigger batches without more VRAM.
max_length | 512 | Total prompt + response tokens.
num_train_epochs | 1 to 3 | Usually 1 is enough. More risks overfitting preferences.
PPO vs DPO, side by side
Having done both, here's the decision matrix.
Dimension | PPO (classic) | DPO (modern)
Year | 2017 / 2022 for RLHF | 2023
Models needed | 4: Actor, Reference, Reward, Critic | 2: Policy, Reference
Reward model | Required, trained separately | Not needed, reward is implicit
Paradigm | Online RL, model generates and gets scored | Offline, fixed preference dataset
Stability | Unstable, hyperparameter-sensitive | Stable, behaves like SFT
Memory | High, 4 models in VRAM | 30–40% less than PPO
Complexity | High, manual loops, clipping, Critic | Low, DPOTrainer handles it
Reward hacking risk | High, Actor can game Reward Model | Low, KL baked into loss
Best for | Online adaptation, strong Reward Model | Standard chat, summarisation, safety
Notable users | OpenAI (ChatGPT), early Claude | Meta (Llama 3), Mistral, most 2024+
When to pick PPO
You need online exploration, meaning the model must try responses it's never seen. You have a very strong, well-calibrated reward model and the engineering muscle to babysit the training run. You need fine-grained credit assignment at the token level, like long reasoning chains.
When to pick DPO
You have a preference dataset, or you can generate one with RLAIF. Engineering simplicity and GPU efficiency are priorities. You're aligning a model for standard chat, safety, summarisation, or instruction-following. You want stable, predictable training without risking policy collapse.
The hybrid approach that the cool kids are doing in 2025
Generate preference pairs via RLAIF (an AI teacher creates chosen/rejected pairs). Train with DPO, since it's cheap and stable. If specific capabilities regress, fine-tune targeted prompts with PPO.
This is more or less what every state-of-the-art model builder is running right now.
So what changed for me
The first time I saw the DPO loss function, my reaction was "that can't be all of it." I kept searching for a hidden step. There isn't one. The whole algorithm is two log-probability differences, scale by β, subtract, sigmoid, negative log. Done.
And somehow that's enough to turn raw next-token predictors into assistants that know when to push back.
What RLHF is actually doing, regardless of PPO or DPO, is surprisingly humble. You're not teaching the model new facts. You're not giving it new capabilities. You're teaching it which of the responses it was already capable of generating a human would prefer. It's curation, not creation.
The math is gorgeous, but the philosophical point is simpler. These models contain a distribution of possible outputs. Most of them are bad. Some are great. RLHF shifts probability mass from bad to great without teaching anything new.
Two practical takeaways if you're about to do this work:
Don't build a reward model for v1. Use ArmoRM or Starling-RM. You'll save a month and they're better than what you'd build in a month. Only roll your own when your domain is genuinely weird.
Start with DPO. Unless you have a specific reason to need online exploration, DPO is cheaper, more stable, and usually good enough. You can always graduate to PPO later for targeted fixes.
The RLHF stack is the most important idea in LLM training that isn't "bigger model, more data." It's the reason these things feel like anything at all. And now you know how it works.
Go train something.
If anything here was unclear, or straight-up wrong, I want to know. These posts get better when people push back.