gosh, DPO is so forgotten man, was such a major innovation.
Do you think it is still a relevant / useful tool for post-training today? I mostly made this as part of a "learn RL for LLMs from scratch" lecture series, but obviously the usage of DPO has declined a lot since its proposal. My current thought is that it's still useful for cheap / polishing iterations on formatting and style (or for ironing out small issues that might come up for a model), but PPO now consumes a much larger portion of the post-training process even for OSS models and does all the heavy lifting.
yes. tulu 3 is a great example. it's the simplest baseline and that's often the strongest for small teams. now i bet doing DPO after RL for reasoning is the default for what we should do.
You have become my fav!
thank you for reading!
thank you, this was super helpful and insightful, please keep on making more of these!
Glad it was helpful. More coming for sure!
Another great overview of a topic that is sorely lacking good entry-level explanations. You are cooking! Keep up the amazing work!
If you do a follow-up in this post-training series, I would suggest expanding a bit on the intuition for why off-policy methods are more limited than on-policy methods. The way I typically explain it: in step 0 of the training run, DPO is near-equivalent to RLHF with a good RM trained on the same preference dataset. As the policy model is updated during training, the two methods diverge:
- The samples used in each step of the RLHF run will get progressively better, which ideally allows the training process to continue to push response quality from "good" to "great". The analogy I use is adaptive learning in a school setting: If a student is mastering a set of single-digit addition flash cards, a good system will adapt by presenting harder flash cards (double digit, multiplication, etc.) that track the frontier of the student's current understanding.
- The samples used in each step of the DPO run are the same static level of quality from the original SFT model. This is like continuing to exclusively show the student single-digit addition flash cards, even when they are getting 95% of them right.
That being said, both methods are fundamentally limited in two ways that tend to reduce the upside of RLHF:
- The reward model (explicit in RLHF, implicit in DPO) won't necessarily generalize to discriminate on "harder" cases than those in the original preference dataset. As the RLHF model generates better responses over the course of training, it will often just saturate the RM's ability to discriminate between good responses, and plateau at a level where further "improvements" just represent reward hacking.
- Both methods are limited in how far they can diverge from the SFT model by the KL divergence term in the loss function. This is necessary for training stability and for avoiding degenerate cases of "reward hacking" (again, implicit in the case of DPO; see the sketch below).
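To make the "implicit" part concrete, here is a minimal sketch of the DPO loss in PyTorch. The function name and the assumption that log-probabilities have already been summed per response are mine; the beta coefficient plays the role of the KL-strength term, and the log-ratio against the frozen reference (SFT) model is what keeps the policy anchored, rather than an explicit KL penalty as in PPO.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward for a response y: beta * (log pi(y|x) - log pi_ref(y|x)).
    # The reference term is where the "stay close to the SFT model" constraint
    # lives; there is no separate KL penalty as in PPO-style RLHF.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that nothing in this loss samples from the current policy, which is exactly the off-policy limitation described above.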
Another hairy issue with DPO is how to handle cases where the two responses in a preference pair have significantly different lengths. This poses three related problems:
1. Longer sequences of tokens inherently have lower cumulative probability. If the probability of each token is similar, then more things have to "go right" to get a longer string, including trivial differences in wording, punctuation, etc. This creates a bias toward shorter responses within specific response pairs.
2. Longer sequences of tokens mechanically produce larger-magnitude training losses, because the loss is calculated by summing over more tokens. This biases training toward weighting longer response pairs more heavily within the dataset.
3. Preference datasets often have a bias towards longer responses. Raters are just empirically more likely to click "prefer" on the response that has more detail, even if they wouldn't holistically prefer a model that is more verbose in every case. This biases the training process towards longer responses, which is mechanically correct but often doesn't match the model developer's overall goals.
#1 and #2 are often addressed by introducing length-normalization - instead of calculating a sum of log probabilities over each of the responses, we calculate average per-token log-probability (i.e. divide the sum by the token count). This "fixes" the issue by making the loss length-invariant, but it does so by changing the training objective to something that doesn't necessarily match our goals. A coworker used to describe statistical techniques as "giving an answer to a question sort of like the one you asked, and maybe not to the question you would want". Plugging the ratio of mean per-token probabilities into a Bradley-Terry model feels a bit like that.
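For concreteness, here is a minimal sketch of the two aggregation choices, assuming per-token log-probabilities and a mask marking which positions belong to the response are already available (the function and argument names are mine):

```python
import torch

def sequence_logps(token_logps, response_mask, length_normalize=False):
    # token_logps:   (batch, seq_len) log-probability of each token
    # response_mask: (batch, seq_len) 1.0 on response tokens, 0.0 elsewhere
    summed = (token_logps * response_mask).sum(dim=-1)
    if length_normalize:
        # Mean per-token log-prob: removes the mechanical length biases in
        # #1 and #2, but the quantity fed into the Bradley-Terry comparison
        # is no longer the sequence's actual log-probability.
        return summed / response_mask.sum(dim=-1).clamp(min=1.0)
    return summed

# Toy illustration of bias #1: identical per-token quality, different lengths.
logps = torch.full((2, 12), -0.5)           # every token at log-prob -0.5
mask = torch.zeros(2, 12)
mask[0, :3], mask[1, :12] = 1.0, 1.0        # 3-token vs. 12-token response
print(sequence_logps(logps, mask))                          # -1.5 vs -6.0: short wins
print(sequence_logps(logps, mask, length_normalize=True))   # -0.5 vs -0.5: tie
```

Whether the summed or averaged version is "right" depends on what you want the implicit reward to mean, which is exactly the "question sort of like the one you asked" problem.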
Ultimately a lot of this stuff works reasonably well in practice. End-to-end evaluations and tasteful expert "vibe checks" remain essential no matter what type of post-training you're doing.
Online vs. offline is coming next (I try to publish every 3 weeks)! Just started reading / working on this today. Thank you so much for the detailed writeup - definitely helpful as I start to get into this topic :)
Excited to read the next installment!