gosh, DPO is so forgotten man, was such a major innovation.
Do you think it is still a relevant / useful tool for post-training today? I mostly made this as part of a "learn RL for LLMs from scratch" lecture series, but obviously the usage of DPO has declined a lot since its proposal. My current thought is that it's still useful for cheap / polishing iterations on formatting and style (or for ironing out small issues that might come up for a model), but PPO now consumes a much larger portion of the post-training process even for OSS models and does all the heavy lifting.
yes. tulu 3 is a great example. it's the simplest baseline and that's often the strongest for small teams. now i bet doing DPO after RL for reasoning is the default for what we should do.
You have become my fav!
thank you for reading!
thank you, this was super helpful and insightful, please keep on making more of these!
Glad it was helpful. More coming for sure!
Another great overview of a topic that is sorely lacking good entry-level explanations. You are cooking! Keep up the amazing work!
If you do a follow-up in this post-training series, I would suggest expanding a bit on the intuition for why off-policy methods are more limited than on-policy methods. The way I typically explain it: in step 0 of the training run, DPO is near-equivalent to RLHF with a good RM trained on the same preference dataset. As the policy model is updated during training, the two methods diverge:
- The samples used in each step of the RLHF run will get progressively better, which ideally allows the training process to continue to push response quality from "good" to "great". The analogy I use is adaptive learning in a school setting: If a student is mastering a set of single-digit addition flash cards, a good system will adapt by presenting harder flash cards (double digit, multiplication, etc.) that track the frontier of the student's current understanding.
- The samples used in each step of the DPO run are the same static level of quality from the original SFT model. This is like continuing to exclusively show the student single-digit addition flash cards, even when they are getting 95% of them right.
That being said, both methods are fundamentally limited in two ways that tend to reduce the upside of RLHF:
- The reward model (explicit in RLHF, implicit in DPO) won't necessarily generalize to discriminate on "harder" cases than those in the original preference dataset. As the RLHF model generates better responses over the course of training, it will often just saturate the RM's ability to discriminate between good responses, and plateau at a level where further "improvements" just represent reward hacking.
- Both methods are limited in how far they can diverge from the SFT model by the KL divergence term in the loss function. This is necessary for training stability and for avoiding degenerate cases of "reward hacking" (again, implicit in the case of DPO; see the sketch below).
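To make the "implicit" part concrete, here is a minimal sketch of the DPO loss in PyTorch. The function name and the assumption that log-probabilities have already been summed per response are mine; the beta coefficient plays the role of the KL-strength term, and the log-ratio against the frozen reference (SFT) model is what keeps the policy anchored, rather than an explicit KL penalty as in PPO.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward for a response y: beta * (log pi(y|x) - log pi_ref(y|x)).
    # The reference term is where the "stay close to the SFT model" constraint
    # lives; there is no separate KL penalty as in PPO-style RLHF.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that nothing in this loss samples from the current policy, which is exactly the off-policy limitation described above.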
Another hairy issue with DPO is how to handle cases where the two responses in a preference pair have significantly different lengths. This poses three related problems:
1. Longer sequences of tokens inherently have lower cumulative probability. If the probability of each token is similar, then more things have to "go right" to get a longer string, including trivial differences in wording, punctuation, etc. This creates a bias toward shorter responses within specific response pairs.
2. Longer sequences of tokens mechanically produce larger-magnitude training losses, because the loss is calculated by summing over more tokens. This biases training toward weighting longer response pairs more heavily within the dataset.
3. Preference datasets often have a bias towards longer responses. Raters are just empirically more likely to click "prefer" on the response that has more detail, even if they wouldn't holistically prefer a model that is more verbose in every case. This biases the training process towards longer responses, which is mechanically correct but often doesn't match the model developer's overall goals.
#1 and #2 are often addressed by introducing length-normalization - instead of calculating a sum of log probabilities over each of the responses, we calculate average per-token log-probability (i.e. divide the sum by the token count). This "fixes" the issue by making the loss length-invariant, but it does so by changing the training objective to something that doesn't necessarily match our goals. A coworker used to describe statistical techniques as "giving an answer to a question sort of like the one you asked, and maybe not to the question you would want". Plugging the ratio of mean per-token probabilities into a Bradley-Terry model feels a bit like that.
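For concreteness, here is a minimal sketch of the two aggregation choices, assuming per-token log-probabilities and a mask marking which positions belong to the response are already available (the function and argument names are mine):

```python
import torch

def sequence_logps(token_logps, response_mask, length_normalize=False):
    # token_logps:   (batch, seq_len) log-probability of each token
    # response_mask: (batch, seq_len) 1.0 on response tokens, 0.0 elsewhere
    summed = (token_logps * response_mask).sum(dim=-1)
    if length_normalize:
        # Mean per-token log-prob: removes the mechanical length biases in
        # #1 and #2, but the quantity fed into the Bradley-Terry comparison
        # is no longer the sequence's actual log-probability.
        return summed / response_mask.sum(dim=-1).clamp(min=1.0)
    return summed

# Toy illustration of bias #1: identical per-token quality, different lengths.
logps = torch.full((2, 12), -0.5)           # every token at log-prob -0.5
mask = torch.zeros(2, 12)
mask[0, :3], mask[1, :12] = 1.0, 1.0        # 3-token vs. 12-token response
print(sequence_logps(logps, mask))                          # -1.5 vs -6.0: short wins
print(sequence_logps(logps, mask, length_normalize=True))   # -0.5 vs -0.5: tie
```

Whether the summed or averaged version is "right" depends on what you want the implicit reward to mean, which is exactly the "question sort of like the one you asked" problem.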
Ultimately a lot of this stuff works reasonably well in practice. End-to-end evaluations and tasteful expert "vibe checks" remain essential no matter what type of post-training you're doing.
Online vs. offline is coming next (I try to publish every 3 weeks)! Just started reading / working on this today. Thank you so much for the detailed writeup - definitely helpful as I start to get into this topic :)
Excited to read the next installment!