Great article Cameron. Thanks!
One typo in the conclusion section where "require" appears twice: "algorithms like PPO that require require substantial domain knowledge..."
Thank you! Fixed it!
Interesting, interesting, and thanks for the hint at https://thinkingmachines.ai/blog/on-policy-distillation/
Thanks for the deep dive from Reward Models over DPO and PPO to GRPO. Curious what comes next!
Is there a typo in your pseudocode?
Shouldn't this
```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - ref_per_token_logps)  # <-- second term here is ref_per_token_logps
    - 1
)
```
rather be this
```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)  # <-- second term here is per_token_logps
    - 1
)
```
?
You're right! Fixed it, and thank you for catching this :)
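For anyone following along, here's a quick sanity check on the corrected expression (a standalone sketch with made-up log-probs, not the actual training code); it's the r - log(r) - 1 style KL estimate, with r being the per-token reference-to-policy probability ratio:
```
import torch

torch.manual_seed(0)

# Made-up per-token log-probs, just for illustration (same names as above).
per_token_logps = torch.log(torch.rand(4))
ref_per_token_logps = torch.log(torch.rand(4))

# Corrected estimate: exp(log r) - log r - 1, with log r = ref logps - policy logps.
log_r = ref_per_token_logps - per_token_logps
kl_div_alt = torch.exp(log_r) - log_r - 1

# Elementwise nonnegative, and exactly zero when the two policies agree
# (which is why the ref-minus-ref version collapses its middle term to zero).
assert (kl_div_alt >= 0).all()
print(kl_div_alt)
```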
In terms of what comes next - I'm currently burned out on RL. So, I'm going to write up some stuff on Olmo 3 that'll come out next week. Then, I'll probably return to cap off the RL series by covering:
- GRPO variants
- Properties of RL versus supervised training (mostly related to generalization; lots of recent papers look at this)
- Maybe one final post that puts everything together into a big "RL for LLMs" reference post
I thought so; that was A LOT of detailed RL deep dives over the past months.
Looking forward to the Olmo 3 write-up! Impressive model, and super open about their approach from what I can tell from Nathan Lambert.
Hey, great read as always. Your explanation of GRPO's refreshing simplicity and efficiency for LLM reasoning is spot on. It reminds me a bit of mastering a Pilates flow, simplifying complex movements for greater impact. Democratising RL research is such a crucial step for the field, really well put.
Thanks for the kind words!
The comparison between the critic-free approach and PPO really clarifies why GRPO works so well for outcome rewards. When you're only getting feedback at the sequence level, trying to assign per-token credit through a value function seems like overengineering. The fact that distillation from R1 outperforms direct RL on smaller models is particularly interesting for practical deployment.
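To check my own intuition, here is roughly how I picture the group-relative advantage that replaces the critic (my own toy sketch, not code from the post):
```
import torch

# One prompt, a group of G sampled completions, one scalar outcome reward each
# (e.g. 1.0 if the final answer is correct, 0.0 otherwise).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# Standardize rewards within the group instead of subtracting a learned
# value-function baseline; every token in a completion then shares its
# completion's sequence-level advantage.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)
```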
Agree - thanks for reading!
Great stuff, subscribed!
Hi, thank you for a great article.
I have a quick question about the KL penalty estimation formulas in the figure. I computed the expected value of the two estimators, and I think they may have errors?
Top one: according to the definition of KL divergence, it should be $\log p(x) - \log q(x)$, shouldn't it?
Bottom one: for this expected value to equal the KL divergence, I think $r$ should be $p(x) / q(x)$?
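For what it's worth, here is the quick Monte Carlo check I used to convince myself of the identities (my own sketch, with two arbitrary Gaussians standing in for p and q; they are not the policies from the post):
```
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# q is the sampling distribution, p is the other distribution.
mu_p, sig_p = 0.0, 1.0
mu_q, sig_q = 0.5, 1.2

# Closed-form KL(q || p) between two univariate Gaussians, for reference.
kl_closed = np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p) ** 2) / (2 * sig_p**2) - 0.5

x = rng.normal(mu_q, sig_q, size=1_000_000)                          # x ~ q
log_r = gauss_logpdf(x, mu_p, sig_p) - gauss_logpdf(x, mu_q, sig_q)  # log r, with r = p(x) / q(x)

k1 = (-log_r).mean()                     # E_q[log q(x) - log p(x)]  -> KL(q || p)
k3 = (np.exp(log_r) - log_r - 1).mean()  # E_q[r - log r - 1]        -> KL(q || p), typically lower variance
print(f"closed form: {kl_closed:.4f}   k1: {k1:.4f}   k3: {k3:.4f}")
```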