12 Comments
Eli

Great article, Cameron. Thanks!

One typo in the conclusion section where "require" appears twice: "algorithms like PPO that require require substantial domain knowledge..."

Cameron R. Wolfe, Ph.D.

Thank you! Fixed it!

Paul
Dec 7 (edited)

Very interesting, and thanks for the pointer to https://thinkingmachines.ai/blog/on-policy-distillation/

Thanks for the deep dive, from reward models through DPO and PPO to GRPO. Curious what comes next!

Is there a typo in your pseudocode?

Shouldn't this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - ref_per_token_logps)  # note: ref_per_token_logps appears twice here
    - 1
)
```

rather be this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)  # note: per_token_logps here
    - 1
)
```

?

Cameron R. Wolfe, Ph.D.

You're right! Fixed it, and thank you for catching this :)
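For anyone double-checking, here is a quick numerical sanity check of the corrected expression. Variable names follow the pseudocode above, but the log-probabilities are random placeholders, purely for illustration rather than outputs of real models:

```
import torch

torch.manual_seed(0)
per_token_logps = -5.0 * torch.rand(8)       # fake log-probs under the current policy
ref_per_token_logps = -5.0 * torch.rand(8)   # fake log-probs under the reference policy

ratio = torch.exp(ref_per_token_logps - per_token_logps)  # r = pi_ref / pi_theta per token

buggy = ratio - (ref_per_token_logps - ref_per_token_logps) - 1  # middle term cancels to 0
fixed = ratio - (ref_per_token_logps - per_token_logps) - 1      # r - log(r) - 1

print(buggy)  # reduces to r - 1: can be negative, not a valid KL estimate
print(fixed)  # r - log(r) - 1 >= 0 for every r > 0 (the standard "k3" KL estimator)
```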

Cameron R. Wolfe, Ph.D.

In terms of what comes next - I'm currently burned out on RL. So, I'm going to write up some stuff on Olmo 3 that'll come out next week. Then, I'll probably return to cap off the RL series by covering:

- GRPO variants

- Properties of RL versus supervised training (mostly related to generalization; lots of recent papers look at this)

- Maybe one final post that puts everything together into a big "RL for LLMs" reference post

Paul

I thought so. That was a LOT of detailed RL deep dives over the past months.

Looking forward to the Olmo 3 writeup! Impressive model, and super open about their approach from what I can tell from Nathan Lambert's posts.

Rainbow Roxy

Hey, great read as always. Your explanation of GRPO's refreshing simplicity and efficiency for LLM reasoning is spot on. It reminds me a bit of mastering a Pilates flow: simplifying complex movements for greater impact. Democratising RL research is such a crucial step for the field, really well put.

Cameron R. Wolfe, Ph.D.

Thanks for the kind words!

Neural Foundry

The comparison between the critic-free approach and PPO really clarifies why GRPO works so well for outcome rewards. When you're only getting feedback at the sequence level, trying to assign per-token credit through a value function seems like overengineering. The fact that distillation from R1 outperforms direct RL on smaller models is particularly interesting for practical deployment.
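For concreteness, a minimal sketch of that critic-free credit assignment (the function name and tensor shapes below are illustrative, not taken from the post): each completion in a group gets one scalar outcome reward, advantages are normalized within the group, and every token in a completion shares its sequence-level advantage.

```
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar outcome rewards, one per completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Normalize each completion's reward against the other completions for the
    # same prompt; no learned value function is needed for credit assignment.
    return (rewards - mean) / (std + eps)

# e.g., 4 completions for one prompt with binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)               # shape: (1, 4)
per_token_adv = adv.unsqueeze(-1).expand(-1, -1, 16)   # every token in a completion shares its advantage
print(adv)
```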

Cameron R. Wolfe, Ph.D.

Agree - thanks for reading!

Neural Foundry

Great stuff, subscribed!

Hanbyul Kim

Hi, thank you for a great article.

I have a quick question about the KL penalty estimation formulas in the figure. I computed the expected values of the two estimators, but I think they may have errors?

Top one: according to the definition of KL divergence, $\log p(x) - \log q(x)$ is correct, isn't it?

Bottom one: for the expected value to equal the KL divergence, I guess $r$ should be $p(x) / q(x)$?
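For reference, here is how the expectations work out under the usual convention, where samples are drawn from the current policy $q$ and $r = p(x)/q(x)$ (this is the standard unbiased "k3" estimator from Schulman's note on approximating KL divergence, not something quoted from the figure itself):

$$
\begin{aligned}
\mathrm{KL}(q\,\|\,p) &= \mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right],\\
\mathbb{E}_{x\sim q}\!\left[\,r-\log r-1\,\right]
&= \underbrace{\mathbb{E}_{x\sim q}\!\left[\frac{p(x)}{q(x)}\right]}_{=\,1}
+\;\mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right]-1
= \mathrm{KL}(q\,\|\,p), \qquad r=\frac{p(x)}{q(x)}.
\end{aligned}
$$

With $q = \pi_\theta$ (the policy being trained) and $p = \pi_{\text{ref}}$, the ratio $r$ is exactly what `torch.exp(ref_per_token_logps - per_token_logps)` computes in the pseudocode, so $r = p(x)/q(x)$ is indeed what makes the expectation equal the KL divergence. For the top one, the direction matters: $\mathbb{E}_{x\sim q}[\log q(x) - \log p(x)] = \mathrm{KL}(q\,\|\,p)$, so whether $\log p(x) - \log q(x)$ is right depends on which distribution the samples come from and which direction of the KL the figure intends.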