12 Comments
Eli

Great article, Cameron. Thanks!

One typo in the conclusion section where "require" appears twice: "algorithms like PPO that require require substantial domain knowledge..."

Cameron R. Wolfe, Ph.D.

Thank you! Fixed it!

Paul
Dec 7 (edited)

Very interesting, and thanks for the pointer to https://thinkingmachines.ai/blog/on-policy-distillation/

Thanks for the deep dive, from reward models through DPO and PPO to GRPO. Curious what comes next!

Is there a typo in your pseudocode?

Shouldn't this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - ref_per_token_logps)  # note: ref_per_token_logps appears twice here
    - 1
)
```

rather be this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)  # note: per_token_logps here
    - 1
)
```

?

Cameron R. Wolfe, Ph.D.

You're right! Fixed it, and thank you for catching this :)
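For anyone double-checking, here is a quick numerical sanity check of the corrected expression. Variable names follow the pseudocode above, but the log-probabilities are random placeholders, purely for illustration rather than outputs of real models:

```
import torch

torch.manual_seed(0)
per_token_logps = -5.0 * torch.rand(8)       # fake log-probs under the current policy
ref_per_token_logps = -5.0 * torch.rand(8)   # fake log-probs under the reference policy

ratio = torch.exp(ref_per_token_logps - per_token_logps)  # r = pi_ref / pi_theta per token

buggy = ratio - (ref_per_token_logps - ref_per_token_logps) - 1  # middle term cancels to 0
fixed = ratio - (ref_per_token_logps - per_token_logps) - 1      # r - log(r) - 1

print(buggy)  # reduces to r - 1: can be negative, not a valid KL estimate
print(fixed)  # r - log(r) - 1 >= 0 for every r > 0 (the standard "k3" KL estimator)
```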

Cameron R. Wolfe, Ph.D.

In terms of what comes next - I'm currently burned out on RL. So, I'm going to write up some stuff on Olmo 3 that'll come out next week. Then, I'll probably return to cap off the RL series by covering:

- GRPO variants

- Properties of RL versus supervised training (mostly related to generalization; lots of recent papers look at this)

- Maybe one final post that puts everything together into a big "RL for LLMs" reference post

Paul

I thought so. That was a LOT of detailed RL deep dives over the past months.

Looking forward to the Olmo 3 writeup! Impressive model, and super open about their approach from what I can tell from Nathan Lambert's posts.

Rainbow Roxy

Hey, great read as always. Your explanation of GRPO's refreshing simplicity and efficiency for LLM reasoning is spot on. It reminds me a bit of mastering a Pilates flow: simplifying complex movements for greater impact. Democratising RL research is such a crucial step for the field, really well put.

Cameron R. Wolfe, Ph.D.

Thanks for the kind words!

Neural Foundry

The comparison between the critic-free approach and PPO really clarifies why GRPO works so well for outcome rewards. When you're only getting feedback at the sequence level, trying to assign per-token credit through a value function seems like overengineering. The fact that distillation from R1 outperforms direct RL on smaller models is particularly interesting for practical deployment.
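For concreteness, a minimal sketch of that critic-free credit assignment (the function name and tensor shapes below are illustrative, not taken from the post): each completion in a group gets one scalar outcome reward, advantages are normalized within the group, and every token in a completion shares its sequence-level advantage.

```
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar outcome rewards, one per completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Normalize each completion's reward against the other completions for the
    # same prompt; no learned value function is needed for credit assignment.
    return (rewards - mean) / (std + eps)

# e.g., 4 completions for one prompt with binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)               # shape: (1, 4)
per_token_adv = adv.unsqueeze(-1).expand(-1, -1, 16)   # every token in a completion shares its advantage
print(adv)
```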

Cameron R. Wolfe, Ph.D.

Agree - thanks for reading!

Neural Foundry

Great stuff, subscribed!

Hanbyul Kim

Hi, thank you for a great article.

I have a quick question about the KL penalty estimation formulas in the figure. I computed the expected values of the two estimators, but I think they may have errors?

Top one: according to the definition of KL divergence, $\log p(x) - \log q(x)$ is correct, isn't it?

Bottom one: for the expected value to equal the KL divergence, I guess $r$ should be $p(x) / q(x)$?
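For reference, here is how the expectations work out under the usual convention, where samples are drawn from the current policy $q$ and $r = p(x)/q(x)$ (this is the standard unbiased "k3" estimator from Schulman's note on approximating KL divergence, not something quoted from the figure itself):

$$
\begin{aligned}
\mathrm{KL}(q\,\|\,p) &= \mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right],\\
\mathbb{E}_{x\sim q}\!\left[\,r-\log r-1\,\right]
&= \underbrace{\mathbb{E}_{x\sim q}\!\left[\frac{p(x)}{q(x)}\right]}_{=\,1}
+\;\mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right]-1
= \mathrm{KL}(q\,\|\,p), \qquad r=\frac{p(x)}{q(x)}.
\end{aligned}
$$

With $q = \pi_\theta$ (the policy being trained) and $p = \pi_{\text{ref}}$, the ratio $r$ is exactly what `torch.exp(ref_per_token_logps - per_token_logps)` computes in the pseudocode, so $r = p(x)/q(x)$ is indeed what makes the expectation equal the KL divergence. For the top one, the direction matters: $\mathbb{E}_{x\sim q}[\log q(x) - \log p(x)] = \mathrm{KL}(q\,\|\,p)$, so whether $\log p(x) - \log q(x)$ is right depends on which distribution the samples come from and which direction of the KL the figure intends.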