Discussion about this post

Eli

Great article Cameron. Thanks!

One typo in the conclusion section where "require" appears twice: "algorithms like PPO that require require substantial domain knowledge..."

Paul
Dec 7 (edited)

Interesting, interesting and thanks for the hint at https://thinkingmachines.ai/blog/on-policy-distillation/

Thanks for the deep dive from reward models through DPO and PPO to GRPO. Curious what comes next!

Is there a typo in your pseudocode?

Shouldn't this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - ref_per_token_logps)  # subtracts ref_per_token_logps from itself
    - 1
)
```

rather be this

```
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)  # second operand is per_token_logps
    - 1
)
```

?
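A quick way to see why the second operand matters is to evaluate both forms on a few numbers. This is only a minimal sketch in plain Python rather than torch; the function names `kl_quoted` and `kl_corrected` and the log-probability values are hypothetical, chosen purely for illustration and not taken from the post.

```python
import math

def kl_quoted(ref_logp, logp):
    # Form as quoted from the pseudocode: the middle term subtracts
    # ref_logp from itself, so it is always zero.
    return math.exp(ref_logp - logp) - (ref_logp - ref_logp) - 1

def kl_corrected(ref_logp, logp):
    # Proposed form: exp(d) - d - 1 with d = ref_logp - logp.
    # This expression is non-negative for every real d.
    d = ref_logp - logp
    return math.exp(d) - d - 1

# Hypothetical per-token log-probabilities (illustrative values only).
ref_logps = [-1.0, -2.0, -0.5]
logps = [-0.5, -2.0, -1.5]

quoted = [kl_quoted(r, p) for r, p in zip(ref_logps, logps)]
corrected = [kl_corrected(r, p) for r, p in zip(ref_logps, logps)]

print("quoted:   ", quoted)
print("corrected:", corrected)
```

With the quoted form, the per-token value reduces to `exp(d) - 1`, which goes negative whenever the policy assigns a token a higher log-probability than the reference. The corrected form stays at or above zero, which is what one would expect from a KL penalty term.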
