10 Comments
Nicola Dainese:

Great post, thank you so much for it! I think it might be my personal favourite so far

Cameron R. Wolfe, Ph.D.:

Thanks for reading!

Paul:

As always, great work, thank you!

Cameron R. Wolfe, Ph.D.:

Thanks for reading, and I'm glad it was helpful!

Ethan Ray:

I recommend reading the LitePPO paper (Alibaba, 2025). It presents comprehensive experiments on GRPO tricks.

Cameron R. Wolfe, Ph.D.:

It's linked at the end of the post!

Ethan Ray:

Maybe whether a reward model is used is the difference between RLVR and RLHF, but it's not the difference between the GRPO family and PPO.

Cameron R. Wolfe, Ph.D.:

Yes, this is just the difference between RLVR and RLHF. Both PPO and GRPO can be used with or without a reward model.
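
To make the distinction concrete, here is a minimal sketch (PyTorch, toy rewards) of how the GRPO group-normalized advantage is agnostic to where the rewards come from: a verifiable checker (RLVR) or a learned reward model (RLHF). The `reward_model` call is a hypothetical placeholder, not a real API.

```python
import torch

def verifiable_reward(response: str, answer: str) -> float:
    # RLVR-style reward: a programmatic check, no reward model needed.
    return 1.0 if answer in response else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO advantage: normalize each reward against the other responses
    # sampled for the same prompt (group mean / std baseline).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Either reward source can feed the same advantage computation.
prompt, answer = "What is 2 + 2?", "4"
responses = ["2 + 2 = 4", "The answer is 5", "It's 4", "I'm not sure"]

# RLVR: rewards from a verifiable check.
rlvr_rewards = torch.tensor([verifiable_reward(r, answer) for r in responses])

# RLHF: rewards from a learned reward model (hypothetical `reward_model`).
# rlhf_rewards = torch.tensor([reward_model(prompt, r) for r in responses])

print(grpo_advantages(rlvr_rewards))
```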

waynestack:

That's an interesting idea. I will have to look into it too; thanks for the suggestion.

[Comment removed] (Jan 5)
Cameron R. Wolfe, Ph.D.:

Totally agreed in terms of the sensitivity. Also, the clipping bound needs to be re-tuned depending on the exact type of aggregation you're doing; e.g., GSPO / GMPO require completely different clipping bounds.

In terms of dynamic / adaptive clipping, there is this paper that's linked at the end of the post: https://arxiv.org/abs/2509.02333
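
A minimal sketch of why the clipping bound cannot be carried over between aggregation schemes: with token-level ratios (GRPO / PPO style) each token's importance ratio is clipped separately, while GSPO clips a length-normalized sequence-level ratio that is far less dispersed, so the bound must be much tighter. Both epsilon values below are illustrative placeholders, not the papers' exact settings.

```python
import torch

def token_level_objective(logp_new, logp_old, adv, eps=0.2):
    # GRPO / PPO-style: clip each token's importance ratio separately.
    # eps=0.2 is a common token-level default (illustrative).
    ratio = torch.exp(logp_new - logp_old)            # shape: [seq_len]
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()

def sequence_level_objective(logp_new, logp_old, adv, eps=3e-4):
    # GSPO-style: the importance ratio is a length-normalized, sequence-level
    # quantity (exp of the mean token log-ratio), so the clipping bound has
    # to be far tighter (eps here is illustrative, not the paper's value).
    seq_ratio = torch.exp((logp_new - logp_old).mean())  # scalar
    clipped = torch.clamp(seq_ratio, 1 - eps, 1 + eps)
    return torch.min(seq_ratio * adv, clipped * adv)

# Toy example: per-token log-probs for one sampled response.
logp_old = torch.tensor([-1.2, -0.8, -2.1, -0.5])
logp_new = torch.tensor([-1.0, -0.9, -1.9, -0.6])
adv = torch.tensor(0.7)  # group-normalized advantage for this response

print(token_level_objective(logp_new, logp_old, adv))
print(sequence_level_objective(logp_new, logp_old, adv))
```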