Totally agreed in terms of the sensitivity. Also, the clipping bound needs to be completely re-tuned depending on the exact type of aggregation you're doing; e.g., GSPO / GMPO require a completely different clipping bound.
In terms of dynamic / adaptive clipping, there is this paper that's linked at the end of the post: https://arxiv.org/abs/2509.02333
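To make the aggregation point concrete, here's a minimal sketch of why the clipping bound can't transfer between schemes. It contrasts token-level (PPO/GRPO-style) clipping with sequence-level (GSPO-style) clipping; the epsilon values are illustrative defaults of my own choosing, not the tuned constants from any paper:

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, mode="grpo",
                      eps_token=0.2, eps_seq=3e-4):
    """Sketch of the clipped surrogate (to be maximized) under two
    aggregation schemes. logp_new / logp_old: (batch, seq_len) per-token
    log-probs (assumed unpadded or pre-masked); advantages: (batch,)
    sequence-level advantages."""
    log_ratio = logp_new - logp_old
    if mode == "grpo":
        # Token-level importance ratio, PPO-style clipping;
        # eps around 0.2 is the conventional range at this scale.
        ratio = log_ratio.exp()                      # (batch, seq_len)
        adv = advantages.unsqueeze(-1)               # broadcast over tokens
        clipped = ratio.clamp(1 - eps_token, 1 + eps_token)
        return torch.minimum(ratio * adv, clipped * adv).mean()
    elif mode == "gspo":
        # Sequence-level ratio: the length-normalized (geometric-mean)
        # ratio concentrates tightly around 1, so the clipping band has
        # to be orders of magnitude narrower. eps_seq is illustrative of
        # the ~1e-4-scale values reported for GSPO, not a tuned constant.
        seq_ratio = log_ratio.mean(dim=-1).exp()     # (batch,)
        clipped = seq_ratio.clamp(1 - eps_seq, 1 + eps_seq)
        return torch.minimum(seq_ratio * advantages,
                             clipped * advantages).mean()
    raise ValueError(f"unknown mode: {mode}")
```

The point is just that the quantity being clipped lives on a completely different scale depending on how you aggregate, so a bound tuned for one scheme is meaningless for the other.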
Great post, thank you so much for it! I think it might be my personal favourite so far
Thanks for reading!
As always, great work, thank you!
Thanks for reading, and I'm glad it was helpful!
I'd recommend reading the LitePPO paper (Alibaba, 2025), which presents comprehensive experiments on GRPO tricks.
It's linked at the end of the post!
Whether a reward model is used may be the difference between RLVR and RLHF, but it's not the difference between the GRPO family and PPO.
Yes, this is just the difference between RLVR and RLHF. Both PPO and GRPO can be used with or without a reward model.
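A quick sketch of why the two choices are orthogonal (function names here are mine, purely for illustration): the optimizer only ever consumes scalar rewards, so it can't tell a verifier from a reward model.

```python
import torch

def rlvr_reward(completion: str, gold: str) -> float:
    # RLVR: a programmatic verifier, e.g. exact match on the final answer.
    return 1.0 if completion.strip() == gold.strip() else 0.0

def rlhf_reward(completion: str, reward_model) -> float:
    # RLHF: a learned reward model (any callable returning a scalar).
    return float(reward_model(completion))

def grpo_advantages(rewards):
    # Group-relative advantages are computed identically no matter
    # where the scalar rewards came from. PPO would likewise just
    # consume the scalars (via its critic/GAE) regardless of source.
    r = torch.tensor(rewards)
    return (r - r.mean()) / (r.std() + 1e-8)

# Either reward source plugs into the same update:
advantages = grpo_advantages(
    [rlvr_reward(c, "42") for c in ["42", "41", "42", "7"]]
)
```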
That's an interesting idea. I'll have to look into it too; thanks for the suggestion.