Great breakdown of the GRPO pain points. The entropy collapse issue particularly caught my attention since I noticed similar patterns in a recent project. The clip higher approach makes intuitive sense, but what surprised me was how small the adjustment needs to be (0.2 to 0.28) to stop low-probability tokens from being clipped off. This reminds me of how delicate hyperparameter tuning can be in RL setups, where a seemingly minor tweak cascades through the entire training dynamic. The combination of token-level loss aggregation with dynamic sampling seems to address orthogonal problems simultaneously, which is elegant. Curious if anyone's experimented with adaptive clipping schedules instead of static bounds?
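For concreteness, here's a rough sketch of what I mean by clip higher plus token-level aggregation (tensor names, shapes, and the function itself are my own assumptions, not taken from the post):

```python
import torch

def clip_higher_token_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    """Token-level surrogate with asymmetric (clip-higher) bounds.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the
    current and behavior policies; advantages: (batch, seq_len)
    group-relative advantages broadcast to tokens; mask: 1 for response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                       # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # wider upper bound than lower
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: average over all valid tokens in the batch,
    # rather than averaging per sequence first (avoids length bias).
    return (per_token * mask).sum() / mask.sum()
```

An adaptive schedule would presumably just make eps_high a function of the training step or the observed entropy instead of a constant, which is what I'm curious about.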
Totally agreed on the sensitivity. Also, the clipping bound needs to be re-tuned depending on the exact type of aggregation you're doing; e.g., GSPO / GMPO require completely different clipping bounds.
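To illustrate why the bounds don't transfer, here's a minimal sketch of a GSPO-style length-normalized sequence ratio (again my own naming, not code from the post): because the per-token log-ratios are averaged before exponentiating, the resulting ratios sit much closer to 1 than token-level ratios do, so the clipping range has to be much tighter.

```python
import torch

def sequence_level_ratio(logp_new, logp_old, mask):
    """GSPO-style sequence importance ratio:
    s_i = exp( (1/|y_i|) * sum_t (logp_new - logp_old) ), one value per sequence.
    """
    log_ratio = (logp_new - logp_old) * mask            # (batch, seq_len)
    mean_log_ratio = log_ratio.sum(-1) / mask.sum(-1)   # (batch,)
    return torch.exp(mean_log_ratio)
```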
In terms of dynamic / adaptive clipping, there is this paper that's linked at the end of the post: https://arxiv.org/abs/2509.02333
I'd recommend reading the LitePPO paper (Alibaba, 2025), which runs comprehensive experiments on GRPO tricks.
It's linked at the end of the post!
Whether a reward model is used may be the difference between RLVR and RLHF, but it's not the difference between the GRPO family and PPO.
Yes this is just the difference between RLVR and RLHF. Both PPO and GRPO can be used with or without a reward model.
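A toy sketch of that orthogonality (function and names are hypothetical placeholders, just for illustration):

```python
def verifiable_reward(response: str, reference: str) -> float:
    """RLVR-style reward: a programmatic check, no reward model involved."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def rm_reward(prompt: str, response: str, reward_model) -> float:
    """RLHF-style reward: score from a learned reward model."""
    return reward_model.score(prompt, response)

# Either reward source can feed either optimizer:
#   PPO  -> rewards -> value model + GAE advantages
#   GRPO -> rewards -> group-relative (group-normalized) advantages
```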
That's an interesting idea. I'll have to look into it too, thanks for the suggestion.