Great breakdown of the GRPO pain points. The entropy collapse issue particularly caught my attention since I noticed similar patterns in a recent project. The clip higher approach makes intuitive sense, but what surprised me was how small the adjustment needs to be (0.2 to 0.28) to stop low-probability tokens from being clipped off. This reminds me of how delicate hyperparameter tuning can be in RL setups, where a seemingly minor tweak cascades through the entire training dynamic. The combination of token-level loss aggregation with dynamic sampling seems to address orthogonal problems simultaneously, which is elegant. Curious if anyone's experimented with adaptive clipping schedules instead of static bounds?
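For concreteness, here's a rough sketch of what I mean by clip higher plus token-level aggregation (tensor names, shapes, and the function itself are my own assumptions, not taken from the post):

```python
import torch

def clip_higher_token_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    """Token-level surrogate with asymmetric (clip-higher) bounds.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the
    current and behavior policies; advantages: (batch, seq_len)
    group-relative advantages broadcast to tokens; mask: 1 for response tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                       # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # wider upper bound than lower
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: average over all valid tokens in the batch,
    # rather than averaging per sequence first (avoids length bias).
    return (per_token * mask).sum() / mask.sum()
```

An adaptive schedule would presumably just make eps_high a function of the training step or the observed entropy instead of a constant, which is what I'm curious about.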
Totally agreed on the sensitivity. Also, the clipping bound needs to be re-tuned depending on the exact type of aggregation you're doing; e.g., GSPO / GMPO require completely different clipping bounds.
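To illustrate why the bounds don't transfer, here's a minimal sketch of a GSPO-style length-normalized sequence ratio (again my own naming, not code from the post): because the per-token log-ratios are averaged before exponentiating, the resulting ratios sit much closer to 1 than token-level ratios do, so the clipping range has to be much tighter.

```python
import torch

def sequence_level_ratio(logp_new, logp_old, mask):
    """GSPO-style sequence importance ratio:
    s_i = exp( (1/|y_i|) * sum_t (logp_new - logp_old) ), one value per sequence.
    """
    log_ratio = (logp_new - logp_old) * mask            # (batch, seq_len)
    mean_log_ratio = log_ratio.sum(-1) / mask.sum(-1)   # (batch,)
    return torch.exp(mean_log_ratio)
```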
In terms of dynamic / adaptive clipping, there is this paper that's linked at the end of the post: https://arxiv.org/abs/2509.02333
I'd recommend reading the LitePPO paper (Alibaba, 2025), which runs comprehensive experiments on GRPO tricks.
It's linked at the end of the post!
Whether a reward model is used may be the difference between RLVR and RLHF, but it's not the difference between the GRPO family and PPO.
Yes this is just the difference between RLVR and RLHF. Both PPO and GRPO can be used with or without a reward model.
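A toy sketch of that orthogonality (function and names are hypothetical placeholders, just for illustration):

```python
def verifiable_reward(response: str, reference: str) -> float:
    """RLVR-style reward: a programmatic check, no reward model involved."""
    return 1.0 if response.strip() == reference.strip() else 0.0

def rm_reward(prompt: str, response: str, reward_model) -> float:
    """RLHF-style reward: score from a learned reward model."""
    return reward_model.score(prompt, response)

# Either reward source can feed either optimizer:
#   PPO  -> rewards -> value model + GAE advantages
#   GRPO -> rewards -> group-relative (group-normalized) advantages
```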
That's an interesting idea. I'll have to look into it too, thanks for the suggestion.