Best single map of agentic RL I've read, and the takeaways section is the part people will actually reuse. One thread worth pulling tighter: echo trap, template collapse, and shrinking reasoning traces are the same failure, namely the optimizer collapsing the policy onto whatever cheaply satisfies the in-loop verifier.
RAGEN-2 has the load-bearing insight: entropy is the wrong diagnostic, because a policy can hold high within-input entropy while going input-agnostic. What you need is mutual information (MI) between input and reasoning, which measures the property you actually want (does the reasoning depend on the prompt) rather than a proxy that drifts from it. Seen that way, every fix here is the same move: make the cheap path to reward more expensive, or detect it. And they aren't one-time patches, because you're optimizing against a checker inside the loop, where any reward weakness is something the optimizer is paid to find.
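To make the entropy-versus-MI point concrete, here is a toy sketch. It assumes reasoning traces can be bucketed into discrete IDs and uses plug-in estimates from counts. The function names and the two synthetic datasets are mine for illustration, not anything taken from RAGEN-2.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def conditional_entropy_and_mi(samples):
    """samples: list of (input_id, reasoning_id) pairs.

    Returns (H(Z|X), I(X;Z)) in bits, where X is the input and Z the
    reasoning bucket, using I(X;Z) = H(Z) - H(Z|X).
    """
    n = len(samples)
    joint = Counter(samples)
    x_counts = Counter(x for x, _ in samples)
    z_counts = Counter(z for _, z in samples)
    h_z = entropy(list(z_counts.values()))
    h_z_given_x = 0.0
    for x, nx in x_counts.items():
        row = [c for (xi, _), c in joint.items() if xi == x]
        h_z_given_x += (nx / n) * entropy(row)
    return h_z_given_x, h_z - h_z_given_x

# Hypothetical collapsed policy: every input draws from the same 4 templates.
collapsed = [(x, z) for x in range(4) for z in range(4)]
# Hypothetical grounded policy: each input has its own 4 distinct traces.
grounded = [(x, (x, k)) for x in range(4) for k in range(4)]

print(conditional_entropy_and_mi(collapsed))  # (2.0, 0.0)
print(conditional_entropy_and_mi(grounded))   # (2.0, 2.0)
```

Both toy policies show the same 2 bits of within-input entropy, so an entropy monitor cannot tell them apart. Only the MI term separates the input-agnostic one from the grounded one.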
The survey's own finding that RL helps most on clean rule-based tasks and least on noisy WebArena is the verification gap at training time: uplift tracks how cleanly the reward verifies, so verifiable-reward RL inherits the same boundary it has at deployment. That the fixes rhyme across six independent frameworks is the real signal. Great piece.
Thanks, and agree with everything you said! I think there is still a lot to be discovered w.r.t. how to avoid various types of collapse on long horizon problems. The ideal solution may actually be lower level than what's discussed by RAGEN / RAGEN-2; e.g., GLM-5.2 moves from GRPO to PPO in order to improve stability on long horizon tasks, indicating that a per-token value prediction + GAE is helpful in this area.
Agree, and I'd read PPO as attacking the same thing lower in the stack rather than instead of it. GRPO's one group-relative advantage per trajectory means a single bad turn gets the same credit as the rest, which is exactly the regime where a locally-rewarded template can take over without the advantage localizing blame. Per-token value + GAE is finer credit assignment: it denies the policy the flat signal that lets the cheap path ride uniformly. The question I'd have is whether the learned critic becomes the next in-loop target, since a value function trained alongside the policy can inherit its blind spots.
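A rough sketch of the credit-assignment difference, under stated assumptions: the GRPO side uses the common outcome-reward form (reward minus group mean, divided by group standard deviation, broadcast to every token), and the GAE side is the standard recursion. The rewards and critic values are invented for illustration, and population standard deviation is my choice here. Real implementations differ on that detail.

```python
import statistics

def grpo_advantages(group_rewards, lengths, eps=1e-8):
    """One group-relative advantage per trajectory, copied to every token."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)  # assumption: population std
    return [[(r - mean) / (std + eps)] * n
            for r, n in zip(group_rewards, lengths)]

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] and values[t] are per-step; the value after the final
    step is taken to be 0 (episode terminates).
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Two hypothetical rollouts in one group: one succeeds, one fails.
flat = grpo_advantages([1.0, 0.0], lengths=[4, 4])

# Hypothetical critic values: step 1 is a bad turn (value drops 0.5 -> 0.1).
step_rewards = [0.0, 0.0, 0.0, 1.0]
critic_values = [0.5, 0.5, 0.1, 0.9]
one_step = gae_advantages(step_rewards, critic_values, lam=0.0)
```

In `flat`, every token in a trajectory carries the same number, so the bad turn is indistinguishable from the good ones. In `one_step`, the step where the critic's estimate drops gets a negative advantage while the recovery step gets a positive one. That only helps to the extent the critic is right, which is the worry about it becoming the next in-loop target.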
The ScalingInter-RL finding is the one that translates furthest beyond ML. "Excessive exploration in early stages is not necessarily a good choice. Before establishing a solid foundation, the agent may perform unproductive and inefficient exploration." That's the curriculum learning result, but it's also the single best summary of how human expertise development works and why skipping the apprenticeship phase produces failure rather than acceleration.
Every domain has the same structure. You don't start a junior analyst on ten-year geopolitical forecasts. You start them on short-horizon, verifiable questions and extend the time horizon as their pattern recognition develops. The finding that restricting the interaction budget in early phases produces better long-term performance than starting with a large budget is the RL version of something every senior practitioner knows intuitively: give a novice too much freedom before they've built the basics and they produce noise, not learning. The training process and the human development process face the same constraint, and both fail the same way when you skip the phase that feels slow but turns out to be load-bearing.
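For the RL side of that analogy, a staged interaction budget can be sketched in a few lines. This is only an illustration of the idea of starting with a short horizon and extending it. The function name, the step thresholds, and the turn counts are all hypothetical and are not the schedule ScalingInter-RL actually uses.

```python
def interaction_budget(step, start_turns=5, max_turns=30,
                       grow_every=100, grow_by=5):
    """Max environment turns allowed per rollout at a given training step.

    Starts small and grows by `grow_by` turns every `grow_every` steps,
    capped at `max_turns`. All defaults are illustrative placeholders.
    """
    stages = step // grow_every
    return min(max_turns, start_turns + grow_by * stages)

# Early training is confined to short episodes; later training gets long ones.
schedule = [interaction_budget(s) for s in (0, 99, 100, 250, 10_000)]
print(schedule)  # [5, 5, 10, 15, 30]
```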