Discussion about this post

Michael Lopez Chiesa

Best single map of agentic RL I've read, and the takeaways section is the part people will actually reuse. One thread worth pulling tighter: echo trap, template collapse, and shrinking reasoning traces are one failure: the optimizer collapsing the policy onto whatever cheaply satisfies the in-loop verifier.

RAGEN-2 has the load-bearing insight: entropy is the wrong diagnostic, because you can hold high within-input entropy while going input-agnostic. You need mutual information (MI) between prompt and reasoning, which measures the property you actually want (does the reasoning depend on the prompt?) rather than a proxy that drifts from it. Seen that way, every fix here is the same move: make the cheap path to reward more expensive, or detect it. And they aren't one-time patches, because you're optimizing against a checker inside the loop, where any reward weakness is something the optimizer is paid to find.
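
To make the entropy-versus-MI point concrete, here is a toy sketch. It is not RAGEN-2's actual estimator. It assumes reasoning can be bucketed into a handful of discrete "templates" and that inputs are sampled uniformly, and every name in it (`entropy`, `diagnostics`, `collapsed`, `dependent`) is hypothetical. It compares the average within-input entropy H(Z|X) with the mutual information I(X;Z) = H(Z) - H(Z|X).

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def diagnostics(cond):
    """cond[x][z] = p(z | x): probability of reasoning template z given input x.

    Assumption: inputs x are uniformly distributed.
    Returns (H(Z|X), I(X;Z)) in bits.
    """
    n_inputs = len(cond)
    n_templates = len(cond[0])
    # Marginal p(z) under a uniform input distribution.
    marginal = [sum(row[z] for row in cond) / n_inputs for z in range(n_templates)]
    # Average within-input entropy: what an entropy monitor would see.
    h_cond = sum(entropy(row) for row in cond) / n_inputs
    # Mutual information: how much the reasoning depends on the input.
    mi = entropy(marginal) - h_cond
    return h_cond, mi

# Input-agnostic policy: the same high-entropy distribution for every input.
collapsed = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

# Input-dependent policy: each input favors a different template.
dependent = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

h_collapsed, mi_collapsed = diagnostics(collapsed)
h_dependent, mi_dependent = diagnostics(dependent)

print(f"collapsed: H(Z|X)={h_collapsed:.3f} bits, I(X;Z)={mi_collapsed:.3f} bits")
print(f"dependent: H(Z|X)={h_dependent:.3f} bits, I(X;Z)={mi_dependent:.3f} bits")
```

In this toy setup the input-agnostic policy has the higher within-input entropy yet zero MI, so an entropy monitor would rate it as the healthier of the two. That is the failure the comment describes.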

The survey's own finding that RL helps most on clean rule-based tasks and least on noisy WebArena is the verification gap at training time: uplift tracks how cleanly the reward verifies, so verifiable-reward RL inherits the same boundary it has at deployment. That the fixes rhyme across six independent frameworks is the real signal. Great piece.

Scenarica

The ScalingInter-RL finding is the one that translates furthest beyond ML. "Excessive exploration in early stages is not necessarily a good choice. Before establishing a solid foundation, the agent may perform unproductive and inefficient exploration." That's the curriculum learning result, but it's also the single best summary of how human expertise development works and why skipping the apprenticeship phase produces failure rather than acceleration.

Every domain has the same structure. You don't start a junior analyst on ten-year geopolitical forecasts. You start them on short-horizon, verifiable questions and extend the time horizon as their pattern recognition develops. The finding that restricting the interaction budget in early phases produces better long-term performance than starting with a large budget is the RL version of something every senior practitioner knows intuitively: give a novice too much freedom before they've built the basics and they produce noise, not learning. The training process and the human development process face the same constraint, and both fail the same way when you skip the phase that feels slow but turns out to be load-bearing.
