7 Comments
Paul:

That's what I call a deep dive; it took me 4 hours to read! Thank you!

I'll watch the talk from Nathan.

Cameron R. Wolfe, Ph.D.:

I guess the reading time estimate from Substack is way off! Lol

The AI Architect:

The OlmoRL infrastructure improvements are clutch for scaling RLVR. Cutting training time from 15 to 6 days via asynchronous updates and in-flight weight refreshes is the kind of eng work that actually matters when compute budgets dominate. Been wrestling with similar inference bottlenecks on our side, and the 5x inference-to-training ratio really captures why naive RL setups struggle at scale. The truncated importance sampling fix for engine mismatches is kinda slept on too, subtle but essential for numerical stability in heterogeneous environments.
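For readers unfamiliar with the trick: the truncated importance sampling fix mentioned above amounts to clipping a per-token likelihood ratio between the engine that sampled the tokens and the engine that computes gradients. A minimal sketch below, where the function name and cap value are illustrative assumptions, not taken from the Olmo 3 codebase:

```python
import math

def truncated_is_weights(logp_trainer, logp_sampler, c_max=2.0):
    """Per-token truncated importance sampling weights (illustrative).

    Corrects for the numerical mismatch between the inference engine
    that sampled the tokens (logp_sampler) and the training engine
    that rescores them (logp_trainer). Capping the ratio at c_max
    bounds the variance of the correction, which is what keeps the
    update numerically stable when the two engines disagree.
    """
    return [
        min(math.exp(lt - ls), c_max)
        for lt, ls in zip(logp_trainer, logp_sampler)
    ]

# Matching log-probs give a weight of exactly 1.0; tokens where the
# trainer assigns much higher probability than the sampler are capped.
w = truncated_is_weights([-1.0, -0.5, -2.0], [-1.0, -1.5, -2.1], c_max=2.0)
```

These weights would multiply the per-token policy-gradient loss; the one-sided truncation trades a little bias for a hard variance bound.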

Cameron R. Wolfe, Ph.D.:

Totally agree. Based on Olmo 3, it seems like importance sampling is essential for stable training. And yes, async is very important; this idea is explained really well in PipelineRL: https://arxiv.org/abs/2509.19128

Cameron R. Wolfe, Ph.D.:

Also, I would highly recommend checking out Nathan's full talk on Olmo 3 Think!

https://www.interconnects.ai/p/building-olmo-3-think

Rainbow Roxy:

Couldn't agree more. What's the real value beyond performance?