15 Comments
Paul:

stellar work as always, thank you.

I didn't know about the forward and reverse KL divergences and https://huggingface.co/blog/NormalUhr/kl-divergence-estimator-rl-llm was also very interesting, thank you for the link!
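
For anyone curious, here's a quick toy sketch (my own, not from either post) of the Monte Carlo KL estimators that link walks through; the two Gaussians standing in for the policy and reference distributions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: q is the sampling (policy) distribution, p the reference.
def logq(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)             # N(0, 1)

def logp(x):
    return -0.5 * (x - 0.5) ** 2 - 0.5 * np.log(2 * np.pi)   # N(0.5, 1)

x = rng.normal(0.0, 1.0, size=100_000)  # samples drawn from q
logr = logp(x) - logq(x)                # log p(x)/q(x)
r = np.exp(logr)

k1 = np.mean(-logr)            # unbiased, high variance
k2 = np.mean(0.5 * logr**2)    # biased, low variance
k3 = np.mean((r - 1) - logr)   # unbiased, low variance

# All three approximate KL(q || p) = 0.5 * 0.5**2 = 0.125; swapping the
# roles of p and q (and sampling from the new q) gives the other direction.
print(k1, k2, k3)
```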

Cameron R. Wolfe, Ph.D.:

of course! I also did not understand this before writing the post; the link provides some really nice intuition :)

MatthewK:

RL also involves fewer bits of information than SFT. So how do you fairly compare them?

Cameron R. Wolfe, Ph.D.:

usually by providing both approaches access to the same data

MatthewK:

How is that possible? If SFT trains the model to say a particular token, that’s log(vocab size) bits. If RL tells the model “wrong answer”, that’s one bit.
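
A back-of-envelope comparison (the vocab size is a made-up example, not from the post):

```python
import math

vocab_size = 128_000                        # hypothetical tokenizer size
sft_bits_per_token = math.log2(vocab_size)  # ~17 bits: one of |V| choices
rl_bits_per_episode = 1.0                   # binary right/wrong feedback

print(f"SFT: {sft_bits_per_token:.1f} bits per token")
print(f"RL:  {rl_bits_per_episode:.1f} bit per episode")
```

So a single supervised token can carry roughly 17x more information than an entire binary-reward rollout.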

Christopher Fleetwood:

Exactly right; all of these papers seem a little silly for not normalising by the information content provided.

More on information content of RL here: https://richardli.xyz/post/information-bandwidth-rl/

MatthewK:

Great post!

Some of the equations don’t appear in dark mode.

Cameron R. Wolfe, Ph.D.:

Yeah, they're PNGs with black text. Will fix this.

Vaibhav:

I tend to think of RL as nudging the model towards a target, rather than entirely changing how it processes things.

Cameron R. Wolfe, Ph.D.:

If you're working with a pretrained model (as opposed to a model learning from scratch) then I definitely agree with this.

(((★Augustine))):

This is informative. I'm currently doing master's research on incremental and continual learning.

Cameron R. Wolfe, Ph.D.:

Great! And thank you for reading!

Matt:

It's nice, but is there source code? :)

Fazeela saleem:

Using on-policy RL and relabeling helps robustify models so they can learn "on-the-job" without forgetting what they already know. By balancing new and old data through stratified sampling, we create an adaptable system that actually gets better over time. This approach turns static AI into a dynamic learner that finally acts more like a human.
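
As a minimal sketch of the stratified sampling idea (my own toy illustration; the function name, the 50/50 split, and the data names are assumptions, not from the post):

```python
import random

def stratified_batch(old_data, new_data, batch_size=32, old_frac=0.5):
    # Mix a fixed fraction of old (replay) examples into every batch
    # so the model keeps rehearsing prior skills while learning new ones.
    n_old = int(batch_size * old_frac)
    n_new = batch_size - n_old
    return random.sample(old_data, n_old) + random.sample(new_data, n_new)

# e.g. batch = stratified_batch(replay_buffer, fresh_examples)
```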

Rainbow Roxy:

Thanks for writing this, it clarifies a lot. You really highlight the core challenge with continual learning. Given how unstructured and noisy real-world data is, I'm curious about the first steps to make that open-ended reality measurable for LLMs. It seems like such a huge hurdle.