2 Comments
Sep 25, 2023 · Liked by Cameron R. Wolfe, Ph.D.

This is another really nice article, Cameron!

Sorry, but there is one thing that I either misunderstood or that is not entirely correct.

> In other words, we cannot backpropagate a loss applied to this score through the rest of the neural network. This would require that we are able to differentiate (i.e., compute the gradient of) the system that generates the score, which is a human that subjectively evaluates the generated text; see above.

I don't think it's necessary to be able to backpropagate through the system (here: human) that generates the score. You can think of the score as a label similar to what you have in supervised learning: a class label if it's discrete ("good" / "bad" for example) or a continuous number as in regression losses (e.g., a reward score reflecting how good the answer is).
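
To make that concrete, here's a toy sketch (PyTorch, with random tensors standing in for frozen LLM features and made-up human ratings) of what I mean by treating the score as a plain regression label:

```python
import torch
import torch.nn as nn

hidden_size, batch_size = 768, 8
response_embeddings = torch.randn(batch_size, hidden_size)   # stand-in for frozen LLM features
human_scores = torch.rand(batch_size, 1)                     # subjective quality ratings in [0, 1]

reward_head = nn.Linear(hidden_size, 1)                      # extra layers trained to predict the score
optimizer = torch.optim.Adam(reward_head.parameters(), lr=1e-4)

predicted = reward_head(response_embeddings)
loss = nn.functional.mse_loss(predicted, human_scores)       # the human score is just a regression label
loss.backward()                                              # no gradient through the human is ever needed
optimizer.step()
```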

There's actually also a recent paper called Direct Preference Optimization (DPO) where they skip the RL part and fine-tune the LLM directly on the preference data in a supervised fashion: https://arxiv.org/abs/2305.18290
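
For reference, the core of their objective looks roughly like this (an illustrative PyTorch sketch, not the authors' reference implementation; the log-probabilities are assumed to already be summed over each response):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit "reward" of each response = beta * log-ratio vs. a frozen reference model
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)
    # Classification-style loss: push the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy summed log-probabilities for a batch of 4 preference pairs
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected, torch.randn(4), torch.randn(4))
loss.backward()   # gradients reach the policy log-probabilities; no reward model, no RL loop
```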

That's a small detail, though, and I love the article. I do think that RL(HF) is a worthwhile approach for improving the helpfulness & safety of LLMs!

Author's reply:

Totally correct. I think the problem here is the exact manner in which RL is applied in the LLM domain, which is slightly different from the "traditional" RL setup.

In the traditional RL literature, the problem is that the environment is a complete black box that is non-differentiable. So, we get a reward, but there is no way to connect this reward to the network's training via supervised learning, because the reward is being provided by a non-differentiable black box.
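
Here's a toy sketch of what I mean (a REINFORCE-style score-function update in PyTorch, with an illustrative linear policy): the reward only ever enters the loss as a scalar weight on the log-probability of the sampled action, so nothing ever needs to be differentiated through the black box.

```python
import torch
import torch.nn as nn

policy = nn.Linear(16, 4)                        # toy policy over 4 discrete actions
state = torch.randn(1, 16)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

reward = 1.0   # comes from a non-differentiable black box (environment or human rater)

loss = -(reward * dist.log_prob(action)).mean()  # reward is just a constant in the loss
loss.backward()                                  # gradients flow only through the policy
```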

For LLMs, there's a key detail that changes this to a certain extent. Namely, the reward is being provided by a reward model, which is obviously differentiable. In this way, we actually make our environment differentiable and can learn end-to-end via supervised learning. AFAIK, however, if we got the scores directly from humans and never trained a model (or just some extra layers on top of the LLM) to automate these rewards, this would not be possible (how would we perform backprop?).
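
Here's a toy contrast (PyTorch, with linear layers standing in for the LLM and the reward model, and glossing over the fact that token sampling itself is discrete): the reward model's score lives in the autograd graph, while a human's score is just a number with nothing to backprop through.

```python
import torch
import torch.nn as nn

generator = nn.Linear(32, 128)      # stand-in for the model producing an output representation
reward_model = nn.Linear(128, 1)    # learned reward model: differentiable by construction

output = generator(torch.randn(1, 32))

model_score = reward_model(output)  # part of the autograd graph
model_score.mean().backward()       # gradients flow back into the generator's parameters

human_score = 0.87                  # a raw number from a person, outside any computation graph
# There is no graph connecting human_score to the generator's parameters, so there is
# nothing to call .backward() on -- exactly the "how would we perform backprop?" issue.
```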

I agree that there are ways of foregoing RL and trying to do everything via a supervised objective. However, this requires learning a model of the environment/rewards. For language models, we already do this, and it is actually pretty simple. Because of this, adding some extra layers and learning to predict the reward to enable supervised learning seems super easy. However, this is not always the case. E.g., if we want to model rewards for an autonomous vehicle setup, it's way harder to figure out how to train an accurate reward model (what inputs/outputs do we need to accurately predict the reward?) that could allow us to learn in a supervised fashion. Put simply, the reward/environment model may be really difficult to train/create depending on the problem domain.

With all this being said, I'm definitely not an expert in RL, and I'm trying to learn this myself as I write these articles. So, I really appreciate the feedback given that I could be getting some details wrong here!
