Discussion about this post

Keagen Hadley

Fantastic article, Cameron.

Super informative!

Sebastian Raschka, PhD

This is another really nice article, Cameron!

Sorry, but there is one thing that I either misunderstood or that is not entirely correct.

> In other words, we cannot backpropagate a loss applied to this score through the rest of the neural network. This would require that we are able to differentiate (i.e., compute the gradient of) the system that generates the score, which is a human that subjectively evaluates the generated text; see above.

I don't think it's necessary to be able to backpropagate through the system (here: human) that generates the score. You can think of the score as a label similar to what you have in supervised learning: a class label if it's discrete ("good" / "bad" for example) or a continuous number as in regression losses (e.g., a reward score reflecting how good the answer is).
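
As a small illustration of this point, here is a minimal PyTorch sketch (the dimensions, layer sizes, and names are placeholders, not from any particular implementation): the human score enters only as a fixed target tensor, so no gradient ever needs to flow "through" the human.

```python
import torch
import torch.nn as nn

# Sketch: the human score is just a fixed target, so gradients flow only
# through the model that predicts the score, never through the human rater.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

features = torch.randn(8, 768)   # stand-in for text representations
human_scores = torch.rand(8, 1)  # stand-in for human ratings, used as labels

loss = nn.functional.mse_loss(reward_model(features), human_scores)
loss.backward()                  # ordinary supervised regression step
```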

There's actually also a recent paper called Direct Preference Optimization (DPO), where they skip the RL part and update the LLM directly on human preference data in a supervised fashion: https://arxiv.org/abs/2305.18290
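
The DPO objective is simple enough to write down directly; here is a rough sketch (the variable names and the beta value are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss on sequence log-probs of preferred vs. rejected responses."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```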

That's a small detail, though, and I love the article; I do think that RL(HF) is a worthwhile approach for improving the helpfulness & safety of LLMs!

