as always, quality content, thank you very much!
thank you for reading!
Thanks, Cameron, for sharing your deep insights on RL. Really enjoying learning from them. A question regarding the contextual bandit section where you mention "Our complete trajectory is a single action and reward!": Does it mean this is in a single-turn setting (single turn meaning a single query-response generation)? If so, how could we extend this to a multi-turn conversational setting?
Thanks for the question!
No, this can handle both single-turn and multi-turn settings. We just have to map the multi-turn chat into a prompt/response format.
For example, if we have multiple turns of chat:
user: xxx
assistant: xxx
user: xxx
assistant: xxx
The prompt/response format would be:
=======prompt=======
user: xxx
assistant: xxx
user: xxx
======response======
assistant: xxx
We can generalize this however we want, but the idea is that you can just model all prior turns as the prompt and train the LLM to produce the correct response / turn.
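To make that concrete, here is a rough Python sketch of the mapping (the helper name, role tags, and newline separator are just illustrative placeholders; a real setup would use the model's own chat template):

def chat_to_prompt_response(turns):
    # Treat every turn except the final assistant reply as the prompt,
    # and the final assistant reply as the response we train on.
    # `turns` is a list of {"role": ..., "content": ...} dicts.
    assert turns[-1]["role"] == "assistant", "last turn must be the assistant response"
    prompt = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns[:-1])
    response = f'assistant: {turns[-1]["content"]}'
    return prompt, response

chat = [
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
]
prompt, response = chat_to_prompt_response(chat)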
Modeling the completion as a single action mostly relates to how we compute the loss for RL. We can either compute the loss on a token level, or we can aggregate across all tokens / the entire sequence (e.g., by summing token-level log probabilities) to compute the loss on a completion level. This is mainly what I mean when referring to "Our complete trajectory is a single action and reward!".
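As a rough sketch of that difference (toy tensor sizes, plain REINFORCE-style weighting; `policy_gradient_loss` and its arguments are made up for illustration, and in practice the reward would usually be advantage-adjusted):

import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, target_ids, reward, per_token=False):
    # Log-probability of each sampled completion token under the policy.
    log_probs = F.log_softmax(logits, dim=-1)                                      # [T, V]
    token_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)   # [T]
    if per_token:
        # Token-level: each token contributes its own reward-weighted loss term.
        return -(reward * token_log_probs).mean()
    # Completion-level: sum token log-probs into one sequence log-prob,
    # so the whole completion is treated as a single action with a single reward.
    return -(reward * token_log_probs.sum())

logits = torch.randn(5, 32)               # toy sizes: T=5 tokens, vocab of 32
target_ids = torch.randint(0, 32, (5,))   # the sampled completion tokens
loss = policy_gradient_loss(logits, target_ids, reward=1.0)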
Thanks, Cameron, for your detailed response. I have a few follow-up questions to understand more, and would appreciate your thoughts:
1. As the number of conversation turns grows within a session / across sessions, what would be the best way to train?
2. If each turn is given a separate reward and there is an overall reward after the session (for instance in task-oriented systems, where the user may upvote most of the turns but not convert at the end of the session), how could we model this when calculating the loss using the bandit approach?
Thank you for your time in advance.
Thank you for posting such a clear and insightful explanation of how these RL algos work. The diagrams and pseudocode work well together.
Glad you liked this! I’m really trying to provide more code / pseudocode to make ideas more concrete.
So much to unpack here. Going to read this through the week. Thanks for your wonderful "knowledge distillation" as always :D
Thank you! Working towards one huge blog that covers all of RL for LLMs
Essential and high-quality RL content. Thanks for taking your time writing this!
Of course! Thank you for reading