11 Comments
Paul

as always, quality content, thank you very much!

Cameron R. Wolfe, Ph.D.

thank you for reading!

Anshu
Oct 5 (edited)

Thanks, Cameron, for sharing your deep insights on RL. I'm really enjoying learning from them. A question regarding the contextual bandit section where you mention "Our complete trajectory is a single action and reward!": Does this mean we are in a single-turn setting (i.e., a single query-response generation)? If so, how could we extend this to a multi-turn conversational setting?

Cameron R. Wolfe, Ph.D.

Thanks for the question!

No, this can handle both the single-turn and multi-turn settings. We just have to map the multi-turn chat into a prompt/response format.

For example if we have multiple turns of chat:

user: xxx

assistant: xxx

user: xxx

assistant: xxx

The prompt/response format would be:

=======prompt=======

user: xxx

assistant: xxx

user: xxx

======response======

assistant: xxx

We can generalize this however we want, but the idea is that you can model all prior turns as the prompt and train the LLM to produce the correct response / turn; see the sketch below.
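As a concrete illustration (my own sketch, not code from the post; the function name chat_to_pairs and the message format are assumptions), here is one way to flatten a multi-turn chat into prompt/response training pairs, where all prior turns form the prompt and the latest assistant turn is the response:

```python
# Sketch: turn a multi-turn chat into prompt/response pairs for RL training.
# Every assistant turn becomes a training example whose prompt is the
# concatenation of all earlier turns.

def chat_to_pairs(messages):
    """messages: list of {"role": "user" | "assistant", "content": str}."""
    pairs = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        # Everything before this assistant turn becomes the prompt.
        prompt = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages[:i])
        response = f'assistant: {msg["content"]}'
        pairs.append({"prompt": prompt, "response": response})
    return pairs

chat = [
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
]
for pair in chat_to_pairs(chat):
    print(pair["prompt"])
    print("======response======")
    print(pair["response"])
```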

Modeling the completion as a single action mostly relates to how we compute the loss for RL. We can either compute the loss at the token level, or we can aggregate across all tokens / the entire sequence (e.g., by summing token-level log-probabilities) to compute the loss at the completion level. This is mainly what I mean when referring to "Our complete trajectory is a single action and reward!".
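A minimal sketch of that aggregation (my own illustration, assuming PyTorch and assuming the logits are already aligned with the response tokens; completion_logprob is a hypothetical helper, not from the post):

```python
import torch
import torch.nn.functional as F

def completion_logprob(logits, response_ids, response_mask):
    """
    logits:        [batch, seq_len, vocab] from the policy model
    response_ids:  [batch, seq_len] token ids of the sampled completion
    response_mask: [batch, seq_len] 1 for completion tokens, 0 for prompt/padding
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-token log-probabilities of the sampled completion tokens.
    token_logprobs = torch.gather(
        log_probs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Summing over the completion collapses the whole response into one "action".
    return (token_logprobs * response_mask).sum(dim=-1)

# A REINFORCE-style, completion-level loss would then look like:
#   loss = -(reward * completion_logprob(logits, response_ids, response_mask)).mean()
# whereas a token-level variant would weight each token's log-probability separately.
```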

Anshu
Oct 6 (edited)

Thanks, Cameron, for your detailed response. I have a few follow-up questions to understand more, and would appreciate your thoughts:

1. As the number of conversation turns grows within a session (or across sessions), what would be the best way to train?

2. If each turn is given a separate reward and there is an overall reward after the session (for instance in task-oriented systems, where the user may upvote most of the turns but not convert after the session), how could we model this when calculating the loss under the bandit approach?

Thank you for your time in advance.

Joshua Mabry

Thank you for posting such a clear and insightful explanation of how these RL algos work. The diagrams and pseudocode work well together.

Cameron R. Wolfe, Ph.D.

Glad you liked this! I’m really trying to provide more code / pseudo code to make ideas more concrete.

Sairam Sundaresan

So much to unpack here. Going to read this through the week. Thanks for your wonderful "knowledge distillation" as always :D

Cameron R. Wolfe, Ph.D.

Thank you! Working towards one huge blog that covers all of RL for LLMs.

Dr. Ashish Bamania

Essential and high-quality RL content. Thanks for taking the time to write this!

Cameron R. Wolfe, Ph.D.

Of course! Thank you for reading