11 Comments
Paul

as always, quality content, thank you very much!

Cameron R. Wolfe, Ph.D.

thank you for reading!

Anshu
Oct 5 (edited)

Thanks, Cameron, for sharing your deep insights on RL. I'm really enjoying learning from them. A question regarding the contextual bandit section where you mention "Our complete trajectory is a single action and reward!": Does this mean we are in a single-turn setting (i.e., a single query-response generation)? If so, how could we extend this to a multi-turn conversational setting?

Cameron R. Wolfe, Ph.D.

Thanks for the question!

No, this can handle both the single-turn and multi-turn settings. We just have to map the multi-turn chat into a prompt/response format.

For example if we have multiple turns of chat:

user: xxx

assistant: xxx

user: xxx

assistant: xxx

The prompt/response format would be:

=======prompt=======

user: xxx

assistant: xxx

user: xxx

======response======

assistant: xxx

We can generalize this however we want, but the idea is that you can model all prior turns as the prompt and train the LLM to produce the correct response / turn; see the sketch below.
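As a concrete illustration (my own sketch, not code from the post; the function name chat_to_pairs and the message format are assumptions), here is one way to flatten a multi-turn chat into prompt/response training pairs, where all prior turns form the prompt and the latest assistant turn is the response:

```python
# Sketch: turn a multi-turn chat into prompt/response pairs for RL training.
# Every assistant turn becomes a training example whose prompt is the
# concatenation of all earlier turns.

def chat_to_pairs(messages):
    """messages: list of {"role": "user" | "assistant", "content": str}."""
    pairs = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        # Everything before this assistant turn becomes the prompt.
        prompt = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages[:i])
        response = f'assistant: {msg["content"]}'
        pairs.append({"prompt": prompt, "response": response})
    return pairs

chat = [
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
    {"role": "user", "content": "xxx"},
    {"role": "assistant", "content": "xxx"},
]
for pair in chat_to_pairs(chat):
    print(pair["prompt"])
    print("======response======")
    print(pair["response"])
```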

Modeling the completion as a single action mostly relates to how we compute the loss for RL. We can either compute the loss at the token level, or we can aggregate across all tokens / the entire sequence (e.g., by summing token-level log-probabilities) to compute the loss at the completion level. This is mainly what I mean when referring to "Our complete trajectory is a single action and reward!".
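A minimal sketch of that aggregation (my own illustration, assuming PyTorch and assuming the logits are already aligned with the response tokens; completion_logprob is a hypothetical helper, not from the post):

```python
import torch
import torch.nn.functional as F

def completion_logprob(logits, response_ids, response_mask):
    """
    logits:        [batch, seq_len, vocab] from the policy model
    response_ids:  [batch, seq_len] token ids of the sampled completion
    response_mask: [batch, seq_len] 1 for completion tokens, 0 for prompt/padding
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-token log-probabilities of the sampled completion tokens.
    token_logprobs = torch.gather(
        log_probs, dim=-1, index=response_ids.unsqueeze(-1)
    ).squeeze(-1)
    # Summing over the completion collapses the whole response into one "action".
    return (token_logprobs * response_mask).sum(dim=-1)

# A REINFORCE-style, completion-level loss would then look like:
#   loss = -(reward * completion_logprob(logits, response_ids, response_mask)).mean()
# whereas a token-level variant would weight each token's log-probability separately.
```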

Anshu
Oct 6 (edited)

Thanks, Cameron, for your detailed response. I have a few follow-up questions to understand more, and would appreciate your thoughts:

1. As the number of conversation turns grows within a session (or across sessions), what would be the best way to train?

2. If each turn is given a separate reward and there is an overall reward after the session (for instance in task-oriented systems, where the user may upvote most of the turns but not convert after the session), how could we model this when calculating the loss under the bandit approach?

Thank you for your time in advance.

Joshua Mabry

Thank you for posting such a clear and insightful explanation of how these RL algos work. The diagrams and pseudocode work well together.

Cameron R. Wolfe, Ph.D.

Glad you liked this! I’m really trying to provide more code / pseudo code to make ideas more concrete.

Sairam Sundaresan

So much to unpack here. Going to read this through the week. Thanks for your wonderful "knowledge distillation" as always :D

Cameron R. Wolfe, Ph.D.

Thank you! Working towards one huge blog that covers all of RL for LLMs.

Dr. Ashish Bamania

Essential and high-quality RL content. Thanks for taking the time to write this!

Cameron R. Wolfe, Ph.D.

Of course! Thank you for reading