Very detailed and well articulated. I have written about full derivations of PPO in general and for PPO for RLHF in my books, yet I enjoyed reading yours and gaining new insights.
Glad you enjoyed, and thanks so much for reading!
This comprehensive guide brilliantly bridges the gap between theoretical RL and practical LLM implementation. The progression from basic policy gradients to GAE is particularly well structured. Your breakdown of the four clipping cases in PPO finally made that mechanism click for me - seeing how the advantage sign determines when clipping activates is invaluable.
Glad it was helpful!
again, thank you!
Might be a stupid question, but in the initial pseudocode of PPO, I see this comment:
# sample prompt completions and rewards
with torch.no_grad():
    completions = LLM.generate(prompts) # (B*G, L)
    rewards = RM(completions) # (B*G, 1) <- I am confused about this shape
Since I feel that the rewards here are actually the action-value function q(s, a), shouldn't rewards also be of shape (B*G, T)? Also, this `rewards` variable is used later to compute the per-token advantage function, so by definition it should be at the per-token level, right?
Great question.
For LLMs, we are usually operating in an outcome reward regime, which means that each completion receives a single reward for the full completion. If we have process rewards (i.e., a reward assigned for every token), then we would indeed have size (B*G, T), but for now this is usually not the case for LLMs.
However, you are also correct that the reward is later used to compute a per-token advantage. The nuance here is that for PPO + LLMs, the KL divergence is incorporated directly into the reward, and the KL divergence is computed per token. So, in this case, the reward for nearly all of our tokens is just the KL penalty, while the reward for the final token is the KL penalty plus the outcome reward. In other words, even though we only receive a single outcome reward, the per-token KL term means every token in the completion ends up with its own reward.
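As a rough sketch of that construction (variable names like `beta`, `per_token_kl`, and `outcome_reward` are illustrative, not from the article's code), the per-token reward tensor might be assembled like this:

```python
import torch

BG, T = 4, 6          # B*G completions, T generated tokens each
beta = 0.1            # KL penalty coefficient (illustrative value)

per_token_kl = torch.rand(BG, T)      # KL(policy || reference) per token: (B*G, T)
outcome_reward = torch.rand(BG, 1)    # reward model output, one scalar per completion

# Every token is penalized by the KL term...
rewards = -beta * per_token_kl        # (B*G, T)

# ...and the outcome reward is added only at the final token.
rewards[:, -1] += outcome_reward.squeeze(-1)

print(rewards.shape)  # torch.Size([4, 6])
```

So even in the outcome-reward regime, the tensor that feeds the advantage computation already has shape (B*G, T).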
This is a common criticism of PPO - this per-token setup is complicated and unnecessary if we only receive an outcome reward. This has led to the popularization of algorithms like REINFORCE and GRPO, which eliminate the extra complexity and adopt a simpler approach that does not leverage token-level information.
However, given that we implement PPO with per-token rewards (the full MDP setup) in this code example, the algorithm would work exactly the same if we had rewards for every token! If we wanted to add per-token rewards from a process reward model, there would not be much to change: they would be handled naturally by our implementation, and the reward matrix would indeed have size (B*G, T) instead of (B*G, 1).
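To illustrate why nothing downstream needs to change: a standard GAE loop (a generic sketch, not the article's exact code) only ever consumes a (B*G, T) reward tensor, and is indifferent to whether that tensor is sparse (outcome reward at the last token only) or dense (process rewards everywhere):

```python
import torch

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard GAE recursion, run backwards in time.

    rewards, values: (N, T). The value after the final token is 0
    (the episode ends when generation stops).
    """
    N, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    next_value = torch.zeros(N)
    next_adv = torch.zeros(N)
    for t in reversed(range(T)):
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        next_adv = delta + gamma * lam * next_adv
        advantages[:, t] = next_adv
        next_value = values[:, t]
    return advantages

# Sparse outcome-only rewards and dense process rewards flow through identically:
sparse = torch.zeros(2, 5); sparse[:, -1] = 1.0
dense = torch.rand(2, 5)
print(gae(sparse, torch.zeros(2, 5)).shape)  # torch.Size([2, 5])
print(gae(dense, torch.zeros(2, 5)).shape)   # torch.Size([2, 5])
```

With gamma = lam = 1 and a zero critic, the recursion reduces to the sum of future rewards, which is a quick sanity check on the implementation.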
Hope this helps!
Thanks for the clarification! My main confusion was with the advantage function. At each time step, it computes how good the predicted token a_t is when appended to the prompt plus the t-1 tokens already generated - i.e., the final reward the reward model would predict from the partial token sequence generated from time 0 to time t - subtracting the critic's value at s_t from the action-value function for lower variance. So this q(s_t, a_t) should take different values as the generated token sequence grows. You also point out that this is feasible: it would be quite natural to use the pretrained reward model to output exactly this q(s_t, a_t) on the growing token sequences. On the other hand, if I understand correctly, a simplified version that replaces the varying q(s_t, a_t) with q(s_T, a_T) (i.e., the reward on the fully generated token sequence) is also applicable and universally used. This exploits the fact that, empirically, the problem puts all the emphasis on the terminal-state reward (since it makes the most sense to judge the complete token sequence), while all intermediate steps yield zero reward.