Policy Gradients: The Foundation of RLHF

Cameron R. Wolfe, Ph.D.

Oct 2, 2023

Understanding policy optimization and how it is used in reinforcement learning...

1 Comment

Nerner

Feb 7

Hello!

Thank you very much for the post. I am a data scientist, but RL has always scared me. Your posts encouraged me to take a shot at understanding it :)

I have two/three questions and remarks:

1. The way you use the subscript t on the policy in the probability of a trajectory made me think for a while that the policy gets updated at every time step of the trajectory, because t also indexes time steps on the right side of the product notation. (I do not think that is the case for the algorithms here.)

2. You use the phrase "maintaining/keeping desired expectation of policy gradients" for variants of Basic Policy Gradients. What do you mean by that?

3. Why do we re-fit the baseline in the vanilla policy gradient?

Best!

Deep (Learning) Focus
