5 Comments
Sairam Sundaresan

This is phenomenally detailed. Going to come back to this in a while to absorb all the information. Great work, Cameron. This must have taken a lot of effort to put together.

Cameron R. Wolfe, Ph.D.

Thanks! I made the decision to switch to writing posts every 2 weeks again (instead of every week). I felt like the more frequent posts were lacking in quality a bit, and my best posts are those that I spend time refining, making comprehensive, etc. So, I'm glad that you can tell a quality difference! I'll try to keep the detailed posts coming :)

Sairam Sundaresan

You already set a high bar to begin with. This just elevates it. Awesome work.

Finn

Thanks Cameron!

Ziniu Li

Thanks, Cameron, for the great blog! You may be interested in ReMax, a reinforcement learning method that is more efficient than PPO when used for RLHF. ReMax is also very simple: it takes just 6 lines of code to implement. The paper additionally discusses interesting properties of RLHF that may be insightful for designing better RLHF algorithms.

ReMax paper: https://arxiv.org/abs/2310.10505