This is an exceptionally thorough analysis of the online-offline performance gap in LLM alignment. The finding that on-policy sampling provides consistent benefits across different scales and domains is particularly compelling. The semi-online approach seems like a pragmatic middle ground that could democratize access to higher-quality alignment techniques for smaller teams.
again, super dense and great article, thx!
Thanks for reading!
I wonder if these algorithms use network analysis measures like eigenvector and degree centrality.
Haven't seen anything like this yet, but totally possible that it's out there!