<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Deep (Learning) Focus]]></title><description><![CDATA[I contextualize and explain important topics in AI research.]]></description><link>https://cameronrwolfe.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!87xa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png</url><title>Deep (Learning) Focus</title><link>https://cameronrwolfe.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 12 May 2026 09:24:14 GMT</lastBuildDate><atom:link href="https://cameronrwolfe.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Cameron R. Wolfe]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[cameronrwolfe@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[cameronrwolfe@substack.com]]></itunes:email><itunes:name><![CDATA[Cameron R. Wolfe, Ph.D.]]></itunes:name></itunes:owner><itunes:author><![CDATA[Cameron R. Wolfe, Ph.D.]]></itunes:author><googleplay:owner><![CDATA[cameronrwolfe@substack.com]]></googleplay:owner><googleplay:email><![CDATA[cameronrwolfe@substack.com]]></googleplay:email><googleplay:author><![CDATA[Cameron R. Wolfe, Ph.D.]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[RL Scaling Laws for LLMs]]></title><description><![CDATA[How scaling laws have evolved from pretraining to reinforcement learning...]]></description><link>https://cameronrwolfe.substack.com/p/rl-scaling-laws</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/rl-scaling-laws</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 20 Apr 2026 09:33:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/abed5c67-abb9-497c-8919-033e2df09e43_1960x1100.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xRsc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xRsc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 424w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 848w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 1272w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xRsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:815119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xRsc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 424w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 848w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 1272w, https://substackcdn.com/image/fetch/$s_!xRsc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1fc9df-44ec-4458-946a-09a0265de59f_1942x1088.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 3])</figcaption></figure></div><p>Scaling is one of the most impactful concepts in the history of AI research. For large language models (LLMs), scaling has mostly been studied in the context of pretraining, where rigorous <a href="http://cameronrwolfe.substack.com/p/llm-scaling-laws">scaling laws</a> have allowed us to clearly define the relationship between compute and performance. Inspired by these predictable trends, the LLM research community has empirically validated pretraining scaling laws across several orders of magnitude. Through this process, we have discovered that meaningful improvements in model capabilities can be consistently achieved by investing more data and compute into pretraining.  </p><blockquote><p><em>&#8220;The way ML used to work is that people would just tinker with stuff and try to get interesting results. That&#8217;s what&#8217;s been going on in the past. Then the scaling insight arrived. Scaling laws, GPT-3, and suddenly everyone realized we should scale. This is an example of how language affects thought. Scaling is just one word, but it&#8217;s such a powerful word because it informs people what to do.&#8221;</em> - <a href="https://www.dwarkesh.com/p/ilya-sutskever-2">Ilya Sutskever</a></p></blockquote><p>The success of scaling laws in the context of pretraining has inspired the same concept of scaling to be applied in other areas of the LLM training process. Most notably, scaling now plays a key role in reinforcement learning (RL), where researchers have demonstrated smooth and predictable model capability improvements with larger-scale training. In this overview, we will study scaling laws in the context of RL. Rather than studying this topic in isolation, however, we will first build a deep understanding of scaling laws for pretraining and aim to outline how scaling laws have evolved in their application to RL. As we will see, the exact definition of scaling laws is completely different between these two domains, <em>but the fundamental concept of scale remains powerful in both</em>. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join many others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Scaling Law Fundamentals</h2><p>Many early advancements in LLMs were driven by scaling up the pretraining process. Put simply, investing more compute into pretraining&#8212;<em>by training a larger model on more data</em>&#8212;yields better performance. We can rigorously define the relationship between compute and performance via a scaling law [13], or an equation that models the decrease in an LLM&#8217;s test loss as compute increases. As we will see, the pretraining process for an LLM follows smooth trends that can be accurately predicted via a scaling law, allowing the performance of larger models to be estimated before they are even trained. This ability to granularly forecast the expected result of a certain training configuration has many benefits:</p><ul><li><p>Significant compute investments are less daunting, as we know what the result of this invested compute will be.</p></li><li><p>Iteration speed for experiments can be increased by running smaller scale experiments and extrapolating their results. </p></li></ul><p>We will now build an understanding of scaling laws for LLM pretraining from the ground up. This knowledge of the mechanics and practical application of scaling laws for pretraining is needed to form a contrast with the scaling laws used for RL training. Pretraining scaling laws are highly standardized and follow a well-defined approach to estimate very particular training metrics. On the other hand, RL scaling laws&#8212;<em>while still informative</em>&#8212;tend to be much messier and bespoke, both in terms of their structure and the quantities that we measure. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;262f0611-7ef2-4dd9-82e3-b606901b201c&quot;,&quot;caption&quot;:&quot;A majority of recent advancements in AI research&#8212;and large language models (LLMs) in particular&#8212;have been driven by scale. If we train larger models over more data, we get better results. This relationship can be defined more rigorously via a scaling law, which is just an equation that describes how an LLM&#8217;s test loss will decrease &#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Scaling Laws for LLMs: From GPT-3 to o3&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-01-06T10:33:42.787Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44e9a03a-b5c7-4eb2-aef8-ef019c38d671_2578x1440.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/llm-scaling-laws&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:152758713,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:149,&quot;comment_count&quot;:9,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Further learning.</strong> Although we cover the key details of pretraining scaling laws in this section, this is a popular and complex topic with a long history of study. For more details and links to further reading, please see the overview above. </p><h4>What is a power law?</h4><p>We can model the LLM pretraining process with a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a>. At the simplest level, a power law describes a relationship between two quantities. A basic power law can be expressed as <code>y</code> <code>=</code> <code>a</code> &#215; <code>x^p. </code>The two quantities being studied are <code>x</code> and <code>y</code>, while <code>a</code> and <code>p</code> are constants that describe their relationship&#8212;<code>a</code><em> controls the vertical position of the curve, while </em><code>p</code><em> controls the steepness or direction of the curve</em>. Plotting this simple power law function gives us the figure shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pZed!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pZed!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!pZed!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!pZed!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!pZed!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pZed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png" width="800" height="300" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/c3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24331,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pZed!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 424w, https://substackcdn.com/image/fetch/$s_!pZed!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 848w, https://substackcdn.com/image/fetch/$s_!pZed!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 1272w, https://substackcdn.com/image/fetch/$s_!pZed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3b410c6-03f9-4214-8d6a-074b1cbbf6ec_800x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Plot of a basic power law between <code>x</code> and <code>y</code></figcaption></figure></div><p>We provide the power law plot in both normal and log scale because most papers that study LLM scaling laws tend to plot their results in log scale. However, the plots provided for LLM scaling do not look like the plot shown above&#8212;<em>they are usually flipped upside down</em>; see below for an example. This is just an inverse power law, which can be formulated as <code>y</code> <code>=</code> <code>a</code> &#215; <code>(1</code> <code>/</code> <code>x)^p</code>. This is nearly identical to a standard power law, but we just use a negative exponent for p. As we can see below, using a negative exponent for <code>p</code> flips the power law plot upside down. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Av0X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Av0X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 424w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 848w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 1272w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Av0X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png" width="452" height="293.8560794044665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:806,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:75955,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Av0X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 424w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 848w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 1272w, https://substackcdn.com/image/fetch/$s_!Av0X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fbae3c-e5d1-4936-8328-057eba9893a3_806x524.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p><strong>LLM power laws.</strong> This inverse power law, when plotted with a log scale, yields the signature linear relationship that characterizes LLM scaling laws. The two quantities we model via this inverse power law in LLM pretraining are:</p><ol><li><p>The LLM&#8217;s test loss <code>L</code>&#8212;<em>the <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> or cross entropy loss in particular (or another entropy-based metric like <a href="https://thegradient.pub/understanding-evaluation-metrics-for-language-models/">bits-per-byte or perplexity</a>)</em>&#8212;measured over an in-distribution, held-out validation set.</p></li><li><p>The compute <code>C</code> spent during pretraining that is estimated via the number of training <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOPs</a> <code>C</code> <code>=</code> <code>6</code> <code>&#215;</code> <code>N</code> <code>&#215;</code> <code>D</code>, where <code>N</code> is the number of model parameters and <code>D</code> is the number of tokens observed during pretraining. </p></li></ol><p>The factor of six used when estimating training compute comes from the fact that the LLM performs a single forward and backward pass during each training step. A single forward pass costs about <code>2N</code> FLOPs per token, and the backward pass is <code>2&#215;</code> the cost of the forward pass. Therefore, a training step costs about <code>6N</code> FLOPs per token, and we multiply this quantity by the total number of tokens observed during training to yield the <code>C</code> <code>=</code> <code>6</code> <code>&#215;</code> <code>N</code> <code>&#215;</code> <code>D</code> approximation. This approximation of pretraining compute was used in one of the first papers to study pretraining scaling laws [13], leading to its adoption in other work on the topic. </p><h4><a href="https://arxiv.org/abs/2001.08361">Neural Scaling Laws</a> [13] and <a href="https://arxiv.org/abs/2203.15556">Chinchilla</a> [14]</h4><p>To develop a more concrete understanding of scaling laws for pretraining, we will overview two seminal papers [13, 14] that established the foundational principles of scaling. In [13], authors study the impact of several settings on the pretraining process, discovering that performance improves smoothly as we increase:</p><ol><li><p>Model parameters.</p></li><li><p>Data volume.</p></li><li><p>Training compute. </p></li></ol><p>More specifically, <em>a power law relationship is observed between each of these factors and the LLM&#8217;s test loss when performance is not bottlenecked by either of the other two factors.</em> To observe these power laws, LLMs with sizes up to 1.5B parameters are trained on several subsets of the <a href="https://github.com/EleutherAI/openwebtext2">WebText2 corpus</a>. As shown below, the performance of these models steadily improves as we increase model size, data volume, or compute. These trends span eight orders of magnitude in compute, six orders of magnitude in model size, and two orders of magnitude in dataset size. The exact power law relationships and equations are provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OvRU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OvRU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 424w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 848w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OvRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png" width="1456" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OvRU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 424w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 848w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!OvRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F071a53ae-e1d5-4af6-bcd5-4402cc27e924_2144x1244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p>Each of these equations is very similar to the inverse power law equation that we saw before, but we set <code>a</code> <code>=</code> <code>1</code> and have an additional multiplicative constant (i.e., <code>C_c</code>, <code>D_c</code>, or <code>N_c</code>) inside of the parenthesis. To fit these power laws, we train a collection of models with different sizes while varying the amount of compute and data used for training. We can then measure the test loss for each of these models, forming a dataset of training configurations with a corresponding test loss. We can then fit the parameters of our power law to this data. Although there are many ways to fit a power law, one common approach for simple power law relationships is to fit a linear model on the observed data in log-log space.</p><p><strong>What do power laws tell us? </strong>Although the power law plots provided above look promising, we should notice that these plots are generated using a log scale. If we generate normal plots (i.e., without log scale), we get the figures below, where we see that the shape of the power law resembles an exponential decay. In this way, <em>increasing the quality of an LLM becomes exponentially more difficult with scale</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L-l2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L-l2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 424w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 848w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L-l2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png" width="1456" height="361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:637526,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!L-l2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 424w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 848w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 1272w, https://substackcdn.com/image/fetch/$s_!L-l2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf07f96-f9d9-467f-a459-fb24788f4e33_2540x630.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Power law plots without log scale</figcaption></figure></div><p><strong>Compute-optimal allocation.</strong> The key scaling trends for LLM pretraining were established in [13], where we see that the LLM&#8217;s test loss follows smooth power law trends with compute, model parameters, and data volume. One important takeaway from this analysis is that, given a fixed compute budget, we get the best results by training a larger model over less data&#8212;<em>usually ending the training process before the model fully converges</em>. Chinchilla [14] builds upon this analysis with an extensive study of optimal compute allocations for pretraining. In particular, the analysis in [14] studies how to optimally allocate a fixed compute budget between model parameters and the number of training tokens to minimize the test loss. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q8V_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q8V_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 424w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 848w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q8V_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png" width="1456" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304698,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q8V_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 424w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 848w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Q8V_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dd7af5d-dc5c-4751-be58-3802d59943a7_1862x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [14])</figcaption></figure></div><p>By training over 400 LLMs of varying sizes over different amounts of data, we learn that the scaling recommendations provided in [13] lead most LLMs to be undertrained&#8212;<em>training these models on more data would yield better results.</em> More specifically, Chinchilla finds that model and data scale should be increased proportionally for pretraining to be compute optimal; see above. This study is conducted using the same scaling law formulations, but authors explicitly sweep various model and data size combinations under a fixed compute budget to find the optimal balance of data and parameters for minimizing test loss. </p><h4>Scaling Laws beyond Pretraining</h4><p>Until recently, most of the compute used for training an LLM was invested into pretraining. We mostly focused on scaling up the pretraining process, while post-training was a less expensive endeavor used to optimize a model&#8217;s style and behavior. The advent of <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning models</a> drastically changed these standards.</p><div class="pullquote"><p><em>&#8220;Scaling RL compute is emerging as a critical paradigm for advancing LLMs. While pre-training establishes the foundations of a model; the subsequent phase of RL training unlocks many of today&#8217;s most important LLM capabilities, from test-time thinking to agentic capabilities&#8230; Deepseek-R1-Zero used 100,000 H800 GPU hours for RL training &#8211; 3.75% of its pre-training compute. This dramatic increase in RL compute is amplified across frontier LLM generations, with more than 10&#215; increase from o1 to o3 and a similar leap from Grok-3 to Grok-4.&#8221; - from [1]</em></p></div><p>Compared to a standard LLM, reasoning models output a long reasoning trace or chain of thought&#8212;<em>typically encapsulated by </em><code>&lt;think&gt;</code><em> </em><code>&#8230;</code><em> </em><code>&lt;/think&gt;</code><em> tokens</em>&#8212;before providing a final answer. This idea was popularized by OpenAI&#8217;s <a href="https://openai.com/o1/">o-series</a> models, which demonstrated drastic improvements in reasoning capabilities by training models to generate reasoning tokens prior to their final answer. The initial release of o1 highlighted two important new axes of scaling:</p><ol><li><p>RL training compute.</p></li><li><p>Inference-time compute.</p></li></ol><p>As shown in the figure below, we observe a smooth increase in performance&#8212;<em>resembling a scaling law</em>&#8212;by increasing RL training and inference-time compute. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OfaP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OfaP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 424w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 848w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 1272w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OfaP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png" width="1456" height="551" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:551,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OfaP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 424w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 848w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 1272w, https://substackcdn.com/image/fetch/$s_!OfaP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e98bbb-e1d6-454b-ac9f-c43a38ba0fb5_2516x952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">source</a>)</figcaption></figure></div><p>This breakthrough in reasoning was followed by the release of the open-weight DeepSeek-R1 [15] reasoning model, which performed on par with o1 and provided a technical report describing how the model was trained. Today, such reasoning capabilities have become the new standard for both open and closed models.</p><p><strong>RL for reasoning. </strong>Reasoning models are trained via <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">RL with verifiable rewards</a>. As shown above, model performance improves as we scale up the RL training process. As a result, recent LLM research has heavily focused on scaling RL training for verifiable tasks (e.g., math or coding). Large-scale RL training has unlocked huge improvements in general reasoning capabilities and the quality of coding agents. However, a non-negligible fraction of total training compute is now being spent on RL, and optimally allocating compute for RL is difficult. </p><p>For pretraining, we use known scaling laws to reason about how to properly invest available compute&#8212;<em>these laws provide a standardized understanding of how performance changes with model size, data, and compute</em>. Given that compute is the primary bottleneck to AI progress, we need analogous scaling laws that enable us to better understand and predict the results of RL training at scale. </p><h2>Background on Reinforcement Learning</h2><p>We will soon build upon our understanding of pretraining scaling laws to study RL scaling laws. However, we cannot properly interpret the scaling properties of RL without first understanding the basics of the RL training process. In this section, we will briefly outline the key concepts needed for this discussion, focusing on the GRPO algorithm and its many variants. We focus on GRPO in particular because it is the most common algorithm to use for large-scale RL training with reasoning models&#8212;<em>at least in publicly disclosed research</em>. For example, the popular DeepSeek-R1 reasoning model [15] uses GRPO for RL training.</p><h4><a href="https://cameronrwolfe.substack.com/p/grpo">Group Relative Policy Optimization (GRPO)</a> [4]</h4><blockquote><p><em>&#8220;We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.&#8221;</em> - from [4]</p></blockquote><p>Proposed in [4], GRPO is an RL optimization algorithm that builds upon prior algorithms like <a href="https://cameronrwolfe.substack.com/p/ppo-llm">Proximal Policy Optimization (PPO)</a>. Whereas PPO was the most popular RL optimizer for LLMs in the <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">RLHF era</a>, GRPO is now almost universally used for large-scale RL with reasoning models<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. GRPO is a simpler and lighter weight optimizer compared to PPO, which has aided its adoption by the LLM research community (especially for open research). The main change made by GRPO relative to PPO is in the advantage estimation technique.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dzfC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Instead of using a <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms-1">value model</a> and <a href="https://cameronrwolfe.substack.com/i/175107358/generalized-advantage-estimation-gae-3">GAE</a> to estimate advantage as in PPO, GRPO estimates the advantage by sampling multiple completions or rollouts (i.e., a &#8220;group&#8221; of completions) for each prompt in a batch and using their rewards to form a <a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">baseline</a>. This group-derived baseline replaces the value function and allows GRPO to not train a value model, thus drastically reducing the memory and compute overhead of RL training. Concretely, the advantage for completion <code>i</code> is computed by normalizing the reward for this completion <code>r_i</code> with the mean and standard deviation of rewards in the group; see below. The same advantage value is assigned to every token <code>t</code> in the completion.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nguf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Intuitively, GRPO looks at the relative difference in rewards between multiple completions to the same prompt. The advantage is defined as the delta of one completion&#8217;s reward relative to the average reward observed in a group. <em>This approach teaches the model to emphasize completions with higher-than-average reward.</em></p><p><strong>Loss function.</strong> Once the advantage has been computed, the loss function used for GRPO is quite similar to that of PPO. The center point of the loss function for both PPO and GRPO is the token-level policy (or importance) ratio. Specifically, this is the ratio of the probability assigned to a token by the current policy and the policy used to generate the rollout (i.e., the &#8220;old&#8221; policy); see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!33-I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!33-I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 424w, https://substackcdn.com/image/fetch/$s_!33-I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 848w, https://substackcdn.com/image/fetch/$s_!33-I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 1272w, https://substackcdn.com/image/fetch/$s_!33-I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!33-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png" width="504" height="261.43689320388347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1442,&quot;resizeWidth&quot;:504,&quot;bytes&quot;:153545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!33-I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 424w, https://substackcdn.com/image/fetch/$s_!33-I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 848w, https://substackcdn.com/image/fetch/$s_!33-I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 1272w, https://substackcdn.com/image/fetch/$s_!33-I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6068f1d-de43-4c1f-a49f-db7bc11a85a8_1442x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The policy (or importance) ratio</figcaption></figure></div><p>Using this policy ratio and the advantage, we can compute the loss function for GRPO as shown below. This loss function uses the same clipping mechanism proposed by PPO; see <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms-1">here</a> for more details. Similarly to PPO, GRPO takes the minimum of a clipped and unclipped objective in its loss formulation, where the objective is just the product of the policy ratio and advantage for token <code>t</code>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CjA3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CjA3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 424w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 848w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 1272w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CjA3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png" width="1456" height="503" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:332135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CjA3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 424w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 848w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 1272w, https://substackcdn.com/image/fetch/$s_!CjA3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cd3413a-a68c-4296-89fb-00611fee8016_2314x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GRPO loss function</figcaption></figure></div><p>The inner term of the loss function for GRPO is computed on the token level. By default, we aggregate this loss over our batch by:</p><ol><li><p>Averaging the token-level losses within each completion.</p></li><li><p>Averaging completion-level losses over the group.</p></li></ol><p>The exact manner in which we aggregate the loss in GRPO can change, and we will soon see that the manner in which we aggregate loss over a batch can impact performance. Given that GRPO computes advantage based on group-level reward statistics, we must sample a large number of completions per prompt to obtain a reliable advantage estimate. As a result, GRPO usually needs relatively large batch sizes in order for training to be stable<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>; see <a href="https://cameronrwolfe.substack.com/i/181791956/assessing-the-health-of-rl-training">here</a> for more details.</p><p><strong>GRPO &amp; reward models. </strong>GRPO is mostly used in verifiable reward settings without a neural reward model. A common misconception about GRPO is that it eliminates the need for a reward model, <em>but GRPO can be used with or without a reward model</em>. In fact, the original GRPO paper [4] used a reward model instead of verifiable rewards! Removing the reward model is a benefit of verifiable rewards, not an intrinsic benefit of GRPO itself&#8212;<em>the primary advantage of GRPO is the elimination of the value model.</em> For more details on GRPO, including example code and a discussion of prior work that led to GRPO, please see the overview below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;14a68421-c94d-4a64-ac3a-2456edd0a011&quot;,&quot;caption&quot;:&quot;This overview provides a deep dive into GRPO, where it comes from, how it works, and the role it has played in creating better reasoning models. RL training is a complex process, but GRPO is a refreshingly simple algorithm that is more efficient and approachable than its predecessors.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Group Relative Policy Optimization (GRPO)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-11-24T10:33:31.743Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f98b75b5-c615-4139-a045-ad9572f3cf9f_2008x1130.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/grpo&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:177823868,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:109,&quot;comment_count&quot;:10,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Recent GRPO Variants</h4><p>The GRPO algorithm exploded in popularity after the release of DeepSeek-R1 [15] as many researchers began to replicate or extend results from the paper. Despite details of the model being openly published, fully replicating the training pipeline for DeepSeek-R1 proved non-trivial, leading many subsequent works to propose tweaks to the GRPO algorithm. In this section, we will overview the most successful modifications that are now commonly adopted for better RL training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8mDV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8mDV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 424w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 848w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1272w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png" width="1456" height="579" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8mDV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 424w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 848w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1272w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Token-level importance and sequence-level advantage in GRPO</figcaption></figure></div><p><strong>Group Sequence Policy Optimization (GSPO) [5]</strong> modifies the GRPO objective by computing the policy ratio on a sequence level rather than at the token level. The GRPO loss (shown above) introduces a misalignment between how the model is optimized and how rewards (or advantages) are assigned:</p><ul><li><p>Advantage is computed at the sequence level (in an outcome reward setting).</p></li><li><p>Policy ratios&#8212;<em>and the loss in general</em>&#8212;are computed at the token level.</p></li></ul><p>As shown in [5], per-token policy ratios tend to have high variance during RL training, which increases the variance of policy gradients and, in turn, leads to training instability. Specifically, the high variance of policy ratios can lead a single token to dominate the loss expression or even cause numerical instability during the RL training process. This problem is particularly acute when training LLMs on long sequences or using large, sparse <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts models</a>. </p><p>To protect against this variance, token-level importance ratios are clipped in the range <code>[1</code> <code>-</code> <code>&#949;,</code> <code>1</code> <code>+</code> <code>&#949;]</code>. This clipping operation is formulated such that tokens have zero contribution to the gradient update if they are clipped within the objective. The importance ratio captures the change in a token&#8217;s probability after multiple policy updates over the same data&#8212;<em>we clip tokens for which we observe a sufficiently large change in their probability</em>. However, simply removing the contribution of these tokens to the policy gradient can be problematic. These can be rare (low probability) tokens that are identified as important by the policy update. Such tokens may capture key reasoning steps that the model needs to learn, but we suppress the learning process via the clipping operation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mUnt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mUnt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 424w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 848w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 1272w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mUnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png" width="1320" height="766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1320,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185239,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mUnt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 424w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 848w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 1272w, https://substackcdn.com/image/fetch/$s_!mUnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b72d974-8494-49fc-a763-5836c05b3150_1320x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The key idea of GSPO is to compute importance ratios for the sequence rather than each token. Once we have derived the sequence-level importance ratio, the GSPO training objective is almost identical to that of GRPO; see above. We apply clipping to the sequence-level importance ratio, use the same advantage, and take a minimum of clipped and unclipped objectives at the sequence level.</p><p>The sequence-level importance ratio can be derived by factorizing the probability of a sequence into a product of individual token probabilities. However, authors in [5] choose to define the sequence-level importance ratio using the logarithmic form of a <a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a>, which is defined as shown below. This geometric mean is taken over token-level probabilities, which normalizes the sequence-level policy ratio by the length of the sequence. By using this approach, we ensure that importance ratios for sequences of different lengths are comparable, as well as improve numerical stability&#8212;<em>especially for long sequences</em>&#8212;by formulating the ratio as a sum over logprobs instead of a product over raw probabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qwpX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qwpX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 424w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 848w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 1272w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qwpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png" width="378" height="139.19594594594594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:592,&quot;resizeWidth&quot;:378,&quot;bytes&quot;:33575,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qwpX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 424w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 848w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 1272w, https://substackcdn.com/image/fetch/$s_!qwpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b2b2caa-32a8-4bd3-ab98-eee49cad4ad7_592x218.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Geometric mean definition</figcaption></figure></div><p>We see in [5] that GSPO improves training stability, sample efficiency, and overall performance. The stability of GSPO is found to be especially useful when training large MoE models, such as <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B">Qwen3-235B-A22B</a>. For these reasons, GSPO was adopted in the training process for the popular <a href="https://arxiv.org/abs/2505.09388">Qwen 3 model series</a>. </p><p><strong>Dynamic Sampling Policy Optimization (DAPO)</strong> [6] is not a single algorithm, but rather a modified recipe that proposes several useful tweaks to the vanilla GRPO optimizer. We see in [6] that the vanilla GRPO optimizer suffers from notable issues like:</p><ul><li><p><em>Entropy collapse</em>: the entropy of the model&#8217;s next token distribution collapses during the training process. Probability mass is primarily assigned to a single token and outputs are more deterministic.</p></li><li><p><em>Reward noise</em>: the training reward is very noisy and does not steadily increase during the RL training process.</p></li><li><p><em>Training instability</em>: the training process is unstable and may diverge.</p></li></ul><p>To solve these issues, authors in [6] propose a suite of tricks that can be used in tandem. First, the entropy collapse problem in GRPO is shown to be caused by the fact that clipping emphasizes high probability tokens and punishes low probability (exploratory) tokens. The &#8220;clip higher&#8221; approach is proposed in [6] to solve this issue by decoupling lower and upper clipping bounds. Specifically, we clip in the range <code>[1-&#949;_low, 1+&#949;_high]</code>, where <code>&#949;_low=0.2</code> (default setting in GRPO) and <code>&#949;_high=0.28</code> in [6]. Increasing <code>&#949;_high</code> prevents entropy collapse and improves overall GRPO performance; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SNyV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SNyV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 424w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 848w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1272w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png" width="1456" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SNyV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 424w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 848w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1272w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>As RL training progresses, the number of samples for which all completions in a group are accurate increases. Such groups have zero advantage and, in turn, no impact on the policy gradient. As a result, these groups effectively reduce the batch size in GRPO, leading to noisier gradient estimates and degraded sample efficiency. Dynamic sampling is proposed in [6] to solve this problem by:</p><ol><li><p>Filtering all prompts with perfect accuracy from a batch.</p></li><li><p>Continuing to sample prompts until we have a full batch. </p></li></ol><p>This approach can increase the cost of constructing a batch, as we dynamically continue sampling prompts until the batch is full. However, we see in [6] that this cost is offset by the improved sample efficiency of RL training; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!201L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!201L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 424w, https://substackcdn.com/image/fetch/$s_!201L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 848w, https://substackcdn.com/image/fetch/$s_!201L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1272w, https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png" width="1456" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!201L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 424w, https://substackcdn.com/image/fetch/$s_!201L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 848w, https://substackcdn.com/image/fetch/$s_!201L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1272w, https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>Finally, DAPO also proposes a modified loss aggregation strategy and a new approach for handling completions that exceed the maximum sequence length. Vanilla GRPO aggregates token-level losses by i) computing the average loss in each sequence and ii) averaging sequence-level losses in the batch. However, this approach introduces a subtle bias&#8212;<em>tokens within longer sequences have relatively less contribution to the overall batch gradient</em>. To solve this, DAPO computes a token-level loss that is simply averaged over all tokens in the batch; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hW2Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 424w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 848w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png" width="1456" height="919" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:919,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 424w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 848w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>Additionally, a length-based penalty term is introduced to the reward to apply a &#8220;soft&#8221; punishment to completions that are too long. Instead of assigning a hard negative reward to any completion that exceeds the maximum sequence length, authors in [6] argue that we should slowly increase the overlong penalty to its maximum value as we approach the maximum sequence length. This approach provides a smooth length penalty from which the model can effectively learn.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gilS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gilS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 424w, https://substackcdn.com/image/fetch/$s_!gilS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 848w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1272w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png" width="1456" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gilS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 424w, https://substackcdn.com/image/fetch/$s_!gilS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 848w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1272w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p><strong>GRPO Done Right (Dr. GRPO) [7]</strong> outlined two key sources of bias that exist in the vanilla GRPO algorithm (depicted above):</p><ol><li><p><em>Response-level length bias</em>: GRPO normalizes the summed loss of tokens in each sequence by the total number of tokens in that sequence, leading to biased gradient updates based on the length of each response.</p></li><li><p><em>Question-level difficulty biases</em>: the standard deviation term in the denominator of the advantage formulation in GRPO causes the advantage to become very large for questions that are either too easy (i.e., most responses have a reward of one) or too hard (i.e., most responses have a reward of zero).</p></li></ol><p>To solve the first bias, Dr. GRPO aggregates the loss by summing token-level losses in a sequence and dividing this sum by a fixed constant <code>MAX_TOKENS</code>, thus removing response length from the aggregation process. The difference between this loss aggregation strategy and that of DAPO is nuanced. In DAPO, each token in the batch has an equal contribution to the gradient. As a result, DAPO still places more emphasis upon longer sequences in a batch, as these sequences have a larger ratio of total tokens (even if all tokens are weighted equally across the batch). On the other hand, replacing the sequence-level average with division by a fixed constant in Dr. GRPO effectively decouples aggregation from response lengths and, in turn, protects against length-based optimization bias. </p><p>The question-level difficulty bias is handled by removing the standard deviation term from the advantage estimator; see below. By making these two changes, Dr. GRPO improves training stability and efficiency, while making the resulting model more token efficient (i.e., responses are not artificially long). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wrhA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wrhA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 424w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 848w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1272w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png" width="1456" height="659" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wrhA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 424w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 848w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1272w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p><strong>Truncated Importance Sampling (TIS) [9]</strong> attempts to address mismatches in token probabilities introduced by efficient RL training frameworks. As we know, there are two main operations that occur during RL training: <em>i)</em> sampling rollouts and <em>ii)</em> computing policy updates. In modern RL frameworks, these operations are usually handled via separate engines:</p><ul><li><p>Optimized inference engines like <a href="https://docs.vllm.ai/en/latest/">vLLM</a> or <a href="https://docs.sglang.io/">SGLang</a>&#8212;<em>often with lower precision inference (e.g., </em><code>int8</code><em> or </em><code>fp8</code><em>) for extra efficiency&#8212;</em>are used to generate rollouts.</p></li><li><p>Distributed training frameworks like <a href="https://engineering.fb.com/2021/07/15/open-source/fsdp/">FSDP</a> or <a href="https://www.deepspeed.ai/training/">DeepSpeed</a> are used to compute policy updates.</p></li></ul><p>Given that generating rollouts consumes the majority of compute during RL<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, this approach is usually necessary&#8212;<em>we want the inference process to be as efficient as possible</em>. However, the use of separate engines can also introduce non-negligible differences in the token probabilities produced by each engine; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YoVu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YoVu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 424w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 848w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1272w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png" width="418" height="285.0309951060359" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1226,&quot;resizeWidth&quot;:418,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YoVu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 424w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 848w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1272w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Additionally, this difference in token probabilities is not easy to fix by simply standardizing implementations across engines. Authors in [9] investigate several code interventions to decrease the gap in token probabilities with little success, and this process would have to be repeated for every combination of engines used for RL training. Instead, a more flexible approach is proposed in [9] that uses an <a href="https://cameronrwolfe.substack.com/i/181791956/your-efficient-rl-framework-secretly-brings-you-off-policy-rl-training-4">importance sampling term</a> to automatically correct for this engine mismatch during RL within the policy gradient expression. The exact expression is shown below and is formulated as a <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE-style policy update</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HUpB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HUpB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 424w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 848w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 1272w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HUpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png" width="1456" height="413" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:413,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HUpB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 424w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 848w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 1272w, https://substackcdn.com/image/fetch/$s_!HUpB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a9bbe0-ad07-45de-82d2-6444addf8d96_2446x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Policy gradient with different engines and TIS</figcaption></figure></div><p>The above expression explicitly uses a different engine for sampling rollouts (<code>sampler</code>) and computing policy updates (<code>learner</code>). The importance ratio between these two engines is simply the quotient of probabilities from learner and sampler engines. In [9], authors truncate this importance ratio by capping it at a maximum size of <code>&#961;</code>. Compared to the clipping operation used in PPO or GRPO, this truncation operation has a few differences:</p><ul><li><p>We are directly truncating the importance ratio. Clipping is also applied to the importance ratio, but it is followed by a minimum of clipped and unclipped objectives, making the operation two-sided.</p></li><li><p>We truncate the importance ratio with a maximum value of <code>&#961;</code>, a one-sided truncation that simply prevents extreme up-weighting. </p></li></ul><p>However, the practical application of TIS is quite simple&#8212;we just compute the truncated importance ratio and multiply our policy gradient expression by this ratio. As shown below, including this importance ratio in the policy gradient has a huge impact on RL training stability and model performance, leading to quick adoption of TIS in popular training frameworks (e.g., <a href="https://verl.readthedocs.io/en/latest/algo/rollout_corr_math.html">verl</a> and <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py">OpenInstruct</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s35R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s35R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 424w, https://substackcdn.com/image/fetch/$s_!s35R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 848w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png" width="1456" height="998" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:998,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s35R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 424w, https://substackcdn.com/image/fetch/$s_!s35R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 848w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>We formulated TIS above using a sequence-level importance ratio with REINFORCE. However, we can also create a token-level formulation with PPO or GRPO; see below. As we can see, the truncated importance ratio is computed in addition to the other components of the PPO-style policy gradient expression. We then multiply the existing expression by this correction term, and this can be done either at the sequence level&#8212;<em>similarly to the gradient expression used by GSPO</em>&#8212;or at a token level&#8212;<em>as in the normal expression for PPO or GRPO</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9naH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9naH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 424w, https://substackcdn.com/image/fetch/$s_!9naH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 848w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1272w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png" width="1456" height="158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9naH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 424w, https://substackcdn.com/image/fetch/$s_!9naH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 848w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1272w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Token-level TIS is commonly used in practice due to aligning well with PPO and GRPO objectives and is, therefore, relatively simple to integrate into existing training frameworks in addition to offering stability benefits. In recent work [10], however, authors have argued from an analytical perspective that sequence-level TIS is less biased compared to token-level TIS. Currently, there is no clear consensus on which of these approaches is superior. The best implementation in practice may differ depending upon the setup or domain being considered.</p><p><strong>Clipped Importance Sampling-Weight Policy Optimization (CISPO)</strong> [8] is another recent RL variant that, similarly to TIS, builds upon a REINFORCE-style objective with an added importance ratio. When using a PPO-style clipping approach, we know any token that is clipped from the objective has no contribution to the policy gradient. In [10], authors observe empirically that the important &#8220;fork&#8221; tokens in the model&#8217;s reasoning trace (e.g., &#8220;aha&#8221; or &#8220;wait&#8221;) are rare and are initially assigned low probabilities in the base model. Due to the importance of these tokens, their probability usually increases drastically after the first policy update, leading these tokens to have a very large importance ratio&#8212;<em>that is then clipped by the PPO objective</em>&#8212;for subsequent policy updates. </p><div class="pullquote"><p style="text-align: center;"><em>&#8220;We found that tokens associated with reflective behaviors&#8230; were typically rare and assigned low probabilities by our base model. During policy updates, these tokens were likely to exhibit high [importance ratio] values. As a result, these tokens were clipped out after the first on-policy update, preventing them from contributing to subsequent off-policy gradient updates&#8230; These low-probability tokens are often crucial for stabilizing entropy and facilitating scalable RL.&#8221; - from [10]</em></p></div><p>As a result, important fork tokens are usually masked from the PPO-style loss after the first policy update for a batch of data. Although this masking may not always be an issue (i.e., most standard RL setups perform only ~2-4 updates on each batch of sampled data), MiniMax-M1 performs 16 policy updates for each batch of data. Therefore, important tokens being masked out of the loss after only one or a few updates can significantly damage training efficiency. To solve this issue, authors adopt the modified REINFORCE-style loss shown below. As we can see, CISPO adopts some of the recommendations proposed by DAPO [6] as well, including the token-level loss formulation to correct for length biases. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vAiI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vAiI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 424w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 848w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 1272w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vAiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png" width="2482" height="646" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:2482,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274008,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe398b044-3b48-4d9f-834e-fd909ad6a964_2482x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vAiI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 424w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 848w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 1272w, https://substackcdn.com/image/fetch/$s_!vAiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ae11b48-c050-4afa-9426-81997ece3529_2482x646.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CISPO loss (from [10])</figcaption></figure></div><p>This loss formulation applies a stop gradient to the clipped importance ratio, ensuring that each token contributes to the loss even when it is clipped. Put differently, the importance ratio is used as a weight that controls the contribution of a token to the policy gradient. Clipping in CISPO puts a cap on this weight, ensuring no single token is over-amplified due to a large importance ratio. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bw8T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bw8T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 424w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 848w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 1272w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bw8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png" width="2447" height="475" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93342927-dc01-4c14-9975-d231f256d047_2447x475.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:475,&quot;width&quot;:2447,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199390,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe398b044-3b48-4d9f-834e-fd909ad6a964_2482x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Bw8T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 424w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 848w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 1272w, https://substackcdn.com/image/fetch/$s_!Bw8T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93342927-dc01-4c14-9975-d231f256d047_2447x475.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Clipping in the GRPO loss</figcaption></figure></div><p>When we look at the GRPO objective, the clipping mechanics are quite different; see above. In particular, token probabilities are only present in the importance ratio, and the gradient flows through the token probability terms inside the importance ratio. When the importance ratio is clipped, <em>the gradient for that token is zero and there is no contribution to the policy gradient</em>. The modified clipping approach used by CISPO ensures that all tokens contribute to the policy gradient, improving the stability and efficiency of RL; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eN8U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eN8U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 424w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 848w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1272w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png" width="1278" height="658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eN8U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 424w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 848w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1272w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>The loss formulation of CISPO and TIS look quite similar, but these algorithms&#8212;<em>despite both using an importance ratio</em>&#8212;aim to solve different issues. CISPO uses the same definition of the importance ratio adopted in PPO and GRPO. This importance ratio is clipped to ensure that token probabilities do not change too much over a single batch of data, thus enforcing a trust region. CISPO simply modifies the manner in which the importance ratio is clipped to ensure that all tokens continue contributing to the policy gradient (with a capped weight) even if they are clipped. On the other hand, TIS uses an importance ratio to capture the difference in token probabilities between training and inference engines, thus correcting for the mismatch between engines during RL training.</p><p><strong>Further reading.</strong> For more details on each of these algorithms, please see the overview linked below, which covers many GRPO variants and modifications.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7d6be587-8e76-4911-ae17-1053fb191f4f&quot;,&quot;caption&quot;:&quot;Recent research on large language models (LLMs) has been heavily focused on reasoning and reinforcement learning (RL). At the center of this research lies Group Relative Policy Optimization (GRPO) [13], the RL optimizer used to train most open-source reasoning models. The popularity of GRPO is enhanced by its conceptual simplicity and pr&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;GRPO++: Tricks for Making RL Actually Work&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2026-01-05T10:33:50.056Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/168ff804-da03-4ce5-84be-4f3f7322ff70_2500x1404.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/grpo-tricks&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:181791956,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:130,&quot;comment_count&quot;:10,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Regularization for RL</h4><p>There are two primary regularization terms commonly added to RL training:</p><ol><li><p><em>Entropy bonus</em>: rewards the LLM for remaining uncertain and helps to avoid overly-confident token distributions.</p></li><li><p><em>KL divergence</em>: anchors the policy to a reference policy throughout training to prevent the LLM from changing too much. </p></li></ol><p>Regularization terms are less commonly used in recent RL training pipelines, but we will see examples of both strategies being applied later in the overview. To avoid future confusion, we will briefly explain each regularization strategy now. </p><p><strong>KL divergence.</strong> During RL training, we can compute the <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">Kullback-Leibler (KL) Divergence</a> between the current policy and a reference policy&#8212;<em>usually the policy from before RL training begins (i.e., the base model)</em>. There are several techniques that can be used to approximate the KL divergence between two models; see <a href="https://huggingface.co/blog/NormalUhr/kl-divergence-estimator-rl-llm">here</a>. The easiest&#8212;<em>and most common</em>&#8212;approximation of KL divergence [7] is the difference in token-level log probabilities between the current policy and the reference policy. This approximation and another common variant used in the original GRPO paper [12]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> are outlined below. Both estimators are usually supported in open RL implementations; e.g., see <a href="https://github.com/huggingface/trl/blob/main/trl/experimental/ppo/ppo_trainer.py#L411">their implementation in TRL</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEi2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEi2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png" width="507" height="262.9017857142857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:507,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iEi2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Common approximations of KL divergence</figcaption></figure></div><p>After the KL divergence has been computed, there are two common ways that it can be incorporated into the RL training process:</p><ol><li><p>By directly subtracting the KL divergence from the reward.</p></li><li><p>By adding the KL divergence to the loss function as a penalty term.</p></li></ol><p>Both of these approaches can be found in practice depending on the RL optimizer&#8212;<em>or exact implementation</em>&#8212;being used. PPO incorporates KL divergence into the reward, while GRPO adds it as a penalty to the objective function; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sLHR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sLHR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 424w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 848w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 1272w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sLHR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png" width="1456" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239192,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sLHR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 424w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 848w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 1272w, https://substackcdn.com/image/fetch/$s_!sLHR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86db64e2-cafc-4924-a790-4c115311a0bb_2188x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two ways of incorporating the KL divergence into RL</figcaption></figure></div><p>Due to the popularity of GRPO, recent RL implementations more often include the KL divergence in the loss, but completely omitting the KL divergence&#8212;<em>and not using any regularization</em>&#8212;is becoming increasingly common. During training, the KL divergence term penalizes our policy from becoming too different from the reference policy, but drifting away from the reference policy may not be negative if we are performing large-scale, reasoning-oriented RL training.</p><p><strong>Entropy bonus.</strong> From an <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">information theory perspective</a>, entropy captures the level of uncertainty associated with the possible states for a variable:</p><ul><li><p><em>High entropy</em>: probability mass is spread across many outcomes. </p></li><li><p><em>Low entropy</em>: probability mass is concentrated on a few outcomes. </p></li></ul><p>In the LLM domain, we can measure the entropy of a model&#8217;s token distribution&#8212;<em>low entropy means that the LLM places most of its probability into a small set of tokens and vice versa</em>. Specifically, we can compute entropy using the equation below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L8hv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L8hv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 424w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 848w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 1272w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L8hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png" width="569" height="168.43337912087912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:1456,&quot;resizeWidth&quot;:569,&quot;bytes&quot;:221860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L8hv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 424w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 848w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 1272w, https://substackcdn.com/image/fetch/$s_!L8hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eafef80-c3e0-4ec3-94ea-5a0ac58cee7b_2386x706.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Entropy of an LLM token distribution</figcaption></figure></div><p>Usually, entropy is computed for each token (i.e., at each decoding step) and then averaged across the generated trajectory. After computing the entropy, we can turn it into an entropy bonus and use it as a regularization term by simply scaling it with a coefficient &#946; and incorporating it into either the reward&#8212;<em>this is done in the <a href="https://arxiv.org/abs/1707.06347">original PPO paper</a></em>&#8212;or the objective function. The purpose of the entropy bonus is to prevent the LLM from becoming overly confident in its token distribution and, in turn, avoid <a href="https://cameronrwolfe.substack.com/i/181791956/assessing-the-health-of-rl-training">entropy collapse</a> that prevents the policy from exploring during training. Similarly to the KL divergence, entropy bonuses are now more commonly incorporated into the loss function. In fact, we will soon study a paper that adds an entropy bonus to the GRPO loss [3]. </p><h2>Scaling the RL Training Process</h2><blockquote><p><em>&#8220;While RL compute for LLMs has scaled massively, our understanding of how to scale RL has not kept pace; the methodology remains more art than science.&#8221;</em> - from [1]</p></blockquote><p>Scaling laws provide researchers with the ability to extrapolate the performance of expensive training runs from those that require less compute. Despite the expanding role of RL in training frontier models, however, our understanding its fundamental scaling properties remain somewhat rudimentary, especially relative to pretraining. In this section, we will take a look at several notable papers that are trying to solve this issue. As we will see, RL scaling laws are very different from those used for pretraining, and many of these differences arise from the massive design space of RL training. Put simply, <em>RL is complicated</em>, and we are far from a single standardized approach for handling RL &#8220;correctly&#8221;. However, there are still useful scaling insights that can be gleaned from this work that will help us to allocate available compute for RL experiments more effectively. </p><h4><a href="https://arxiv.org/abs/2510.13786">The Art of Scaling Reinforcement Learning Compute for LLMs</a> [1]</h4><p>Unlike pretraining, RL has no established predictive scaling laws for reliably estimating performance trends. Best practices for RL are found in <a href="https://arxiv.org/abs/2503.14476">new algorithm proposals</a>, but these findings may not generalize at scale. <a href="https://arxiv.org/abs/2506.13585">Model reports</a> also frequently provide practical recommendations for RL training, but these methods are often anecdotal and dependent upon training settings. As a result, we must test RL design choices the hard way&#8212;<em>by running large-scale experiments and seeing what works</em>. Given the computation cost of modern RL, this approach is a major bottleneck that limits iteration speed and hinders technical progress. We need a standardized approach to identify strong RL candidates at smaller scales. </p><p><strong>RL scaling.</strong> In [1], authors model the RL training process with <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoidal</a> compute-performance curves. We fit such a curve separately for each RL training run to model the relationship between expected reward&#8212;<em>calculated over a validation set at regular intervals during training</em>&#8212;and compute (in units of GPU hours); see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ABQR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ABQR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 424w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 848w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 1272w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ABQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:294195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ABQR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 424w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 848w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 1272w, https://substackcdn.com/image/fetch/$s_!ABQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79fdf63c-edc2-4cc9-a0b4-cc629cacc4a6_2360x786.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Saturating S-curve for RL scaling (from [1])</figcaption></figure></div><p>This curve models the relationship between two quantities:</p><ol><li><p><em>Reward gain:</em> the difference between the reward after RL training with compute <code>C</code> and the initial reward before RL training.</p></li><li><p><em>Asymptotic reward ceiling</em>: the maximum possible gain in reward we can achieve by spending unlimited compute on RL training. </p></li></ol><p>The relationship between these quantities is controlled by the term <code>1</code> <code>/</code> <code>[1</code> <code>+</code> <code>(C_mid&#8203;/C)^B</code>&#8203;<code>]</code>. This term includes <em>i)</em> the compute level at which we reach the midpoint of the curve <code>C_mid</code>, <em>ii)</em> an efficiency exponent <code>B</code> for the steepness of the curve, and <em>iii)</em> the current compute level <code>C</code>. Intuitively, this term captures how much of the total possible performance gain has been unlocked by running RL with compute <code>C</code>. The shape of this curve is visualized in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d1XI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d1XI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 424w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 848w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 1272w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d1XI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d1XI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 424w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 848w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 1272w, https://substackcdn.com/image/fetch/$s_!d1XI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe7c18d-59a0-453c-9a4e-99d7abc19add_1762x732.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>According to this structure, RL training is flat in terms of reward during the early phase of training, then undergoes a phase of fast improvement before reaching a plateau. Authors find in [1] that these saturating curves between compute and reward robustly model the RL training process in practice. As we will see, this structure has also been validated and adopted by other work on RL scaling. </p><p>We fit this curve to the results of each RL training run, allowing us to compare results obtained from multiple runs with different training setups. There are two ways that these scaling curves may differ:</p><ol><li><p>Their value of <code>A</code> may be different, indicating that one training setting achieves better asymptotic performance.</p></li><li><p>Their value of <code>B</code> (or <code>C_mid</code>) may be different, meaning that one training setting is more compute efficient than the other. </p></li></ol><p>However, not all training settings yield a benefit in both <code>A</code> and <code>B</code>. In such cases, authors prioritize asymptotic improvements over efficiency improvements, arguing that improving the model&#8217;s asymptotic performance is more valuable because a degradation in efficiency can be offset by just training for longer.</p><p><strong>Applying RL scaling laws.</strong> The RL scaling laws proposed in [1] allow us to extrapolate the performance of a training run without incurring the full cost. <em>We can use the early phase of training to predict what would be the final performance after training with more compute.</em> The scaling law is fit using validation performance measured at regular intervals during training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, allowing us to efficiently assess the scalability of different changes to the RL training process. </p><div class="pullquote"><p><em>&#8220;We propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.&#8221; - from [1]</em></p></div><p>Authors in [1] use this approach to derive an optimal training recipe, called ScaleRL. Beginning with a baseline setup, authors test interventions to the RL training process in multiple phases of increasing scale&#8212;<em>4K, 8K, 16K, and 100K GPU hours</em>. In each phase, scaling laws are fit to extrapolate the performance of each setting, allowing authors to both <em>i)</em> verify the accuracy of their scaling law formulation and <em>ii)</em> efficiently discover scalable design choices for RL. </p><blockquote><p><em>&#8220;We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs.&#8221;</em> - from [1]</p></blockquote><p><strong>Baseline RL setup.</strong> RL experiments in [1] primarily use the <a href="https://huggingface.co/datasets/POLARIS-Project/Polaris-Dataset-53K">Polaris-53K</a> math-focused reasoning dataset. Analysis begins with a baseline RL recipe that uses the GRPO loss with <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">no KL divergence</a> and the clip higher approach from DAPO [6]. All models produce a reasoning trace before their final output. A context length of 16K tokens is used&#8212;<em>12K reasoning tokens, 2K input tokens, and 2K output tokens</em>&#8212;as well as a batch size of 768&#8212;<em>a total of 48 prompts with 16 rollouts each</em>. </p><p>To enforce the 12K-token reasoning budget, <a href="https://docs.vllm.ai/en/latest/features/reasoning_outputs/#thinking-budget-control">interruptions</a> are used during training. When a reasoning trace reaches 12K tokens, we append a static end-of-reasoning phrase <code>&#8220;Okay, time is up. Let me stop thinking and formulate a final answer. &lt;/think&gt;"</code> to the model&#8217;s output so that the model can stop reasoning and begin generating a final answer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sCD0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sCD0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 424w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 848w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 1272w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sCD0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png" width="1456" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:493929,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sCD0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 424w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 848w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 1272w, https://substackcdn.com/image/fetch/$s_!sCD0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F335341b7-fdaa-4707-9dcb-f6831d8fd659_2226x984.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>From an engineering perspective, authors adopt a split generator-trainer approach, where a subset of GPUs run optimized inference engines (e.g., vLLM) for generating rollouts and remaining GPUs run a training backend (e.g., FSDP) to update policy parameters. To improve efficiency, an <a href="https://yumoxu.notion.site/async-grpo-in-the-wild">asynchronous RL training approach</a> is adopted. Two algorithms are considered: asynchronous PPO and PipelineRL [11]. As shown above, both approaches achieve similar asymptotic performance <code>A</code>, but PipelineRL has significantly improved compute efficiency <code>B</code> due to its design principles that minimize GPU idle time (e.g., <a href="https://cameronrwolfe.substack.com/i/179769076/rlvr-with-grpo">in-flight weight updates</a>). Authors also find in [1] that it is important to bound the degree of asynchrony by ensuring policy updates are not more than <code>K</code> steps ahead of the generated rollouts. <code>K = 8</code> is found to be optimal for this setting; see above. </p><p><strong>Ablating RL modifications.</strong> To build upon the baseline RL recipe, authors run small-scale RL experiments (i.e., ~4-8K GPU hours) to test the impact of various RL design choices. The baseline loss is first compared with both GSPO and CISPO loss formulations; see below. We see that both GSPO and CISPO noticeably outperform the baseline formulation in terms of asymptotic performance. CISPO also has marginally improved compute efficiency relative to GSPO, leading authors to use CISPO for remaining experiments in [1]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LkRI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LkRI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 424w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 848w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 1272w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LkRI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png" width="1456" height="661" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:294045,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LkRI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 424w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 848w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 1272w, https://substackcdn.com/image/fetch/$s_!LkRI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed8b61-5a8c-47ce-9e71-2201fd6d73fa_1622x736.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>As we learned from the explanation of TIS, using different engines for generating rollouts and computing policy updates can lead to a non-negligible mismatch in log probabilities between training and inference. One change that we can make to minimize this mismatch is using full (<code>float32</code>) precision in the LLM&#8217;s language modeling head&#8212;<em>the final linear layer that predicts token probabilities</em>. As shown above, using a full precision head in training and inference engines significantly improves both asymptotic performance and compute efficiency. We should note, however, that authors do not adopt any approach for correcting the trainer-generator mismatch (e.g., TIS), which could also help to solve this issue. </p><p>Authors also test different loss aggregation strategies, including vanilla GRPO aggregation versus DAPO-style aggregation, finding that the loss aggregation proposed by DAPO [6] tends to perform the best; see below. In a similar vein, several advantage normalization techniques are tested. Specifically, authors test dividing mean-centered rewards by the standard deviation of rewards in a group&#8212;<em>as in vanilla GRPO</em>&#8212;or the standard deviation of rewards in the entire batch, as well as not dividing the mean-centered reward by anything&#8212;<em>as in Dr. GRPO [7]</em>. All techniques perform comparably, indicating that advantage normalization does not significantly impact asymptotic performance; see below. Remaining experiments normalize the advantage using the standard deviation of rewards across the batch due to the slight boost observed in asymptotic performance. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3KPF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3KPF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 424w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 848w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 1272w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3KPF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png" width="1456" height="589" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241682,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3KPF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 424w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 848w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 1272w, https://substackcdn.com/image/fetch/$s_!3KPF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f2c97-e2c5-4cfc-8415-a7a282d8363e_1612x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Authors also discover data curation and filtering strategies that benefit the asymptotic performance of RL. Prompts with zero variance in rewards across a group have zero advantage and, therefore, no contribution to the policy gradient. Filtering these zero variance prompts from the batch benefits asymptotic performance; see below. Notably, this approach is different from the dynamic sampling method proposed in DAPO [6], as we do not continue sampling prompts until the batch is full. Rather, we just filter zero-variance prompts from the batch, forming a smaller effective batch. By doing this, we avoid dampening the policy gradient signal, as we are averaging the policy gradient over a smaller effective batch instead of the full batch that includes prompts with no gradient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rt54!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rt54!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 424w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 848w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rt54!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png" width="1456" height="653" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:653,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:246462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rt54!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 424w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 848w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 1272w, https://substackcdn.com/image/fetch/$s_!Rt54!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eeca3e-a669-4bc2-b301-490e3ca5207c_1624x728.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Many data curriculum strategies have been proposed for RL training, but we learn in [1] that simple approaches can be quite effective. During training, the number of prompts that are solved easily by the current policy increases, and these prompts usually remain easy for the model throughout the rest of training. As shown above, dynamically removing these prompts from the training process improves asymptotic performance. To do this, authors maintain a history of pass rates for each prompt and permanently remove prompts that exceed a pass rate of 90%. This approach, called no positive resampling in [1], avoids wasting compute on prompts that the model already knows how to correctly solve. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VfDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VfDT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 424w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 848w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 1272w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VfDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VfDT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 424w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 848w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 1272w, https://substackcdn.com/image/fetch/$s_!VfDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cc5372-90b1-4d1a-8e96-58fdf305ff11_1936x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>The ScaleRL recipe</strong>, which combines all best practices identified in the smaller-scale experiments outlined above, uses the loss formulation shown above. As mentioned before, a PipelineRL setup, forced interruptions for reasoning, and a full precision language modeling head are also used for ScaleRL. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LLC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LLC8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 424w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 848w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 1272w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LLC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:448567,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LLC8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 424w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 848w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 1272w, https://substackcdn.com/image/fetch/$s_!LLC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892ba933-6d22-4854-aaf4-b7d346bbca5d_1498x706.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>To validate this recipe, larger-scale experiments are performed with up to 16K GPU hours. Authors perform leave-one-out ablations by removing individual components of the ScaleRL recipe to determine if they still have an impact when used in tandem with other components. When fitting sigmoidal scaling curves up to 8K GPU hours, we see that extrapolated results accurately predict performance up to the end of the 16K GPU hour run. In these experiments, the full ScaleRL recipe is found to yield the best performance; see above. Not all components significantly benefit performance in the leave-one-out analysis, but authors argue that these design choices still tend to benefit training stability.</p><blockquote><p><em>&#8220;Even when individual design choices appear redundant within the combined recipe, they often enhance training stability, robustness, or efficiency in ways that generalize across models and setups. ScaleRL retains such components not just for marginal gains in a specific configuration, but because they address recurring sources of instability and variance that arise across RL regimes.&#8221; - from [1]</em></p></blockquote><p><strong>Scaling up.</strong> Based on the analysis in [1], authors perform a final training run of ScaleRL up to 100K GPU hours, finding that the extrapolated performance continues to match actual performance in extended RL training runs; see below. Prior to this large-scale experiment, different methods for scaling up the RL training process (e.g., longer context, larger batch size, larger models, etc.) are considered in [1]. By analyzing these options in this extended run of ScaleRL, we learn the following:</p><ul><li><p>The scaling laws proposed in [1] are also found to accurately extrapolate the performance of <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts models</a>, indicating generalizability to larger models with different architectures. </p></li><li><p>Using a longer context window during RL slows down training progress initially but yields higher asymptotic performance in the long run.</p></li><li><p>Increasing the batch size improves the asymptotic performance of RL and prevents stagnation on downstream benchmarks.</p></li><li><p>How we allocate the batch in terms of number of prompts and number of rollouts per prompt is less impactful&#8212;<em>the total batch size matters most</em>. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hoV0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hoV0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 424w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 848w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 1272w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hoV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png" width="1456" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:403777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hoV0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 424w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 848w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 1272w, https://substackcdn.com/image/fetch/$s_!hoV0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eef4341-82cb-499d-93ed-dc49a878b50f_1498x860.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Key takeaways.</strong> The empirical analysis in [1] is extensive and contains a wide variety of practical details that are incredibly useful for those working on RL. For this reason, those who are interested in gaining a practical grasp of RL training should definitely read the full paper. However, the comprehensive empirical analysis presented in [1] can be largely summarized as follows:</p><ul><li><p>Asynchronous RL (i.e., PipelineRL) with a split generator-trainer setup is highly efficient and yields models that perform well, so long as we bound the level of asynchronicity during training. </p></li><li><p>The proposed ScaleRL training recipe combines all of the practical GRPO modifications that were found to be useful across experiments in [1].</p></li><li><p>The performance ceiling of RL <code>A</code> can be impacted by changes to the RL setup (e.g., loss type or batch size). However, many common RL interventions (e.g., loss aggregation, data curriculum, or advantage normalization) impact compute efficiency <code>B</code> rather than asymptotic performance.</p></li><li><p>The methods that appear superior in smaller-scale RL runs do not always generalize to the high-compute regime. However, we can still identify the recipes that are most scalable by fitting a sigmoidal scaling curve and estimating scaling parameters <code>A</code> and <code>B</code> from early training dynamics. <em>This approach is used constantly throughout the analysis in [1] to judge the scalability of RL recipes without performing full training runs (i.e., 16K-100K GPU hours)</em>. </p></li></ul><h4><strong><a href="https://arxiv.org/abs/2509.25300">Scaling Behaviors of LLM RL Post-Training</a> [2]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Qs7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Qs7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 424w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 848w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Qs7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png" width="1456" height="657" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:532942,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Qs7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 424w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 848w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!1Qs7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f98edc9-ecf9-4aad-9246-a571a1e15e17_2498x1128.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>In [2], authors investigate scaling behaviors of RL post-training using the full Qwen-2.5 model suite&#8212;<em>both base and instruct models</em>&#8212;ranging from 0.5B to 72B parameters. As in [1], this paper studies the impact of factors like model size, data volume, and compute on the performance of models trained with RL. However, this analysis focuses specifically on the mathematical reasoning domain, uses only the vanilla GRPO algorithm, and adopts a different scaling formulation. From the analysis in [2], we learn that RL follows a predictive power-law relationship between test loss and compute or data; see above.</p><p><strong>Scaling formulation.</strong> The scaling law formulation in [2] fits a relationship between test loss&#8212;<em>defined as the error rate (i.e., </em><code>error rate = 1 - accuracy</code><em>) on an in-domain validation set</em>&#8212;and compute or data. As shown below, RL scaling behavior is modeled using a log-linear power law between the test loss <code>L</code>, model size <code>N</code>, and a resource budget <code>X</code>. Here, the resource budget can either be the amount of compute <code>C</code> or the amount of data <code>D</code> used during RL training. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40BS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40BS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 424w, https://substackcdn.com/image/fetch/$s_!40BS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 848w, https://substackcdn.com/image/fetch/$s_!40BS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 1272w, https://substackcdn.com/image/fetch/$s_!40BS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40BS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png" width="584" height="343.34065934065933" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1456,&quot;resizeWidth&quot;:584,&quot;bytes&quot;:214977,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!40BS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 424w, https://substackcdn.com/image/fetch/$s_!40BS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 848w, https://substackcdn.com/image/fetch/$s_!40BS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 1272w, https://substackcdn.com/image/fetch/$s_!40BS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3331ac31-bc98-4935-adc8-4ea475aba219_1486x874.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Core scaling formula (from [2])</figcaption></figure></div><p>In the figure below, we plot this power law&#8212;<em>using both log-log scale and linear scale to make interpreting the plots easier</em>&#8212;for different values of the learning efficiency. We use a fixed value of <code>E(N) = 1.0</code> in this plot for simplicity. As we can see, performance improves log-linearly as the resource budget <code>X</code> increases, and higher learning efficiency <code>K(N)</code> leads to a steeper decrease in test loss.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7hZH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7hZH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 424w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 848w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 1272w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7hZH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png" width="1389" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7hZH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 424w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 848w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 1272w, https://substackcdn.com/image/fetch/$s_!7hZH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72dfd7ad-f8c5-48e9-96d4-054f5cd87798_1389x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Plotting the scaling law with varying learning efficiencies </figcaption></figure></div><p><strong>Performance extrapolation.</strong> As we might have inferred, the scaling formulation used in [1] is quite different from the pretraining scaling laws that we learned about before. More specifically, the scaling trends in [1] can only extrapolate the results of a specific training run to a higher compute regime&#8212;<em>we are predicting what will happen if we continue the RL training process for longer</em>. In contrast, the power law in [2] enables multiple extrapolation regimes:</p><ol><li><p><em>Inter-model</em>: fit the scaling law using data from training runs with smaller models (i.e., 0.5B to 32B Qwen-2.5 models) and predict the performance of a larger model (i.e., Qwen-2.5-72B).</p></li><li><p><em>Intra-model</em>: fit the scaling law using the early training trajectory of a model and predict its performance for the remainder of training.</p></li></ol><p>Both kinds of extrapolation are validated in [2] across base and instruct model variants of several sizes, demonstrating that RL training follows predictable scaling trends across model size <code>N</code>, compute <code>C</code>, and data volume <code>D</code>. Scaling plots shown in [2] always provide both inter and intra-model extrapolation results. </p><p><strong>More on learning efficiency.</strong> In our above scaling law expression, we should notice that the learning efficiency term has a dependence upon <code>N</code>&#8212;<em>learning efficiency follows a saturating trend with model size</em>. Put simply, this means that <em>i)</em> larger models have higher learning efficiency, but <em>ii)</em> marginal efficiency gains begin to diminish with increasing model size. As shown in the plot below, this formulation matches empirical observations. In practice, learning efficiency follows a saturating S-curve&#8212;<em>similar in structure to the scaling law formulation proposed in [1]</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>&#8212;that plateaus at a maximum learning efficiency of <code>K_max</code>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2XIo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2XIo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 424w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 848w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2XIo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png" width="1456" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:452175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2XIo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 424w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 848w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!2XIo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3eb552f-4b0f-4676-b858-31490da937e6_2604x1140.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Experimental setup. </strong>All experiments in [2] use vanilla GRPO&#8212;<em>with a <a href="https://cameronrwolfe.substack.com/i/177823868/group-relative-policy-optimization-grpo">KL divergence term</a></em>&#8212;and the <a href="https://github.com/verl-project/verl">verl</a> training framework. Scaling laws are empirically fit and validated on results from over 60 models, including base and instruct variants from the Qwen-2.5 series with sizes ranging from 0.5B to 72B parameters. The model family is fixed to ensure that only parameter count <code>N</code> and data volume <code>D</code> are changing. RL training is conducted over 50K samples taken from the mathematics subset of  <a href="https://huggingface.co/datasets/LLM360/guru-RL-92k">guru-RL-92K</a>, which performs extensive deduplication and difficulty filtering. Additionally, authors in [2] sort problems by increasing difficulty&#8212;<em>as assessed by the pass rate of <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen2.5-7B-Instruct</a></em>&#8212;to form a data curriculum of increasing problem difficulty as RL training progresses. Following standard practice for fitting scaling laws, we compute test loss on an in-domain dataset of 500 held-out problems sampled from the training distribution.</p><p><strong>Compute-constrained regime.</strong> The power law formulation in [2] can be used to characterize scaling behavior under a fixed compute budget. Given a compute budget <code>C</code>, we are interested in the optimal model size <code>N</code> that minimizes the test loss. In [2], the compute budget is estimated via cumulative training FLOPs <code>C</code> <code>=</code> <code>6</code> <code>&#215;</code> <code>N</code> <code>&#215;</code> <code>T</code>, where <code>T</code> is the number of tokens processed during training. <code>T</code> is related to the data volume <code>D</code>, but whereas <code>D</code> counts data samples, <code>T</code> measures the total token volume. <code>T</code> is inferred from fixed values of <code>C</code> and <code>N</code>. We can study the relationship between compute <code>C</code> and model size <code>N</code> by running RL training with various model sizes and compute budgets, then fitting scaling laws on the results. We use compute as our resource budget (i.e., <code>X = C</code>) for these scaling laws; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lcTN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lcTN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 424w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 848w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lcTN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png" width="1456" height="1260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:493804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lcTN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 424w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 848w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!lcTN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9970f116-d54b-4ce7-9a5a-e4b45a14e47b_1504x1302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>In these plots, we can observe the results of inter-model (top plot) and intra-model (bottom plot) extrapolation using the scaling law formulation proposed in [2]. When studying scaling trends for smaller models (i.e., 0.5B to 32B parameters), we see that the best performance under a fixed compute budget is usually achieved by using the largest model. Larger models (i.e., 32B and 72B parameters) violate this trend: <em>the 32B model performs best at lower compute budgets, but a crossover occurs at higher compute budgets after which the 72B model performs better</em>. </p><div class="pullquote"><p>&#8220;In contrast to the immediate dominance of larger models in smaller parameter regimes, the 32B model outperforms the 72B counterpart initially under equivalent compute budgets, as the smaller model size inherently enables more training steps. We believe this observation reveals a latent trade-off between model scale and training steps in compute constrained scenarios.&#8221; - from [2]</p></div><p>This crossover arises from the fact that learning efficiency <code>k(N)</code> saturates for larger models. Given a fixed compute budget <code>C</code>, a smaller model can train for a larger number of steps relative to a larger model. Therefore, the larger model must have significantly improved learning efficiency in order to outperform the smaller model. As we see in the scaling analysis above, this is true until we reach 72B scale, at which point efficiency gains saturate and the 32B model is able to exceed the performance of the larger model under tight compute constraints.</p><p><strong>Data-optimal scaling.</strong> Given that LLM training is usually bottlenecked by the availability of high-quality data, we also want to understand the optimal model size for RL training given a fixed data budget <code>D</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. To do this, we can train models with various model sizes <code>N</code> and data budgets <code>D</code>, then fit the scaling laws&#8212;<em>where data is our resource budget (i.e., </em><code>X = D</code><em>)</em>&#8212;to these results; see below. The conclusion from this analysis is simple: <em>for a fixed amount of data, larger models demonstrate superior sample efficiency and consistently achieve lower test loss</em>. We also see that scaling laws accurately extrapolate performance in all regimes considered. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ctx9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ctx9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 424w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 848w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ctx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png" width="1456" height="1220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1220,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:585389,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ctx9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 424w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 848w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!ctx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d47571c-1740-4dfe-b044-b003a0a515a6_1602x1342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>If we remove data and compute constraints (i.e., train models to convergence on sufficiently large datasets), test loss monotonically decreases with model size&#8212;<em>bigger models are better given enough data and compute</em>. However, this trend does not follow a power law; see below. Smaller models show weaker gains, indicating diminishing returns for training smaller models to convergence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bpa6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bpa6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 424w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 848w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 1272w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bpa6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png" width="423" height="347.96915167095113" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:778,&quot;resizeWidth&quot;:423,&quot;bytes&quot;:116550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bpa6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 424w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 848w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 1272w, https://substackcdn.com/image/fetch/$s_!bpa6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d06b1b7-7dc7-4b6b-af70-9302c3df4f90_778x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Data reuse.</strong> In addition to the scaling analysis, authors in [2] test if repeating data during training is problematic. These experiments fix the total data budget <code>D_total</code> but vary the number of unique data samples such that <code>D_total = &#964; &#215; D_unique</code>, where <code>&#964;</code> is a data reuse factor. As shown below, we learn in [2] that performance is primarily determined by <code>D_total</code> rather than <code>D_unique</code>. In fact, test loss is relatively insensitive to <code>&#964;</code>, and we see that there is no significant degradation in performance until larger reuse factors (i.e., <code>&#964; = 25</code>). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idme!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idme!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 424w, https://substackcdn.com/image/fetch/$s_!idme!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 848w, https://substackcdn.com/image/fetch/$s_!idme!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 1272w, https://substackcdn.com/image/fetch/$s_!idme!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idme!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png" width="1456" height="626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310649,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!idme!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 424w, https://substackcdn.com/image/fetch/$s_!idme!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 848w, https://substackcdn.com/image/fetch/$s_!idme!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 1272w, https://substackcdn.com/image/fetch/$s_!idme!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2851fdbf-c57d-42e8-8f55-0ec97b5adb81_1606x690.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, unique data is not sampled randomly in these experiments. To ensure that data subsets are sufficiently diverse, authors partition the training set into difficulty subsets and preserve the data difficulty distribution across subsets of different sizes.  Additionally, the same data curriculum is maintained by ordering data based upon difficulty. The robustness of RL training to data reuse is likely dependent upon the diversity, quality, and difficulty of the unique samples. </p><h4><strong><a href="https://arxiv.org/abs/2603.12151">Optimally Scaling Sampling Compute for LLM RL</a> [3]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!imS2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!imS2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 424w, https://substackcdn.com/image/fetch/$s_!imS2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 848w, https://substackcdn.com/image/fetch/$s_!imS2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!imS2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!imS2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png" width="1456" height="619" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:715125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!imS2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 424w, https://substackcdn.com/image/fetch/$s_!imS2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 848w, https://substackcdn.com/image/fetch/$s_!imS2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!imS2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29782b1d-c2db-4057-bd5d-391391423286_2654x1128.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Scaling laws can be applied to RL training in many ways. The work we have seen so far shows that performance during RL follows a sigmoidal trajectory [1] and scales in a predictable manner with model size under fixed compute budgets [2]. However, these results, while informative, do not directly recommend how we can practically allocate a fixed compute budget for RL training in a similar manner to pretraining scaling laws. Inspired by this, authors in [3] perform a prescriptive analysis of optimal compute allocations for RL. Specifically, the analysis in [3] focuses on understanding how to optimally allocate sampling compute&#8212;<em>or the amount of compute spent generating completions for on-policy RL</em>. </p><blockquote><p><em>&#8220;We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps.&#8221;</em> - from [3]</p></blockquote><p><strong>Sampling compute.</strong> The relationship between compute and performance is less straightforward in RL relative to pretraining. For both pretraining and RL, the training process involves a sequence of training&#8212;<em>or model update</em>&#8212;steps. At each pretraining step, a single forward and backward pass is performed. On the other hand, an RL training step includes multiple components:</p><ul><li><p><em>Data collection</em>: sampling completions from the current policy.</p></li><li><p><em>Optimization</em>: updating the policy over collected data. </p></li></ul><p>With this in mind, we can model the total compute cost of an RL training run as <code>C</code> <code>=</code> <code>B_p&#8203;</code> <code>&#215;</code> <code>n</code> <code>&#215;</code> <code>M</code>, where <code>B_p</code> is the number of unique prompts per batch, <code>n</code> is the number of rollouts generated per prompt, and <code>M</code> is the number of steps taken during RL training. The analysis in [3] primarily focuses on compute spent on sampling completions (<code>B_p</code> and <code>n</code>) rather than sequential training steps (<code>M</code>).</p><p><strong>Scaling laws for sampling.</strong> Given the compute footprint for RL outlined above, our goal is to better understand how varying the allocation of a fixed compute budget <code>C_0</code> across the three factors <code>B_p</code>, <code>n</code>, and <code>M</code> impacts model performance. The scaling analysis is conducted in [3] by sweeping over settings of <code>B_p</code> <code>&#8712;</code> <code>{2^5,</code> <code>2^6,</code> <code>&#8230;,</code> <code>2^10}</code> and <code>n</code> <code>&#8712;</code> <code>{2^3,</code> <code>2^4,</code> <code>&#8230;,</code> <code>2^11}</code>, where both of these parameters are uniformly sampled via grid search on a log scale. Due to hardware constraints, a maximum effective batch size (<code>B_p&#8203;</code> <code>&#215;</code> <code>n</code> &#8804; <code>B_max</code>) is also enforced. </p><p>From a given <code>(B_p,</code> <code>n)</code> setting, we can perform a single RL training run to capture increasing settings of <code>M</code> throughout training. Following scaling law best practices, model performance is evaluated during training by measuring reward on an in-domain validation set. The evaluation results in a training run are sub-sampled to only include record-breaking points&#8212;<em>defined as points in the reward curve that exceed all prior rewards</em>&#8212;along the learning trajectory; see below. By considering only record-breaking points when modeling an RL training run, we fit our scaling law to the frontier of the reward trajectory for this run.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n6hm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n6hm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 424w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 848w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n6hm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png" width="429" height="408.375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:429,&quot;bytes&quot;:614426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n6hm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 424w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 848w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!n6hm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68efcf35-a59c-4ae9-8601-5e739900315d_1670x1590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>To robustly identify record-breaking points in the reward trajectory, rewards are separated into discrete bins, and the first point at which the reward enters a new bin is selected as the record-breaking point. Once the cleaned reward trajectory is available, we model this single training run with a sigmoidal scaling law&#8212;<em>similar to the scaling law formulation used in [1].</em> This gives us a collection of scaling laws for RL training curves with different settings of <code>B_p</code> and <code>n</code>, allowing the optimal configuration to be identified at each compute level from run-specific curves. </p><p>Building on these run-specific scaling laws, we can also fit a scaling law on the optimal settings identified for each compute level. Namely, we can use a similar sigmoidal scaling law to model how the optimal value of <code>n</code> varies according to our compute budget <code>C</code>, allowing us to extrapolate optimal training settings at higher compute budgets. In theory, the same approach can be used for <code>B_p</code>, but no clear pattern is observed in practice for the optimal value of <code>B_p</code>. </p><p><strong>Experimental settings.</strong> The scaling analysis described above is conducted using <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen2.5-7B-Instruct</a>, <a href="https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507">Qwen3-4B-Instruct</a>, and <a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct">Llama3.1-8B-Instruct</a> as base models. All RL training runs use binary outcome rewards and the vanilla GRPO optimizer. <a href="https://arxiv.org/abs/2506.14965">Guru-Math</a> is used as the primary dataset and is split into easy and hard subsets by assessing the difficulty of each prompt&#8212;<em>judged by the accuracy of the base model over 16 rollouts (i.e., Avg@16)</em>. The difficulty distribution is shown below with the easy and hard subsets shaded in blue and orange, respectively. Empirical scaling analysis is performed separately on both easy and hard data subsets in [3] to observe how difficulty distributions impact trends in scaling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N2WZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N2WZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 424w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 848w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N2WZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png" width="455" height="295.625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:1456,&quot;resizeWidth&quot;:455,&quot;bytes&quot;:380355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N2WZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 424w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 848w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!N2WZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b6abf39-6334-4879-85d5-36bb45cec349_1776x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>In order to make the RL training process stable, the correct regularization strategy is needed. Interestingly, we see in [3] that optimal regularization is difficulty-dependent. Authors consider adding both a KL divergence and entropy bonus to the RL training objective. On easy problems, the entropy bonus helps to prevent premature entropy collapse in the policy. However, using an entropy bonus on difficult problems can actually cause an entropy explosion by pushing the policy towards rare but successful reasoning trajectories, making it better to remove regularization entirely. As shown below, the following regularization strategy is found to yield the most stable results<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>:</p><ul><li><p>Apply both the entropy bonus and KL divergence&#8212;<em>which helps to delay entropy explosion</em>&#8212;in tandem when training on the easy dataset.</p></li><li><p>Use no regularization when training on the hard dataset.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IqX8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IqX8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 424w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 848w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IqX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png" width="471" height="263.64354395604397" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:471,&quot;bytes&quot;:748513,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IqX8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 424w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 848w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!IqX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47d1dc5e-f949-4b7c-bfc7-d2cc6bab57e4_2386x1336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>In addition to difficulty-dependent regularization, the learning rate must be increased with the batch size to ensure stable training. In particular, a square root scaling rule is used for the learning rate in [3], which increases the learning rate proportionally to the square root of the batch size; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QIyT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QIyT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 424w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 848w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 1272w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QIyT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png" width="524" height="270.99725274725273" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:524,&quot;bytes&quot;:939552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QIyT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 424w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 848w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 1272w, https://substackcdn.com/image/fetch/$s_!QIyT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26361e2f-3d9f-421b-a8a3-3a69cec09e4e_2328x1204.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>How should we allocate compute?</strong> The primary takeaway from the scaling analysis in [3] focuses upon the number of rollouts to sample (<code>n</code>) for each prompt in a batch. As the compute budget increases, the optimal setting of <code>n</code> increases as well, eventually saturating at higher compute budgets; see below. In other words, allocating increased compute towards sampling more rollouts per prompt yields better results compared to just training the model for longer. Interestingly, the exact scaling law also depends on the problem difficulty&#8212;<em>smaller optimal values of </em><code>n</code><em> are observed when training on a harder dataset</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nTLN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nTLN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 424w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 848w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 1272w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nTLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png" width="1456" height="435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1003753,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/192734052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nTLN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 424w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 848w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 1272w, https://substackcdn.com/image/fetch/$s_!nTLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86fb9d9e-28f1-4fd7-83f1-91168d8be896_2566x766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>This trend holds for all base models across both easy and hard training datasets. Intuitively, scaling <code>n</code> has a different impact depending on the problem difficulty:</p><ul><li><p>Sampling more rollouts on easy problems can sharpen performance (i.e., improve Avg@K) for problems that are already solvable and improve policy robustness by lowering the probability of an incorrect rollout.</p></li><li><p>Increasing the number of rollouts increases exploration and, in turn, aids in discovering rare solutions to difficult problems in order to improve the ratio of problems that can be solved (i.e., Pass@K) by the policy.</p></li></ul><p>Interestingly, <code>B_p</code> has only a moderate performance impact when kept within a reasonable range and is found to primarily influence training stability.</p><p>Although the optimal setting of <code>n</code> scales with increasing compute, the exact shape of this scaling law&#8212;<em>and the point at which it saturates</em>&#8212;changes depending on the exact training setup. Therefore, while the scaling trends hold across different settings, the exact scaling parameters must be fit to the particular RL training setup being used. Practically, authors in [3] recommend the following approach for determining an optimal compute allocation in RL:</p><ol><li><p>Execute RL training runs at lower compute budgets by varying the value of <code>B_p</code> and <code>n</code> but restricting the value of <code>M</code>.</p></li><li><p>Fit a scaling law, using the approach described above, from these results.</p></li><li><p>Infer the optimal value of <code>n</code> from the scaling law.</p></li><li><p>Choose the minimum value of <code>B_p</code> that yields stable training.</p></li><li><p>Invest the remaining compute budget into additional training steps <code>M</code>.</p></li></ol><p>This approach provides us with a predictable process for extrapolating the optimal compute allocation for RL from lower-budget experiments.</p><h2>Comparing RL and Pretraining Scaling Laws</h2><p>We now have a detailed understanding of scaling laws for both pretraining and RL. However, one of the primary takeaways from this overview is the fact that a &#8220;scaling law&#8221; is completely different between these two domains. To close, we will briefly discuss the key ways that scaling laws differ between pretraining and RL, explain why these differences exist, and outline key takeaways from research on RL scaling that remain useful despite the overall messiness of RL.</p><p><strong>Measuring performance. </strong>Pretraining scaling laws predict a particular metric:<em> the cross entropy loss (or another related entropy metric) measured over an in-domain, held-out validation set</em>. This performance metric is stable and is typically computed over a large, diverse dataset (i.e., a random sample of the pretraining corpus). Such a stable, diverse, and specific metric provides the perfect y-axis for fitting a scaling law and allows us to clearly define the impact of specific design decisions on the resulting model. RL scaling laws make an attempt to retain this robustness; e.g., performance is computed over an in-domain validation set. However, RL scaling laws typically use the reward (or accuracy) of the policy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> as the underlying metric to which scaling laws are fit. This is a downstream performance metric that can fluctuate substantially depending on the domain being studied, the benchmark being used, and the composition of data in that benchmark. As a result, scaling laws for RL tend to be more noisy and domain specific relative to those used for pretraining, which capture a more general trend in model performance. </p><p><strong>Defining compute.</strong> Pretraining has a very clean compute footprint that is usually estimated with the number of training FLOPs <code>C</code> <code>=</code> <code>6</code> <code>&#215;</code> <code>N</code> <code>&#215;</code> <code>D</code>. This clean definition of compute provides an obvious x-axis for our scaling law. In contrast, RL compute is difficult to define due to the presence of both sampling and policy updates. The exact definition of compute used in RL scaling laws may change depending on the paper we are reading. For example, some papers derive a FLOP-like metric similarly to pretraining [3], while others rely on the number of GPU hours used [1]. Either way, the wall-clock time of RL training varies substantially depending on the framework being used, which means that the relationship between GPU hours and raw compute is not consistent. These factors must be considered when fitting a scaling law for RL because they cause the details of scaling laws to change depending on the exact setup being used.</p><p><strong>Intra and inter-model extrapolation. </strong>Pretraining scaling laws are used to fit trends in performance across many model training runs with different settings to understand how model size, data volume, and compute impact the results of training. Such an approach allows us to cleanly extrapolate the results of costly training runs and use these predictions to reason about how compute should be optimally allocated. In RL, we actually fit two kinds of scaling laws that are used to extrapolate performance in different ways (i.e., inter-model and intra-model extrapolation). Inter-model extrapolation is the primary focus of pretraining scaling laws, whereas intra-model extrapolation is not usually addressed. The main reason intra-model extrapolation is necessary for RL is the sensitivity of the training process. In addition to understanding inter-model trends, we need to be able to predict whether a particular training configuration is viable or not. <br><br><strong>Lack of standardization.</strong> The design space for RL algorithms is quite large: <em>there are simply more &#8220;knobs&#8221; to tweak relative to pretraining</em>. Additionally, we lack a comprehensive understanding of which design decisions meaningfully impact the scaling properties of RL. Although we have seen several papers that study the impact of design decisions on RL scaling, the findings from these papers&#8212;<em>despite being informative</em>&#8212;do not address the fact that scaling trends for RL are coupled to the exact training setup being used. Slight changes in the configuration for RL can completely change the scaling trends we observe. For this reason, most RL scaling laws are bespoke&#8212;<em>the recommendations offered by one specific analysis may not hold in a different environment</em>. As a result, findings can be difficult to replicate or extend, thus slowing scientific progress on the topic. </p><p><strong>Practical takeaways.</strong> Despite the fact that RL scaling laws tend to be messy and bespoke, there are still several useful trends that we can learn from the papers in this overview:</p><ul><li><p>The scaling behavior of RL is predictable within a given setup. Intra-model extrapolation works well and can be used to judge the viability of your setup during the early training phases. Inter-model extrapolation is also effective and can yield useful insights, though these insights may not always transfer across different training configurations.</p></li><li><p>The impact of design decisions on RL is not singular. Some decisions impact learning efficiency, while others impact asymptotic model performance. This distinction is important because degradations in efficiency can be solved by simply training for longer, while a degradation in asymptotic performance may not be trivially recoverable. Interestingly, many recent GRPO variants seem to primarily benefit learning efficiency and stability [1]. </p></li><li><p>Using larger models yields consistently positive results in the RL scaling laws we have seen, though the presence of compute constraints can create interesting tradeoffs. When training with less data or compute, we may actually benefit from using a smaller model due to the fact that learning efficiency saturates with model size. </p></li><li><p>To invest more compute into RL, we can <em>i)</em> run training for more steps or <em>ii)</em> use more inference compute at each step. Interestingly, even though the compute cost of RL is dominated by inference, most scaling laws suggest that allocating more compute to sampling completions is helpful. RL training is surprisingly robust to data reuse, benefits from large batch sizes, and scales predictably as we sample more completions per prompt in a batch.</p></li></ul><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Staff Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Khatri, Devvrit, et al. &#8220;The art of scaling reinforcement learning compute for llms.&#8221; <em>arXiv preprint arXiv:2510.13786</em> (2025).</p><p>[2] Tan, Zelin, et al. &#8220;Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.&#8221; <em>arXiv preprint arXiv:2509.25300</em> (2025).</p><p>[3] Cheng, Zhoujun, et al. &#8220;IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL.&#8221; <em>arXiv preprint arXiv:2603.12151</em> (2026).</p><p>[4] Shao, Zhihong, et al. &#8220;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.&#8221; <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[5] Zheng, Chujie, et al. &#8220;Group sequence policy optimization.&#8221; <em>arXiv preprint arXiv:2507.18071</em> (2025).</p><p>[6] Yu, Qiying, et al. &#8220;Dapo: An open-source llm reinforcement learning system at scale, 2025.&#8221; <em>URL https://arxiv. org/abs/2503.14476</em> 1 (2025): 2.</p><p>[7] Liu, Zichen, et al. &#8220;Understanding r1-zero-like training: A critical perspective.&#8221; <em>arXiv preprint arXiv:2503.20783</em> (2025).</p><p>[8] Chen, Aili, et al. &#8220;Minimax-m1: Scaling test-time compute efficiently with lightning attention.&#8221; <em>arXiv preprint arXiv:2506.13585</em> (2025).</p><p>[9] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL <a href="https://fengyao.notion.site/off-policy-rl">https://fengyao.notion.site/off-policy-rl</a>.</p><p>[10] Chen, Aili, et al. &#8220;Minimax-m1: Scaling test-time compute efficiently with lightning attention.&#8221; <em>arXiv preprint arXiv:2506.13585</em> (2025).</p><p>[11] Pich&#233;, Alexandre, et al. &#8220;Pipelinerl: Faster on-policy reinforcement learning for long sequence generation.&#8221; <em>arXiv preprint arXiv:2509.19128</em> (2025).</p><p>[12] Shao, Zhihong, et al. &#8220;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.&#8221; <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[13] Kaplan, Jared, et al. &#8220;Scaling laws for neural language models.&#8221; <em>arXiv preprint arXiv:2001.08361</em> (2020).</p><p>[14] Hoffmann, Jordan, et al. &#8220;Training compute-optimal large language models.&#8221; <em>arXiv preprint arXiv:2203.15556</em> 10 (2022).</p><p>[15] Guo, Daya, et al. &#8220;Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.&#8221; <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Although it is true that GRPO is dominant in open research, it is probable that closed frontier labs are using different algorithm variants.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, <a href="https://cameronrwolfe.substack.com/p/olmo-3">Olmo 3</a> uses a total batch size of either 512 or 1,024 for RL training with 8 rollouts per prompt and either 64 or 128 prompts per batch. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For example, <a href="https://cameronrwolfe.substack.com/p/olmo-3">Olmo 3</a> mentions that models use 5-14&#215; more compute for inference compared to policy updates during RL training. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>More details on why this particular KL divergence term was adopted for GRPO can be found in <a href="https://cameronrwolfe.substack.com/i/177823868/deepseekmath-pushing-the-limits-of-mathematical-reasoning-in-open-language-models-1">this discussion</a> of the DeepSeekMath paper. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Performance is measured as an average pass rate computed over 16 generations per prompt over a validation set of 1,000 prompts. Validation performance is measured after every 100 RL training steps.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The main difference between these formulations is the fact that the S-curve used for the learning efficiency in [2] has a fixed steepness exponent of <code>B = 1</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Here, the value of <code>D</code> corresponds to the number of unique examples in the dataset.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Authors note in [3] that scaling law trends are robust to the regularization strategy&#8212;<em>proper regularization only helps to keep training stable</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Several different variants of accuracy can be used as well; e.g., Pass@K or Avg@K.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The Anatomy of an LLM Benchmark]]></title><description><![CDATA[Common patterns used to create the most effective LLM evaluation datasets...]]></description><link>https://cameronrwolfe.substack.com/p/llm-bench</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/llm-bench</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 30 Mar 2026 09:33:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/56cd7776-e590-4fe6-82ca-34a65900b409_2124x1192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!614Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!614Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 424w, https://substackcdn.com/image/fetch/$s_!614Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 848w, https://substackcdn.com/image/fetch/$s_!614Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!614Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!614Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1888304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!614Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 424w, https://substackcdn.com/image/fetch/$s_!614Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 848w, https://substackcdn.com/image/fetch/$s_!614Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 1272w, https://substackcdn.com/image/fetch/$s_!614Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2928f193-21fa-42d9-ab00-b7257a4e28b5_2494x1396.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2, 3, 4, 10, 12])</figcaption></figure></div><p>Throughout the history of AI research, progress has been measured&#8212;<em>and accelerated</em>&#8212;by high-quality benchmarks. AI is an empirical field that is driven by discovering interventions that improve performance on key benchmarks. For large language models (LLMs) in particular, creating useful benchmarks is hard due to rapidly advancing model capabilities. Tough evaluations are regularly saturated as new models are released, creating the need for continual evolution toward harder problems and new dimensions of performance. Despite the pivotal role of benchmarking in driving progress, evaluation has traditionally received less attention compared to core modeling research. Additionally, creating high-quality benchmarks requires unique skills that are emphasized less heavily in the literature. This overview aims to solve these problems by providing an extensive survey of useful LLM benchmarks and the techniques&#8212;<em>including both practical tricks and more recent directions of research</em>&#8212;used to create them. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 65,000 others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>Disclaimer.</strong> Agent and coding benchmarks are notably absent from this overview. These domains are rapidly advancing and require unique evaluation techniques that have led to the creation of completely new areas of research in LLM evaluation. Due to their depth, these topics will require an overview of their own, and <a href="https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us">several</a> <a href="https://epoch.ai/gradient-updates/why-benchmarking-is-hard">useful</a> <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">resources</a> on these topics are already available. </p><h2>Dissecting Popular LLM Benchmarks</h2><p>The best way to understand how LLM benchmarks are created&#8212;<em>and how we can create a useful benchmark for our own task of interest</em>&#8212;is to simply study details of the most popular and effective LLM benchmarks. In this section, we will select a wide variety of LLM benchmarks, including both recent benchmarks and those that have been around for a while, and outline the following characteristics:</p><ul><li><p>How the data is sourced</p></li><li><p>How data quality is ensured</p></li><li><p>How model performance is measured</p></li><li><p>How each benchmark has evolved as models have improved</p></li></ul><p>Admittedly, this section is far from comprehensive&#8212;<em>a vast number of LLM benchmarks exist, and surveying them all would be impossible</em>. Instead, this section optimizes for diversity and aims to provide a wide view of the different kinds of benchmarks that exist and the various strategies that are commonly used to create useful evaluation datasets across these many different domains. </p><h4><a href="https://arxiv.org/abs/2009.03300">Massive Multitask Language Understanding (MMLU)</a> [1]</h4><blockquote><p><em>&#8220;To succeed at our test, future models should be well-rounded, possess extensive world knowledge, and develop expert-level problem solving ability. These properties make the test likely to be an enduring and informative goalpost.&#8221;</em> - from [1]</p></blockquote><p>MMLU is one of the most widely used general knowledge benchmarks for LLMs. The data curation strategy for MMLU is simple: <em>questions are sourced from freely available online sources and manually curated by graduate and undergraduate students</em>. The benchmark contains ~16K questions divided into 57 subjects<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> that span various topics like STEM, humanities, social sciences, and more. The full MMLU benchmark contains a development set of five examples per subject (i.e., used for few-shot prompting), a validation set of 1.5K questions, and the main test set. For each task, we have a minimum of 100 questions in the test set. </p><p><strong>Data format.</strong> The questions within the MMLU benchmark use a multiple choice format, and models are evaluated using a zero or few-shot prompting strategy. Authors of the benchmark specifically avoid open-ended generation due to the increased evaluation complexity. Multiple choice correctness can be validated with string matching, allowing MMLU to be evaluated using accuracy. Several example questions from MMLU are provided below for reference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AhTX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AhTX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 424w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 848w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AhTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png" width="1456" height="964" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:964,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AhTX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 424w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 848w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 1272w, https://substackcdn.com/image/fetch/$s_!AhTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc163096-0bc2-44bc-8506-6c57dcc08502_1526x1010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Difficulty.</strong> Some subjects are separated into sub-tasks based on their difficulty level. More specifically, MMLU defines subjects at elementary, high school, college, and professional levels, where difficulty is inferred from the source of the questions. For example, the professional subset of the Psychology domain pulls from the exam for professional practice in Psychology, whereas the high school subset pulls from advanced placement exams (i.e., tests for high school students). Notably, not all subjects have a task for each difficulty level.</p><div class="pullquote"><p>&#8220;Human-level accuracy on this test varies. Unspecialized humans from Amazon Mechanical Turk obtain 34.5% accuracy on this test. Meanwhile, expert-level performance can be far higher. For example, real-world test-taker human accuracy at the 95th percentile is around 87% for US Medical Licensing Examinations&#8230; We estimate that expert-level accuracy is approximately 89.8%.&#8221; - from [1]</p></div><p>As we might expect, human-level accuracy on MMLU varies significantly based on the human, domain, and level of difficulty being considered. Given that MMLU is still popular even today, several extensions have been proposed (e.g., MMLU-Pro [2] and MMLU-Redux [3]) to diagnose quality issues and to keep the benchmark from becoming saturated by newly-released LLMs over time. </p><blockquote><p><em>&#8220;[Benchmark performance] has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU.&#8221;</em> - from [2]</p></blockquote><p><strong>MMLU-Pro. </strong>We learn in [2] that MMLU has a non-negligible ratio of easy (i.e., knowledge-only or low reasoning) questions, as well as some questions that are flawed or incorrect. To avoid saturation and reduce noise, <a href="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro">MMLU-Pro</a> [2] reconstructs the benchmark in order to make it more accurate, difficult, and discriminative. The 57 subjects from MMLU are consolidated into a set of 14 broader domains, and the majority of easy questions are removed from MMLU-Pro using model-based difficulty filtering. A pool of eight models is tested on each question, and any question that the majority of models answer correctly&#8212;<em>5,886 questions in total</em>&#8212;is removed. From here, the remaining MMLU questions are supplemented with harder questions from sources like <a href="https://arxiv.org/abs/2305.12524">TheoremQA</a> and <a href="https://arxiv.org/abs/2307.10635">SciBench</a>, yielding a final benchmark of ~12K questions; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HV_2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HV_2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 424w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 848w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 1272w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HV_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png" width="1378" height="448" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HV_2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 424w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 848w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 1272w, https://substackcdn.com/image/fetch/$s_!HV_2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d77703-4937-4a8b-a240-60ec24bb4107_1378x448.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>For new data sources, questions are converted into a multiple choice format by asking GPT-4-Turbo to extract a correct answer and generate distractor answers. The result of this process is manually verified by asking human annotators to compare extracted answers to the original solution for each question. To reduce the impact of random guessing, the number of choices for each question is also expanded from four to ten&#8212;<em>this is referred to as &#8220;option augmentation&#8221; in [2]</em>. </p><p>After data filtering and curation, MMLU-Pro undergoes an extensive quality control phase with multiple stages of verification by humans and LLMs. The quality control process aims to identify bad questions, incorrect answers, and false positive distractors. Human validation is performed first, then Gemini-1.5-Pro flags any remaining issues for a second stage of human review. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jHD4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jHD4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 424w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 848w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 1272w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jHD4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png" width="1456" height="808" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:808,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:481886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jHD4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 424w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 848w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 1272w, https://substackcdn.com/image/fetch/$s_!jHD4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bc8ac7a-1e55-4e66-9741-b07a8fdbcb82_1628x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>The full curation pipeline for MMLU-Pro is depicted above. MMLU-Pro still uses accuracy as the main performance metric, though we can also separately examine accuracy within each specific domain. Most LLMs perform worse on MMLU-Pro relative to MMLU&#8212;<em>the benchmark is more difficult and has headroom before saturation</em>&#8212;and model capability gaps tend to be more noticeable. We also see in [2] that MMLU-Pro offers improved prompt stability and benefits from advanced reasoning techniques (e.g., <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain of thought prompting</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hAdG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hAdG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 424w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 848w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 1272w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hAdG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png" width="1456" height="880" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:260568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hAdG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 424w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 848w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 1272w, https://substackcdn.com/image/fetch/$s_!hAdG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab569f99-537e-4e5b-9b31-8dfd649d1786_1614x976.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>MMLU-Redux.</strong> An in-depth quality audit of the MMLU benchmark is performed in [3] over a subset of 100 questions randomly sampled from each MMLU task (i.e., 5,700 questions in total). Quality issues are categorized using a hierarchical error taxonomy; see above. This taxonomy contains five error categories that are used to granularly categorize questions with poor quality or incorrect ground truth. When necessary, questions are re-annotated and verified according to the original source material or, when the original source is absent, a trusted source (e.g., government websites). We see in [3] that an estimated 6.49% of MMLU questions contain errors, but the ratio of errors varies between subjects; e.g., 57% of Virology questions were flagged due to quality issues; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JIjD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JIjD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 424w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 848w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 1272w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JIjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png" width="1456" height="650" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:650,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JIjD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 424w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 848w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 1272w, https://substackcdn.com/image/fetch/$s_!JIjD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fc1765-2681-492e-a7e1-f577e92b47a0_1748x780.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>The result of this sampling and re-annotation procedure is <a href="https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0">MMLU-Redux</a>, a subset of 5,700 manually inspected MMLU questions. For several high-error subjects, authors monitor agreement across three separate annotators using <a href="https://en.wikipedia.org/wiki/Cohen%27s_kappa">Cohen&#8217;s Kappa</a>. Re-annotation agreement is found to be strong even on difficult subjects, providing confidence in the quality of the human-audited data. The aim of this effort is not to produce a harder version of MMLU but rather to audit (and fix or discard) existing questions for quality and accuracy&#8212;<em>MMLU-Redux is an updated subset of MMLU that can be adopted for more reliable evaluation</em>.</p><p>We see in [3] that removing incorrect evaluation data meaningfully impacts performance and model rankings; see below. For example, <a href="https://huggingface.co/meta-llama/Llama-3.1-405B">Llama-3.1-405B</a> improves from 16th to first in rank for Virology and <a href="https://huggingface.co/Qwen/Qwen2-72B-Instruct">Qwen-2-72B-Instruct</a> drops from first to eighth place for College Chemistry when only evaluating on correct instances from MMLU-Redux&#8212;<em>these results suggest improved reliability</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AA10!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AA10!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 424w, https://substackcdn.com/image/fetch/$s_!AA10!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 848w, https://substackcdn.com/image/fetch/$s_!AA10!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!AA10!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AA10!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png" width="571" height="611.1150971599402" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1432,&quot;width&quot;:1338,&quot;resizeWidth&quot;:571,&quot;bytes&quot;:626349,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AA10!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 424w, https://substackcdn.com/image/fetch/$s_!AA10!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 848w, https://substackcdn.com/image/fetch/$s_!AA10!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!AA10!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5943e3e-73b0-4d2f-8dd5-58a10bc816c6_1338x1432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2311.12022">GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark</a> [4]</strong></h4><p>GPQA is another popular LLM benchmark that takes a different approach from MMLU. Namely, GPQA is a much smaller dataset: <em>the extended version contains 596 questions, while the main and diamond subsets contain 448 and 198 questions, respectively</em>. Rather than providing broad coverage, GPQA focuses on curating a small number of expert-verified questions that are difficult to solve even with internet access (i.e., a &#8220;Google-proof&#8221; benchmark). Three primary domains are covered&#8212;<em>Biology, Chemistry, and Physics</em>&#8212;each of which is divided into several sub-domains<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Similarly to MMLU, however, GPQA does adopt a multiple choice question format with four answers per question. </p><div class="pullquote"><p><em>&#8220;We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are &#8220;Google-proof&#8221;).&#8221; - from [4]</em></p></div><p><strong>Expert curation.</strong> The data from GPQA is manually curated by a group of 61 human experts that each have&#8212;<em>or are pursuing</em>&#8212;a PhD in a relevant field. The data curation pipeline for GPQA is depicted below. To begin, experts in each domain write a set of candidate questions. These questions are written from scratch, rather than being collected from existing exams or datasets. As a guiding principle, experts are specifically asked to write questions that are:</p><ul><li><p>Difficult.</p></li><li><p>Answerable by experts in the same domain.</p></li><li><p>Not possible for non-experts to answer, even with internet access.</p></li></ul><p>Questions are always written such that they can be answered with or without choices being presented, thus enabling GPQA to be easily extended to an open-ended generation format in the future. In addition to writing each question, a written explanation is provided for both the answer and all distractors. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wRX7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wRX7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 424w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 848w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wRX7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png" width="665" height="659.0625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1443,&quot;width&quot;:1456,&quot;resizeWidth&quot;:665,&quot;bytes&quot;:810057,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wRX7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 424w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 848w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!wRX7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefc79479-6a1e-4a4a-9570-c3f7288bedb6_1524x1510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>After a question is written, two separate domain experts interact with it. The first expert solves and validates the question, then suggests possible revisions. After the writer revises the question based on suggestions, a second domain expert answers the revised question. Finally, three different non-expert validators&#8212;<em>selected from the group of experts for other, non-overlapping domains</em>&#8212;try to answer the question with unrestricted internet access, spending a minimum of 15 minutes and nearly 40 minutes on average answering each question.</p><blockquote><p><em>&#8220;The process consists of four main stages: question writing, expert validation, question revision, and non-expert validation.&#8221;</em> - from [4]</p></blockquote><p><strong>Verification principles.</strong> The GPQA curation process validates both correctness and difficulty. Correctness is handled via expert validation and revision, while difficulty is assessed based on the ability of non-experts to solve questions. The results of these two stages are used to define the different subsets of GPQA:</p><ul><li><p><em>GPQA Extended</em>: full dataset (546 questions).</p></li><li><p><em>GPQA Main</em>: questions where at least one expert agrees with the answer and at most two non-experts answer the question correctly (448 questions).</p></li><li><p><em>GPQA Diamond</em>: questions where both experts agree with the answer and at most one non-expert answers the question correctly (198 questions). </p></li></ul><p>As shown below, the resulting subsets are quite difficult, with experts achieving around 70-80% accuracy and non-experts a much lower accuracy of 30-40%. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mNfA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mNfA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 424w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 848w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mNfA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png" width="592" height="576.5494505494505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1418,&quot;width&quot;:1456,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:577131,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mNfA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 424w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 848w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!mNfA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d5ec5a0-9c31-473e-9013-b1fdfd83e163_1522x1482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2206.04615">Beyond the Imitation Game Benchmark (BIG-Bench)</a> [5]</h4><blockquote><p><em>&#8220;BIG-bench&#8230; includes a set of 204 or more language tasks. As reflected in the BIG-bench review criteria, benchmark tasks are novel, cover a diverse range of topics and languages, and are not fully solvable by current models.&#8221;</em> - from [5]</p></blockquote><p>BIG-Bench explores a community-based strategy for curating difficult LLM evaluation tasks. The benchmark was <a href="https://github.com/google/BIG-bench/tree/main">openly constructed on Github</a>, where researchers were asked to contribute tasks by creating a pull request. Each task was then manually reviewed in a corresponding PR discussion according to <a href="https://github.com/google/BIG-bench/blob/main/docs/doc.md#review-criteria-for-submissions">detailed submission criteria</a>; e.g., correctness, difficulty, decontamination, and justification (i.e., <em>why is this an important task for LLMs to solve?</em>). The version of BIG-Bench outlined in [5] contains 204 tasks that were curated by 405 authors. The set of included tasks is incredibly broad, covering topics like math, coding, reasoning, science, and more; see <a href="https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/keywords_to_tasks.md#summary-table">here</a> for a summary of task domains. </p><p><strong>Task interface.</strong> Unlike the benchmarks we have seen so far, BIG-Bench does not have any unified data format&#8212;<em>tasks have varying formats ranging from multiple choice to open-ended generation and multi-turn (interactive) chat</em>. In order to handle the diversity of tasks present in BIG-Bench, authors introduce a standard API structure that is used by all tasks. This API specifies two task types:</p><ol><li><p><em>JSON</em>: defined by a JSON file containing a list of input-output examples.</p></li><li><p><em>Programmatic</em>: defined by a Python function that can interact directly with the model over multiple chat turns and compute custom metrics.</p></li></ol><p>By using these standardized structures for all tasks, we can easily evaluate any public model or onboard new tasks with minimal implementation changes. The distribution of BIG-Bench tasks follows an 80-20 split between JSON and programmatic task types. In programmatic tasks, we interact with the model via two standard functions:</p><ol><li><p><code>generate_text</code>: generate a text continuation from the model.</p></li><li><p><code>cond_log_prob</code>: compute log probabilities of a target given input.</p></li></ol><p>The model can be queried multiple times within a programmatic task, enabling support for multi-turn chat or iterative tasks within BIG-Bench. Each task must have a minimum of 32 evaluation samples, though authors are encouraged to create much larger tasks; see below for a distribution of task sizes. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F_Y3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F_Y3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 424w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 848w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 1272w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F_Y3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png" width="443" height="255.17676143386896" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:809,&quot;resizeWidth&quot;:443,&quot;bytes&quot;:32368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F001b15fd-ea16-4d07-9605-07b8a46b8020_809x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F_Y3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 424w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 848w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 1272w, https://substackcdn.com/image/fetch/$s_!F_Y3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef14124-0ffe-4237-858b-5f9219a11aa9_809x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Performance metrics.</strong> Given that BIG-Bench tasks follow a variety of formats, we cannot evaluate all tasks with a unified performance metric like accuracy. Instead, a <a href="https://github.com/google/BIG-bench/blob/main/docs/doc.md#available-metrics">suite of standard metrics</a> is provided for all tasks, and programmatic tasks are even allowed to define their own custom metrics. In [5], authors list the following performance metrics as being used in BIG-Bench:</p><ul><li><p><em>Exact String Match</em>.</p></li><li><p><em>Multiple Choice Accuracy.</em></p></li><li><p><em>Text Similarity Metrics</em> (e.g., <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://arxiv.org/abs/2004.04696">BLEURT</a>, or <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>).</p></li><li><p><em>Multi-Category <a href="https://en.wikipedia.org/wiki/Brier_score">Brier Score</a></em>: evaluates the calibration&#8212;<em>a measure of how well confidence<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> aligns with observed correctness</em>&#8212;of a model&#8217;s outputted probabilities on options for a multiple choice question.</p></li><li><p><em><a href="https://arxiv.org/abs/1706.04599">Expected Calibration Error</a></em>: another calibration metric that measures how well the model&#8217;s accuracy matches the probability assigned to a response in the multiple choice setting.</p></li></ul><p>Interestingly, BIG-Bench even allows multiple evaluation metrics to be defined per task, but one metric must be defined as the primary metric. Additionally, each task must specify a high and low reference score on the primary metric. Using this information, we can normalize each task&#8217;s preferred metric using the high and low reference scores. Then, we can compute aggregate performance over the entire benchmark by averaging normalized metrics across tasks&#8212;<em>this approach summarizes benchmark performance with a single score in the range [0, 100]</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k-Eo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k-Eo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 424w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 848w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 1272w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k-Eo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png" width="656" height="519.2900158478606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:999,&quot;width&quot;:1262,&quot;resizeWidth&quot;:656,&quot;bytes&quot;:377960,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k-Eo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 424w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 848w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 1272w, https://substackcdn.com/image/fetch/$s_!k-Eo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9112b28a-a125-40df-b890-d0e51033bd99_1262x999.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>As shown above, all models at the time of BIG-Bench&#8217;s proposal performed well below human baseline performance; see above. Although performance improves with model scale, all models perform poorly in an absolute sense, indicating that the benchmark was quite difficult to solve for models at that time. Human performance metrics in the above plot&#8212;<em>reported as both a max and mean score across multiple annotators</em>&#8212;were collected using a team of expert annotators that were given full internet access. However, properly measuring human performance is difficult given the breadth of tasks present in BIG-Bench.</p><div class="pullquote"><p><em>&#8220;While we report mean and max human rater scores for all tasks evaluated by raters, care must be taken when interpreting these metrics. We do not claim that these scores are the best possible achievable by a human, or even that these scores are the best achievable by these particular evaluators&#8230; For example, if a task requires knowledge of programming, how do we weight scores of evaluators who do not know how to program?&#8221; - from [5]</em></p></div><p><strong>BIG-Bench Lite.</strong> The size and breadth of BIG-Bench makes it computationally expensive to run. To solve this, authors in [5] provide a smaller task subset, called BIG-Bench Lite, to use for faster evaluation. This subset is made up of 24 JSON-style tasks that are chosen via a manual selection process that considers task diversity and inclusion of specific task types (e.g. coding or non-English tasks). </p><p><strong>BIG-Bench Hard (BBH).</strong> Less than a year after the release of BIG-Bench, LLMs had already begun to surpass average human performance on the majority of tasks. BIG-Bench Hard [6], a difficult subset of the BIG-Bench dataset, was created in response to these quick improvements in capabilities. The steps used to select the tasks within BIG-Bench Hard are outlined in the table below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RcJo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RcJo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 424w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 848w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 1272w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RcJo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png" width="1035" height="384" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:384,&quot;width&quot;:1035,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RcJo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 424w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 848w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 1272w, https://substackcdn.com/image/fetch/$s_!RcJo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faee4251b-a7e5-4da6-9678-3a4d153f30a0_1035x384.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>All tasks in BIG-Bench Hard are derived from BIG-Bench. Initially, tasks are filtered according to several heuristics; e.g., not containing too many subtasks, having too few evaluation examples, or using evaluation metrics beyond multiple choice or exact match accuracy. Any task without a human performance baseline is also removed, and the remaining task subset is further refined by only retaining tasks where models underperform humans. From here, tasks are then manually inspected to remove any tasks that are overly difficult or out of scope<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, leaving us with the final set of 23 tasks in BIG-Bench Hard; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QmII!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QmII!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 424w, https://substackcdn.com/image/fetch/$s_!QmII!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 848w, https://substackcdn.com/image/fetch/$s_!QmII!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!QmII!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QmII!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png" width="1188" height="1124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1188,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:372779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QmII!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 424w, https://substackcdn.com/image/fetch/$s_!QmII!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 848w, https://substackcdn.com/image/fetch/$s_!QmII!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!QmII!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa006545-6b0c-4995-bbba-17f6cd2126f1_1188x1124.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>Despite focusing on a much smaller set of difficult tasks&#8212;<em>about 10% of the original benchmark</em>&#8212;that have a standard format, BIG-Bench Hard is mostly able to maintain the breadth of BIG-Bench. The tasks present in BIG-Bench Hard can be roughly categorized into natural language (e.g., detecting translation errors or recommending movies) and algorithmic (e.g., evaluating boolean expressions or performing multi-step arithmetic) tasks. When examining model performance on BIG-Bench Hard, we see that the models considered in [6] usually surpass average human performance but fall short of the best performance of a human. However, the best LLMs today achieve almost perfect accuracy on BIG-Bench Hard. </p><p>Given that BIG-Bench is constructed as a community effort, benchmark tasks have a high level of variance&#8212;<em>cleanliness and quality fluctuate, and each task may have different metadata.</em> Tasks are selected based on both quality and difficulty by using a combination of heuristics and manual inspection. Additionally, BIG-Bench Hard restricts the benchmark to tasks that use an exact match or multiple choice format. This choice is made to simplify the analysis of <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain of thought prompting</a> by enabling the use of a unified prompt format across different tasks. In this way, BIG-Bench Hard does not solely maximize difficulty&#8212;<em>it identifies a subset of hard tasks that also work well with chain of thought prompting</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CHdl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CHdl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 424w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 848w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 1272w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CHdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png" width="617" height="311.6109243697479" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:1190,&quot;resizeWidth&quot;:617,&quot;bytes&quot;:144162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CHdl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 424w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 848w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 1272w, https://substackcdn.com/image/fetch/$s_!CHdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3bda76b-88cd-4bbd-a60d-07046c35ab8a_1190x601.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>As shown above, several top models at the time of release for BIG-Bench Hard noticeably underperform the average human baseline. This gap can be closed in many cases via chain of thought prompting, but benchmark performance still falls short of maximum human performance for even the largest models.</p><p><strong>BIG-Bench Extra Hard (BBEH). </strong>The BIG-Bench family is one of the few reasoning-focused evaluation suites that prioritizes general reasoning rather than math and coding. However, both BIG-Bench and BIG-Bench Hard were saturated by early 2025, with top reasoning models achieving nearly perfect scores on both benchmarks. As a solution, BIG-Bench Extra Hard was created by replacing each of the BIG-Bench Hard tasks with a corresponding task that tests a similar category of reasoning capabilities but is significantly more difficult. </p><blockquote><p><em>&#8220;BIG-Bench Extra Hard replaces each task in BIG-Bench Hard with a novel task that probes a similar reasoning capability [with] increased difficulty.&#8221;</em> - from [5]</p></blockquote><p>Examples of new reasoning skills tested by BIG-Bench Extra Hard include many-hop reasoning, long context reasoning, properly handling distractors, finding errors in reasoning traces, reasoning under constraints, and more. To perform well on BIG-Bench Extra Hard, models must command a breadth of different reasoning capabilities. An itemized list of the reasoning tasks present in BIG-Bench Extra Hard is provided in the figure below. Each task matches the general reasoning domain of some corresponding task in BIG-Bench Hard, ensuring that the diversity of BIG-Bench Hard is preserved while increasing task difficulty.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d8R7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d8R7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 424w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 848w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 1272w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d8R7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png" width="1456" height="1639" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1639,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:790908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d8R7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 424w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 848w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 1272w, https://substackcdn.com/image/fetch/$s_!d8R7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d88ad2b-7e27-468a-b721-0406d8a72cd8_1578x1776.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>As seen in the middle column of the table, tasks in BIG-Bench Extra Hard are sourced from a variety of existing reasoning benchmarks and manually chosen according to their topic and difficulty. When curating the benchmark, authors aim to solve the following known issues with BIG-Bench Hard:</p><ul><li><p>Many tasks have high random chance performance due to the presence of multiple choice questions with a small number of options (e.g., ~35% of tasks have binary output and ~20% of tasks use multiple choice with &lt;5 options).</p></li><li><p>Some tasks permit shortcuts that allow the task to be &#8220;solved&#8221; without actually reasoning through a proper solution.</p></li><li><p>Task inputs tend to be very short&#8212;<em>around 700 characters on average</em>&#8212;across BIG-Bench Hard tasks, which is unrealistic compared to how LLMs are typically used in practice.</p></li><li><p>True multi-hop reasoning is rarely tested in BIG-Bench Hard due to limitations in LLM capabilities when the benchmark was created.</p></li></ul><p>Ideally, we would like to solve all of these issues while expanding the set of reasoning capabilities being tested by the benchmark. BIG-Bench Extra Hard tasks contain 200 questions&#8212;<em>except for DisambiguationQA, which has only 120</em>. Although the task selection process was mostly manual, data was curated using a combination of manual human inspection with model assistance. Two models are used&#8212;<em>a general purpose model and a reasoning model (both Gemini-based)</em>&#8212;to iteratively evaluate data that is selected for each task. Tasks that were easily solved by the reference models were either <em>i)</em> discarded and replaced with more difficult tasks or <em>ii)</em> enhanced with harder reasoning examples. This process continued until both models achieved an accuracy below 70% on each task.</p><blockquote><p><em>&#8220;In most cases, we tried to use the reference models only as a black box that provided feedback on the difficulty of our tasks. In some cases, however, making tasks more difficult required looking into the approach adopted by the model.&#8221;</em> - from [7]</p></blockquote><p>The combination of human and model oversight in BIG-Bench Extra Hard is interesting and provides motivation for unique ways in which humans can interact with LLMs to curate better evaluation data. For example, authors in [7] even mention manually inspecting reasoning traces from the models to help them think of more difficult examples that would actually challenge the model. Tasks in BIG-Bench Extra Hard have significantly expanded context compared to the prior benchmark, have negligible random chance performance, and provide a lot of headroom in performance even for top models (e.g., o3-mini); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XHOU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XHOU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 424w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 848w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 1272w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XHOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png" width="1456" height="375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:375,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360450,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XHOU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 424w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 848w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 1272w, https://substackcdn.com/image/fetch/$s_!XHOU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a4f8b1c-32ab-4e88-b8a4-1411fb7f9414_2440x628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2311.07911">IFEval</a> [8] and <a href="https://arxiv.org/abs/2507.02833">IFBench</a> [9]</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!peOK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!peOK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 424w, https://substackcdn.com/image/fetch/$s_!peOK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 848w, https://substackcdn.com/image/fetch/$s_!peOK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 1272w, https://substackcdn.com/image/fetch/$s_!peOK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!peOK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png" width="1456" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!peOK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 424w, https://substackcdn.com/image/fetch/$s_!peOK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 848w, https://substackcdn.com/image/fetch/$s_!peOK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 1272w, https://substackcdn.com/image/fetch/$s_!peOK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f0c3b7-ec87-4bde-b2f8-9aaee986c3e3_1594x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>The <strong>IFEval</strong> [8] benchmark tests LLM instruction following capabilities, with an emphasis on instructions that are objectively verifiable (i.e., as opposed to instructions that are more subjective). For example, if we instruct an LLM to generate an output containing 100 to 200 words, we can easily verify whether this instruction was followed by using a basic script. However, verifying whether an LLM obeys a certain tone specification in its output is less straightforward. </p><blockquote><p><em>&#8220;The task of precise instruction following evaluates a language model&#8217;s ability to perform a task t, such as summarization or creative writing, while adhering to one or more output constraints c, which can be automatically verified.&#8221;</em> - from [9]</p></blockquote><p>To start, 25 instructions&#8212;<em>structured as verifiable constraint templates for the model&#8217;s output&#8212;</em>are manually curated based on practicality and verifiability; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7wbb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7wbb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 424w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 848w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 1272w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7wbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png" width="1154" height="1704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1704,&quot;width&quot;:1154,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:449357,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7wbb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 424w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 848w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 1272w, https://substackcdn.com/image/fetch/$s_!7wbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e087f64-63f5-4af8-997a-6b1d0038e675_1154x1704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>From these instructions, evaluation samples are curated as follows:</p><ol><li><p>Create a set of base prompts<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p></li><li><p>Combine these base prompts with one to three randomly selected verifiable instructions by concatenating instructions to the end of the prompt. </p></li><li><p>Use few-shot prompting and manual inspection to identify instruction combinations that are illogical or contain conflicts. </p></li><li><p>Use few-shot prompting to rephrase each prompt and, in turn, improve the diversity of instructions in the benchmark. </p></li><li><p>Manually review all rephrased prompts.</p></li></ol><p>Exact details of the data curation process are not fully outlined in [8]. However, we know from the information provided that a model-in-the-loop approach is used with manual human review to ensure quality. To measure performance, a binary verification check is created for each instruction that can be used to determine if a model followed an instruction or not. Instruction-level binary verification signals can be used to compute the following strict metrics:</p><ul><li><p><em>Instruction-level strict accuracy</em>: the percentage of all individual instructions that the model follows.</p></li><li><p><em>Prompt-level strict accuracy</em>: the percentage of prompts for which the model follows all instructions. </p></li></ul><p>Additionally, several loose metrics are considered in [8] that perform verification under a variety of transformations to the model output (e.g., removing markdown and removing the first or last lines). After applying a transformation, we can compute instruction and prompt-level accuracy similarly to before, resulting in a loose version of each metric. An instruction is considered solved if it passes verification after any of the possible transformations that are tested. </p><blockquote><p><em>&#8220;The new constraints we introduce were created manually &#8211; sourced by collecting feedback from LM users beyond the authors on the types of constraints they have tried with models, or manually written to cover core instruction following skills. Then, we filtered constraints for the benchmark to those that can be easily paired with a verification function written in Python, making for reproducible evaluation and training tools.&#8221;</em> - from [9]</p></blockquote><p>The IFEval benchmark only tests 25 instructions and, therefore, risks overfitting to a small set of constraints. As a solution, <strong>IFBench</strong> [9] proposes an expanded set of 58 verifiable, manually-curated constraints. When deriving new constraints, authors <em>i)</em> inspect feedback from LLM users on instruction following issues, <em>ii)</em> focus on core areas of instruction following, <em>iii)</em> emphasize difficult constraints, and <em>iv)</em> only use constraints that can be verified with a Python function. Going further, an additional set of 29 constraints (IFTrain) are provided for training purposes. These training constraints can be used for <a href="https://cameronrwolfe.substack.com/i/177823868/reinforcement-learning-from-verifiable-rewards-rlvr">RLVR training</a>, enabling investigation into the generalization properties of instruction following. </p><p>The 58 constraints in IFBench are grouped into seven categories&#8212;<em>count, ratio, words, sentence, format, custom, and copy</em>&#8212;that cover a broad range of instruction following skills. To create prompts for these instructions, authors take unseen prompts from <a href="https://arxiv.org/abs/2405.01470">WildChat</a> and combine them with either one or two constraints from the expanded set. Every test prompt is manually inspected by a human annotator to ensure constraint compatibility, and the final benchmark consists of 300 total prompts. As shown below, performance on IFBench is noticeably lower than on IFEval, indicating some level of overfitting to specific constraints. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7MvV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7MvV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 424w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 848w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 1272w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7MvV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png" width="1456" height="913" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:913,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7MvV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 424w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 848w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 1272w, https://substackcdn.com/image/fetch/$s_!7MvV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8e81c15-a954-48dc-8676-263b7d860d6e_1566x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Authors in [9] provide a potential reason for the overfitting to IFEval constraints. Many LLMs have curated training data that specifically targets instruction following capabilities. Most of this training data is synthetically generated because precise instruction following can be deterministically verified. Given the popularity of IFEval, model developers often adopt the same constraint taxonomy when generating synthetic instruction following data; see <a href="https://arxiv.org/abs/2406.11704">Nemotron-4 340B</a> as an example. As a result, some models may be explicitly trained to follow the same constraints being tested by IFEval, leading to inflated performance metrics<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><h4><a href="https://arxiv.org/abs/2404.04475">AlpacaEval</a> [13]</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sY_w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sY_w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 424w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 848w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 1272w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sY_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png" width="1456" height="968" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sY_w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 424w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 848w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 1272w, https://substackcdn.com/image/fetch/$s_!sY_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9c543af-2299-4e8a-81a4-1210ca8222d3_2082x1384.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Judge prompt from AlpacaEval (<a href="https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4/alpaca_eval.txt">source</a>)</figcaption></figure></div><p>AlpacaEval is a pairwise instruction following benchmark that measures model performance by using an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a> to compare candidate model completions to those of a baseline model; see above. The most recent version of AlpacaEval uses GPT-4-Turbo as both the baseline and judge model. The data used in AlpacaEval is sourced from the earlier <a href="https://arxiv.org/abs/2305.14387">AlpacaFarm</a> dataset, which contains a total of 805 prompts derived by combining the evaluation sets from:</p><ul><li><p><a href="https://arxiv.org/abs/2212.10560">Self-Instruct</a></p></li><li><p><a href="https://arxiv.org/abs/2304.07327">Open Assistant</a></p></li><li><p><a href="https://arxiv.org/abs/2204.05862">Anthropic Helpfulness</a></p></li><li><p><a href="https://lmsys.org/blog/2023-03-30-vicuna/">Vicuna</a></p></li><li><p><a href="https://arxiv.org/abs/2408.08146https://bair.berkeley.edu/blog/2023/04/03/koala/">Koala</a></p></li></ul><p>Despite the variety of data sources, most of this data is curated using a similar approach. For example, Self-Instruct proposes a synthetic data generation strategy for instruction tuning, but prompts from the evaluation dataset for Self-Instruct are manually written by human experts. Similarly, Anthropic helpfulness is a human preference dataset, while the Vicuna and Koala test sets are manually curated by researchers working on the projects. The only outlier of these evaluation sets is Open Assistant, which is derived from crowdsourced human conversations with an LLM, rather than being curated by experts. </p><div class="pullquote"><p>&#8220;AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, and reliable. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. Responses are compared to reference responses by the provided GPT-4 based auto-annotators [to compute a win rate]. AlpacaEval displays a high agreement rate with ground truth human annotations.&#8221; - <a href="https://tatsu-lab.github.io/alpaca_eval/">AlpacaEval</a></p></div><p>After the initial release of AlpacaEval, several follow-up versions of the benchmark were published, but the underlying evaluation data did not change much. Instead, subsequent improvements to AlpacaEval focused on changing the reference and judge models to improve the benchmark&#8217;s correlation with human preferences. Full code and updates to AlpacaEval can be found <a href="https://github.com/tatsu-lab/alpaca_eval">here</a>. </p><h4>Math Evaluation</h4><p>Many evaluation datasets exist in the math domain, and most of them are either <em>i)</em> expert-curated or <em>ii)</em> drawn from test banks for math competitions. For example, <a href="http://huggingface.co/datasets/openai/gsm8k">GSM-8K</a> contains 8.5K human-written grade school math problems, while <a href="https://huggingface.co/datasets/EleutherAI/hendrycks_math">MATH</a> contains 12.5K questions compiled from high school math tests. Additionally, the <a href="https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination">American Invitational Mathematics Examination (AIME)</a>, which is commonly used to evaluate LLMs, is released every year with a set of 15 new questions. Questions from the <a href="http://en.wikipedia.org/wiki/American_Mathematics_Competitions">American Mathematics Competitions (AMC)</a> are also commonly used for LLM evaluation. Solutions to questions in these benchmarks are usually graded with an <a href="https://github.com/huggingface/Math-Verify">automatic verifier</a> or exact string matching. </p><p>The benchmarks outlined above have been saturated by modern LLMs, but many frontier-level math benchmarks have been recently proposed:</p><ul><li><p><a href="https://epoch.ai/frontiermath">FrontierMath</a> contains hundreds of expert-crafted problems at the cutting edge of mathematical research that require hours or days to be solved by an expert-level researcher. </p></li><li><p><a href="https://arxiv.org/abs/2505.12575">RealMath</a> is a continuously-evolving benchmark that automatically updates with new problems derived from research papers and discussion forums.</p></li><li><p><a href="https://arxiv.org/abs/2505.23281">MathArena</a> is an evolving benchmark that evaluates LLMs on math competition problems soon after their release to avoid contamination risk.</p></li><li><p><a href="https://arxiv.org/abs/2410.07985">OmniMath</a> contains 4.5K competition-level math problems that have been annotated by human experts, covering a diverse range of topics (i.e., over 30 sub-domains) and difficulty levels.</p></li></ul><p>Solutions to questions in these benchmarks are still commonly evaluated with automatic verifiers, but this is not always the case. For example, proof-based questions in MathArena are manually checked by human experts. Despite the impressive math capabilities of modern LLMs, most of these frontier-level math benchmarks have not yet been fully saturated. However, LLMs are advancing rapidly in their capabilities, so several of these datasets are designed in a way that enables continual evolution in order to avoid contamination and saturation.</p><h4>Iteratively Improving a Benchmark</h4><p>When studying the benchmarks outlined above, we see several examples of iterative benchmark refinement. Benchmarks become saturated and less informative over time, which is usually addressed by releasing an improved benchmark. To create such an improved benchmark, there are several common techniques and directions that are usually followed, such as:</p><ul><li><p><em>Difficulty-based refinement</em>: curating more difficult tasks or data to use for evaluation within a benchmark.</p></li><li><p><em>Quality-based refinement</em>: identifying and fixing issues in the benchmark (e.g., mislabeled data, vague or unrealistic questions, poor format, etc.).</p></li><li><p><em>Diversity-based refinement</em>: expanding the scope of questions and topics covered by a particular benchmark. </p></li></ul><p>Usually, these directions of improvement are handled via manual human review, a model-in-the-loop approach, or some combination of both. In some cases, we can even design a benchmark in a way that continually evolves over time without too much manual effort (e.g., RealMath and MathArena). However, the range of techniques that can be used for iterative benchmark improvement is vast&#8212;<em>there is a lot to learn in this area</em>. To provide pointers for future learning, a set of useful resources for benchmark improvement is listed below:</p><ul><li><p><em><a href="https://arxiv.org/abs/2502.03461">Do Large Language Model Benchmarks Test Reliability?</a></em>: corrects labeling errors in common LLM benchmarks to better measure LLM reliability.</p></li><li><p><em><a href="https://arxiv.org/abs/2410.20245">Improving Model Evaluation using SMART Filtering of Benchmark Datasets</a></em>: a framework for systematically identifying and filtering evaluation data that is too easy, similar to other questions, or possibly contaminated. </p></li><li><p><em><a href="https://arxiv.org/abs/2406.11939">From Crowdsourced Data to High-Quality Benchmarks</a></em>: an LLM-based approach for post-processing crowdsourced data into high-quality evaluation samples.</p></li><li><p><em><a href="https://arxiv.org/abs/2503.13335">Reliable and Efficient Amortized Model-based Evaluation</a></em>: a model-based approach for difficulty filtering and difficult question generation. </p></li><li><p><em><a href="https://arxiv.org/abs/2406.08723">Evidence-Centered Benchmark Design for NLP</a></em>: an evidence-backed framework for properly designing evaluation benchmarks. </p></li><li><p><em><a href="https://huggingface.co/spaces/OpenEvals/evaluation-guidebook">Evaluation Guidebook (from Hugging Face)</a></em>: a practical field guide for evaluating LLMs, assessing benchmark quality, and curating evaluation data.</p></li></ul><p>There are also many papers that have been proposed for optimally selecting subsets of benchmark data to improve efficiency [14, 15, 16, 17]. </p><h2>Advanced Benchmarking for LLMs</h2><p>Now that we understand practical details for constructing LLM benchmarks, we will take a deeper look at some advanced techniques for LLM evaluation that have been proposed in recent research. Specifically, we will focus on a set of papers that use <a href="https://en.wikipedia.org/wiki/Item_response_theory">Item Response Theory (IRT)</a> to select the most informative data for evaluation. Coming from the field of <a href="https://en.wikipedia.org/wiki/Psychometrics">psychometrics</a>, IRT uses statistical modeling to dynamically measure how an individual&#8217;s latent abilities interact with the properties of an item (or question) to determine the probability of a correct response. Although IRT is commonly applied in standardized testing environments, the same concepts have been adopted by LLM researchers. We can directly apply techniques from IRT to LLM evaluations by considering the LLM as our individual and the evaluation dataset as our standardized test!</p><p>In the context of LLM evaluations, IRT considers a model <code>l</code>, dataset items <code>i</code>, and the probability <code>p_il</code> that model <code>l</code> gets item <code>i</code> correct. We can use a variety of different models&#8212;<em>usually just different variants of logistic regression</em>&#8212;to predict this probability. IRT models include parameters for both the model and the item being evaluated. Whereas model parameters capture the capabilities of a given model, item parameters capture the following properties:</p><ul><li><p><em>Difficulty</em>: whether the item is easy or difficult to answer correctly.</p></li><li><p><em>Discrimination</em>: whether answer correctness has a strong relationship with the capability level of a model.</p></li></ul><p>By capturing these properties within our IRT model, we gain a rich description of our evaluation data that can be directly applied to benchmark improvement. For example, items with low discrimination are often problematic (e.g., due to mislabeling), and we can consider filtering out items that are too easy from the evaluation process. Within this section, we will see several IRT formulations that demonstrate a broad set of potential applications to the evaluation process. </p><h4><a href="https://arxiv.org/abs/2402.14992">tinyBenchmarks: Evaluating LLMs with Fewer Examples</a> [11]</h4><blockquote><p><em>&#8220;Evaluating the performance of a single LLM on HELM costs over 4K GPU hours (or over $10K for APIs). Benchmarks like AlpacaEval also require a commercial LLM as a judge to perform evaluation, further increasing the costs&#8230; evaluation of a single model is often performed many times to&#8230; explore different prompting strategies or a wider range of hyperparameters.&#8221;</em> - from [11]</p></blockquote><p>To mitigate excessive inference costs during evaluation, an IRT-based approach called tinyBenchmarks is proposed in [11] that intelligently samples evaluation data in a way that maintains the accuracy of a model&#8217;s performance metrics. We assume access to a dataset of historical evaluation results that can be used for selection and performance estimation. More specifically, this dataset contains items <code>i</code> and models <code>l</code>, where each item and model combination has a binary score <code>Y_il &#8712; {0, 1}</code>. We can also handle continuous evaluation results in the range [0, 1]&#8212;<em>nearly any evaluation setting can be converted into this format by normalizing scores</em>&#8212;by simply binarizing scores according to a fixed threshold. </p><p><strong>Baselines.</strong> There are a few simple and effective approaches that can be adopted to sample a subset of data from an evaluation dataset:</p><ol><li><p><em>Stratified random sampling</em>: ensure proportional representation across benchmark sub-domains by randomly sampling a subset of evaluation samples separately within each subdomain.</p></li><li><p><em>Correctness-based clustering</em>: sample evaluation data based on patterns in correctness by representing each item <code>i</code> as a vector of correctness scores for each model <code>l</code>, performing <a href="https://en.wikipedia.org/wiki/K-means_clustering">K-means clustering</a> on these vectors, and selecting the evaluation samples closest to each cluster centroid.</p></li></ol><p>Despite their simplicity, these techniques have notable drawbacks. Stratified sampling leads to high variance and uncertainty when the number of samples is small, while correctness-based clustering tends to suffer from the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> if we have evaluation results from a large model pool. </p><p><strong>IRT model.</strong> In [11], IRT is used to derive a much smaller representation of our evaluation data that can be more effectively used to both select samples and estimate performance. We define item <code>i</code> using two parameters:</p><ul><li><p><code>&#945;_i</code>: captures the skills required to solve item <code>i</code>.</p></li><li><p><code>&#946;_i</code>: captures the overall difficulty of item <code>i</code>.</p></li></ul><p>Similarly, we describe model <code>l</code> with the parameter <code>&#952;_l</code>, which captures model capabilities. From here, we define a multidimensional IRT model, which predicts the probability <code>p_il</code> that item <code>i</code> will be answered correctly by model <code>l</code>; see below. We can fit the IRT model&#8212;<em>or learn the correct values for all of the model and item parameters</em>&#8212;by using our historical evaluation dataset as training data. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T5WX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T5WX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 424w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 848w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 1272w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T5WX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png" width="1456" height="469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T5WX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 424w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 848w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 1272w, https://substackcdn.com/image/fetch/$s_!T5WX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e8ef8ba-e2f9-431c-aac1-475aa760020c_1882x606.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two parameter multidimensional IRT model (from [11])</figcaption></figure></div><p>As we can see, the center point of this equation is the inner product of the item and model parameter, which captures how well the capabilities of a model match those needed for an item. Intuitively, a model is more likely to answer an item correctly if it has strong capabilities in the same directions required to solve an item and vice versa. Additionally, we add an extra bias term to this inner product to account for overall item difficulty before passing the full expression through a sigmoid (or logistic) function to yield a probability in the range <code>[0, 1]</code>.</p><blockquote><p><em>&#8220;The IRT model creates a meaningful representation for each example i based on their difficulty and the abilities required to respond to those examples correctly. This approach immediately solves the dimensionality problem, since E_i is low-dimensional&#8230; IRT should represent which examples have similar difficulty and require similar abilities.&#8221;</em> - from [11]</p></blockquote><p>Once fitted, parameters of the IRT model naturally provide a <code>d + 1</code> dimensional vector <code>E_i = (&#945;_i, &#946;_i)</code> that can be used to represent items in our evaluation dataset. This representation is low dimension (<code>d &lt; 16</code> in [11]) compared to vectors used for correctness-based clustering, thus solving issues related to the curse of dimensionality. The IRT model is used in two ways in [11]:</p><ol><li><p>To perform cluster-based sampling, similarly to correctness-based clustering (but with embeddings from the IRT model <code>E_i</code>).</p></li><li><p>To predict model performance over items&#8212;<em>this is more efficient than actually running the evaluation itself</em>. </p></li></ol><p><strong>p-IRT estimator.</strong> In [11], the two approaches described above are used in tandem to efficiently estimate model performance on an evaluation set. Assume we want to evaluate a new model <code>l&#8217;</code> on an existing evaluation set for which we already have an IRT model fitted. We can use clustering to identify &#8220;anchor points&#8221;&#8212;<em>or high-signal evaluation samples</em>&#8212;in the data and evaluate our model only on these samples. The number of anchor points is a hyperparameter that can change with our evaluation budget. We can then keep our existing item parameters fixed in the IRT model and only train the parameter for our new model <code>&#952;_l&#8217;</code>, using real evaluation results on our anchor points as training data. After obtaining <code>&#952;_l&#8217;</code>, we can predict performance on the remaining items by using our IRT model. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nrB8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nrB8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 424w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 848w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nrB8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nrB8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 424w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 848w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!nrB8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0332c9db-87fa-4508-be90-01d425a7697f_1816x1020.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Efficiently estimating evaluation metrics with p-IRT (from [11])</figcaption></figure></div><p>A formal description of this approach, called the p-IRT estimator in [11], is outlined above. Put simply, we are interested in measuring the model&#8217;s actual performance on the full evaluation set, but running an entire benchmark is expensive. Instead, we use IRT model parameters to obtain <code>K</code> anchor points via clustering&#8212;<em>where </em><code>K</code><em> is much smaller than the full dataset size</em>&#8212;and only evaluate our model on these anchor points. Then, we can estimate performance on the rest of the evaluation dataset using the IRT model and derive an overall performance estimate by averaging real and predicted evaluation results; see above.</p><p>Beyond the p-IRT estimator, we can estimate performance with a sample average of the model&#8217;s performance on the anchor points only. This sample average has low bias because we are using correctness values obtained from our model on the actual evaluation data. However, the variance of the sample average is high when the number of anchor points <code>K</code> is small. On the other hand, the p-IRT estimator is biased&#8212;<em>due to the fact that our IRT model is not perfectly accurate</em>&#8212;but has low variance. Therefore, we can create an estimator that combines the strengths of both approaches by taking a <a href="https://en.wikipedia.org/wiki/Convex_combination">convex combination</a> of each estimate; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vdFm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vdFm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 424w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 848w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 1272w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vdFm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png" width="393" height="220.77799227799227" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09d89e90-caec-4762-a1c6-76930164e678_1036x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1036,&quot;resizeWidth&quot;:393,&quot;bytes&quot;:101225,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vdFm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 424w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 848w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 1272w, https://substackcdn.com/image/fetch/$s_!vdFm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09d89e90-caec-4762-a1c6-76930164e678_1036x582.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">IRT++ estimator (from [11])</figcaption></figure></div><p>This revised estimator is referred to as IRT++ in [11]. The per-item weight in this expression is optional but can be used to assign non-uniform weights to anchor points. For example, this weight can correspond to the ratio of evaluation samples present in the cluster used to derive a given anchor point. In [11], <code>&#955;</code> lies in the range <code>[0, 1]</code>, and the optimal value of <code>&#955;</code> depends upon several factors (e.g., the number of anchor points and the variance of our performance estimate). The value of <code>&#955;</code> is derived in [11] by using a heuristic proposed in <a href="https://ieeexplore.ieee.org/document/716194">prior work</a>. </p><p><strong>Efficient evaluation.</strong> To test the efficacy of IRT-based performance estimation, four benchmarks are considered&#8212;<em><a href="https://huggingface.co/open-llm-leaderboard">Open LLM Leaderboard</a>, MMLU [1], <a href="https://arxiv.org/abs/2211.09110">HELM</a>, AlpacaEval 2.0 [13]</em>&#8212;and we compare the estimated and actual performance on each benchmark. Training data for the IRT model is collected from a large number of LLMs&#8212;<em>395 models for Open LLM leaderboard and MMLU, 30 models for HELM, and 100 models for AlpacaEval 2.0</em>&#8212;to ensure the quality of the IRT model. The LLMs are split into training and test sets using two approaches:</p><ul><li><p><em>Random</em>: randomly sample a subset of LLMs to use for testing.</p></li><li><p><em>Date-based</em>: use the most recent LLMs for testing.</p></li></ul><p>As shown below, the proposed IRT-based estimators perform well across all scenarios considered. With as few as 100 anchor points per sub-domain of the evaluation set&#8212;<em>a reduction of  140&#215; for MMLU and 160&#215; for the Open LLM Leaderboard</em>&#8212;we can estimate performance with less than 2% error. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bB4o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bB4o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 424w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 848w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bB4o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:430805,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bB4o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 424w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 848w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!bB4o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c595ad2-9754-45f8-9c35-11314e274528_1928x1052.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2509.11106">Fluid Language Model Benchmarking</a> [12]</strong></h4><p>Most LLMs are evaluated in a static fashion (i.e., by computing accuracy on a fixed dataset). Whereas raw accuracy treats every evaluation sample equally, IRT estimates a model&#8217;s underlying capabilities, taking into account factors like the difficulty and discrimination of each question. Leveraging this insight, authors in [12] propose an approach called Fluid Benchmarking that uses an IRT model to dynamically select evaluation data for a particular model. The key idea behind this approach is that the value of an evaluation sample depends upon a model&#8217;s capabilities. Instead of assuming there is a single best subset of examples on which to evaluate an LLM, Fluid Benchmarking dynamically selects the most informative evaluation examples for a particular model and, in turn, provides a more accurate estimate of that model&#8217;s performance.</p><blockquote><p><em>&#8220;Fluid benchmarking is based on the insight that the relative value of benchmark items depends on an LM&#8217;s capability level&#8230; a hard question might be too difficult for a weak LM, but informative for a strong LM.&#8221;</em> - from [12]</p></blockquote><p><strong>Unidimensional IRT.</strong> Similarly to before, the approach in [12] fits an IRT model using a dataset of historical evaluation data derived from evaluating a large set of models on a benchmark of interest. However, a different IRT model structure is used in [12]. As shown below, this is again a two-parameter IRT model that is used to predict binary evaluation outcomes, but we use unidimensional&#8212;<em>as opposed to the multidimensional approach used in [11]</em>&#8212;model and item parameters. Authors mention testing a multidimensional IRT approach in [12] but found that this formulation performs poorly compared to a unidimensional IRT model. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m5IS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m5IS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 424w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 848w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 1272w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m5IS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png" width="624" height="189" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:1456,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:183010,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m5IS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 424w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 848w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 1272w, https://substackcdn.com/image/fetch/$s_!m5IS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fa30f58-6bb3-4028-9178-4cde63df84ef_1804x546.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Two-parameter unidimensional IRT model (from [12])</figcaption></figure></div><p>Despite the different IRT model structure used in [12], the purpose of these parameters remains the same:</p><ul><li><p><code>&#952;_l</code>: a scalar parameter that represents the capability of model <code>l</code>.</p></li><li><p><code>&#945;_i</code>: a scalar parameter that captures the discrimination<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> of item <code>i</code>.</p></li><li><p><code>&#946;_i</code>: a scalar bias that represents the difficulty of item <code>i</code>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xr_r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xr_r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 424w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 848w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xr_r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png" width="1340" height="1098" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1098,&quot;width&quot;:1340,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:430686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xr_r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 424w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 848w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!xr_r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b1b0646-b27d-442f-b58c-76fdab4eb08e_1340x1098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>The <strong>Fluid Benchmarking</strong> approach proposed in [12] is depicted above. There are two main phases for obtaining a benchmark result:</p><ul><li><p>An <em>offline (or historical) phase</em>, where we fit item and model parameters in the IRT model from leaderboard-style results on our benchmark.</p></li><li><p>An <em>online phase</em>, where we learn the parameter of a new model given a subset of evaluation results for this model on our benchmark.</p></li></ul><p>The IRT model is initially fit using an offline dataset of evaluation results. Given a new model <code>l&#8217;</code>, we first evaluate this model on a subset of our evaluation set to obtain some training data for the new model parameter <code>&#952;_l&#8217;</code>. Similarly to [11], we then leave item parameters fixed and fit only the new model parameter <code>&#952;_l&#8217;</code> by using the actual evaluation data collected with our new model for training. </p><p>By examining the structure of our IRT model, we can intuitively understand how item parameter values can influence the value of <code>&#952;_l&#8217;</code>. Easy questions have a small (or negative) difficulty parameter <code>&#946;_i</code>, so answering them correctly has minimal impact on <code>&#952;_l&#8217;</code>. On the other hand, correct answers to a difficult question will meaningfully impact the value of <code>&#952;_l&#8217;</code>. The same arguments hold in reverse for incorrectly-answered questions: <em>answering a difficult question incorrectly is not a big deal, but easy questions will impact </em><code>&#952;_l&#8217;</code><em> when answered incorrectly</em>. The value of the discrimination parameter <code>&#945;_i</code> impacts the magnitude of updates to <code>&#952;_l&#8217;</code>. Highly-discriminative items have large values of  <code>&#945;_i</code>, leading them to meaningfully impact the value of <code>&#952;_l&#8217;</code> and vice versa. </p><p><strong>Estimating performance.</strong> Instead of measuring performance with accuracy metrics, Fluid Benchmarking directly uses the value of <code>&#952;_l&#8217;</code> as the performance metric for a model. While accuracy simply captures the ratio of items answered correctly in a benchmark, Fluid Benchmarking asks an inverse question: <em>What capability level of our model is most likely to produce the pattern of incorrect and correct answers we observed?</em> By answering this question, we can estimate performance in a way that meaningfully considers the difficulty and discrimination of each item in our evaluation set. Raw accuracy on a discrete evaluation dataset is a common proxy for measuring model capabilities. However, Fluid Benchmarking [12] forgoes this proxy, instead using IRT to directly estimate model capabilities.</p><div class="pullquote"><p><em>&#8220;IRT draws upon existing LM evaluation results to enrich benchmarks with information about item difficulty and discrimination, which is leveraged to dynamically select items that match an LLM&#8217;s capability level&#8230; This contrasts with&#8230; static benchmarking, which assumes a globally optimal set of evaluation items for all LMs.&#8221; - from [12]</em></p></div><p><strong>Dynamic sampling.</strong> The final detail necessary to understand Fluid Benchmarking is the data selection process. As mentioned previously, we use a subset of real evaluation results to estimate the parameter for a new model <code>&#952;_l&#8217;</code>. These items that are evaluated could be taken from a static evaluation set&#8212;<em>this is a common approach in practice</em>. However, Fluid Benchmarking argues that the set of items used for evaluation should be dynamically selected based on the model. For a weaker model, easier items will be more informative and vice versa. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E-cu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E-cu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 424w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 848w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E-cu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png" width="1456" height="658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:617858,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E-cu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 424w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 848w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!E-cu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa42768e3-2e34-420a-a1b5-4416d5d28ef3_2372x1072.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>Evaluation items are selected in [12] by computing the <a href="https://en.wikipedia.org/wiki/Fisher_information">Fisher information</a> of each item in the dataset. This metric prioritizes items that are most informative for a particular model by considering <em>i)</em> item discrimination and <em>ii)</em> item difficulty with respect to the capability level of the model being evaluated. Notably, the Fisher information changes depending on the capability level of a model. The figure above illustrates changes in the Fisher information during the training process. As the model continues training, it becomes more capable, leading to changes in the Fisher information that prioritize the selection of more difficult examples. </p><p>To select evaluation data based on the Fisher information, authors in [12] propose the following set of steps:</p><ol><li><p>Start with an empty evaluation set.</p></li><li><p>Compute the Fisher information of all items.</p></li><li><p>Select the item with the highest Fisher information.</p></li><li><p>Compute the true evaluation score of this item.</p></li><li><p>Re-fit the model parameter using this new data.</p></li><li><p>Repeat the above steps until your evaluation budget is reached.</p></li></ol><p>While most LLM evaluations are static, Fluid Benchmarking is dynamic&#8212;<em>the data used for evaluation is adapted based on each model being evaluated</em>. Such an approach demonstrates the incredible potential of IRT for both selecting data and measuring performance, as well as its overall versatility as a tool. Notably, a very similar data selection approach is adopted by the more recent <a href="https://arxiv.org/abs/2511.04689">ATLAS paper</a>.</p><p><strong>Does this work? </strong>In [12], authors focus on evaluating model checkpoints during the pretraining process. Six different open LLMs in the 7B parameter range are selected, and checkpoints are taken from these models evenly throughout their training process to arrive at a set of 102 LLMs for fitting the IRT model. All evaluation experiments are performed on the Open LLM leaderboard, which is a composite leaderboard comprised of six different benchmarks. A separate IRT model is fit for each benchmark in the leaderboard. As shown below, Fluid Benchmarking provides a stable and accurate estimate of model capabilities and is found to be effective for a wide range of different evaluation budgets. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ag4q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ag4q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 424w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 848w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 1272w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ag4q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png" width="1456" height="672" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278214,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ag4q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 424w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 848w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 1272w, https://substackcdn.com/image/fetch/$s_!Ag4q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26cadd61-7f6d-41a6-b08a-50e2b85ba743_1802x832.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2601.02316">DatBench: Discriminative, Faithful, and Efficient VLM Evaluations</a> [10]</strong></h4><blockquote><p><em>&#8220;We identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, do not represent downstream use-cases, and saturate early as models improve; (ii) blindly-solvable questions which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets.&#8221;</em> - from [10]</p></blockquote><p>Most popular <a href="https://cameronrwolfe.substack.com/p/vision-llms">Vision-Language Model (VLM)</a> benchmarks have limitations that make research and progress difficult. Problems with these benchmarks include:</p><ul><li><p>Data quality issues (e.g., incorrect labels or low-resolution images) that make solving certain questions overly difficult or impossible.</p></li><li><p>Blindly-solvable questions that can be solved purely based upon text priors without using the actual image.</p></li><li><p>Multiple choice questions that are easily reward hacked via guessing and do not much the generative style in which most VLMs are deployed.</p></li></ul><p>Beyond these issues, the evaluation process alone is beginning to consume non-negligible compute for most models. LLM research is empirical, <em>and as much as 20% (or even more) of total model development costs can be spent running evaluations</em>. Based on this trend, we want to avoid wasted compute and ensure that the data in these benchmarks is actually useful for discerning model capabilities. Authors in [10] aim to solve these issues by developing and applying a targeted data curation approach over a wide set of VLM benchmarks to create DatBench, a composite benchmark that prioritizes high signal evaluation examples for VLMs. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-egb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-egb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 424w, https://substackcdn.com/image/fetch/$s_!-egb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 848w, https://substackcdn.com/image/fetch/$s_!-egb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 1272w, https://substackcdn.com/image/fetch/$s_!-egb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-egb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png" width="1456" height="417" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:417,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122896,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-egb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 424w, https://substackcdn.com/image/fetch/$s_!-egb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 848w, https://substackcdn.com/image/fetch/$s_!-egb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 1272w, https://substackcdn.com/image/fetch/$s_!-egb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9889c573-ca6e-464e-a5b9-3dcafd0b0a09_1744x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p><strong>Source data.</strong> The curation process in [10] begins with a large set of 33 evaluation datasets for VLMs that span the capability groups depicted above. A set of 27 state-of-the-art models ranging from 1-10B parameters is evaluated over these datasets, yielding a dataset of model evaluation results to use for data curation. From here, DatBench is constructed via a multi-step filtering process:</p><ol><li><p>Converting multi-choice questions into a generative format.</p></li><li><p>Removing blind-solvable questions.</p></li><li><p>Filtering examples with incorrect or ambiguous ground truth.</p></li><li><p>(Optional) Identifying examples that yield maximum discrimination.</p></li></ol><p>The last step of the pipeline is optional but can be used to sample a smaller amount of data that retains the ability to detect differences in model capabilities. Two different evaluation suites are created in [10]&#8212;<em>DatBench and DatBench-Full</em>&#8212;that cover distinct evaluation modes:</p><ul><li><p>High-efficiency evaluation over a subset of data for rapid iteration.</p></li><li><p>High-quality evaluation over all data for cases with relaxed computational constraints and a need for better coverage.</p></li></ul><p>For example, DatBench is most useful for ablation experiments, as we can lower inference costs and run faster experiments while still providing a useful capability signal. On the other hand, DatBench-Full can be used for final model reporting, which is run less often but requires comprehensively capturing the performance of a model. We will now outline each of the above curation steps in more detail.</p><p><strong>Multiple choice to generative conversion. </strong>Practically, most VLMs are used in a generative fashion, where users ask questions to a model and the model generates a response for the user. However, many benchmarks used to evaluate VLMs ask question in a multiple choice format. Such a format can artificially inflate VLM performance due to the random guessing and the fact that selecting an answer is generally easier than generating that same answer from scratch. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zvE3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zvE3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 424w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 848w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 1272w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zvE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png" width="1456" height="731" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:268920,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zvE3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 424w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 848w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 1272w, https://substackcdn.com/image/fetch/$s_!zvE3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c6d24e0-4a81-4bc8-ad41-470742c688cb_1734x870.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>DatBench reformulates multiple choice questions into a generative format where the VLM generates an answer that is verified against a ground truth answer using an LLM judge. In cases where multiple choice is structurally necessary, authors in [10] rely upon a <a href="https://arxiv.org/abs/2307.06281">circular evaluation approach</a>. As shown in the figure above, converting multiple choice questions into a generative format leads to a noticeable drop in model performance, <em>indicating that generative evaluation is harder for current VLMs and better reflects the current state of model capabilities.</em></p><p><strong>Removing blind-solvable questions.</strong> One key insight from [10] is the fact that a surprising number of VLM evaluation samples can be solved without using any visual data; see below. Models can rely upon language priors to solve questions (or provide a high-probability guess), thus inflating the performance of VLMs with strong language backbones. To identify these cases, we can re-run evaluation while removing image inputs to identify those that are blind-solvable. In [10], the entire suite of 27 models is run in a blind fashion, and any questions that can be solved by at least one model are removed. Though this filtering approach is aggressive, the likelihood of a correct blind answer in a generative setup is relatively low, and the curation process begins with a large source dataset. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xGgZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xGgZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 424w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 848w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xGgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png" width="1456" height="849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78827448-2f98-4313-b786-ec55652771c4_1752x1022.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:849,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:242294,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xGgZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 424w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 848w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!xGgZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78827448-2f98-4313-b786-ec55652771c4_1752x1022.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><div class="pullquote"><p><em>&#8220;In the first stage, we flag examples that all evaluated models answer incorrectly. Unanimous failure across a diverse suite of models typically indicates either a data quality issue or a genuinely difficult frontier case, both of which warrant closer inspection. In the second stage, a strong VLM judge (GPT-5.2) verifies each flagged sample with access to the ground-truth answer as privileged information.&#8221; - from [10]</em></p></div><p><strong>Quality filtering.</strong> A two-stage pipeline is used in [10] to identify incorrect, low quality, and ambiguous evaluation data; see below. In the first stage, we flag any evaluation examples that are not solved by any model in the suite. These samples are usually either <em>i)</em> a data quality issue or <em>ii)</em> a valid frontier evaluation case. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ML6m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ML6m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 424w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 848w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 1272w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ML6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png" width="1456" height="1060" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:951070,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ML6m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 424w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 848w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 1272w, https://substackcdn.com/image/fetch/$s_!ML6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe694a57d-c2c4-4d6d-b5c7-ac89c337e5d5_1734x1262.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>To differentiate between these cases, we perform a second stage of filtering based upon a frontier-level VLM judge. In this stage, every flagged example is passed through the judge to determine whether it is correct an unambiguous. Such an approach is reliant upon the <a href="https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law">asymmetry of verification</a> (i.e., verifying a provided solution to a problem should be easier than generating a valid solution). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!teAi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!teAi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 424w, https://substackcdn.com/image/fetch/$s_!teAi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 848w, https://substackcdn.com/image/fetch/$s_!teAi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!teAi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!teAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png" width="1456" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247149,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!teAi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 424w, https://substackcdn.com/image/fetch/$s_!teAi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 848w, https://substackcdn.com/image/fetch/$s_!teAi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!teAi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16cd7f5f-5f18-4935-8905-b88b0b1418cb_1736x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>In an effort to prioritize quality over quantity, any data identified as ambiguous, incorrectly labeled, or unsolvable due to insufficient image resolution is removed. As shown above, this stringent filtering policy results in relatively high ratios of discarded data in certain domains. For example, over 42% of the spatial reasoning data is removed from DatBench due ambiguity or issues with data quality</p><p><strong>Discriminative selection.</strong> Given increasing costs of evaluation, we would like to sample an evaluation subset to reduce costs without degrading discriminability&#8212;<em>or the ability to identify differences in performance</em>. One common approach is to sub-select evaluation samples while optimizing for rank correlation to find a smaller evaluation dataset that ranks models in the same way. However, this approach is prone to overfitting on a particular evaluation suite. An evaluation subset can preserve model rankings while still having noisy data that does not genuinely capture difference in model capabilities&#8212;<em>we prioritize model rankings without deeply capturing the kind of data that is actually being selected</em>. </p><blockquote><p><em>&#8220;The core optimization problem is not merely to maintain ranking stability, but to maximize total discrimination. By ensuring every sampled example possesses high discriminative power, we can implicitly guarantee robust ranking while maximizing the information content per inference token.&#8221;</em> - from [10]</p></blockquote><p>Authors in [10] propose a solution to these problems that is based upon IRT. Directly applying IRT to VLM evaluation would work poorly, as we do not have enough data. Specifically, each data point would need to be evaluated with hundreds of different models in order to fit a stable IRT model. We do not have anywhere near this amount of data&#8212;<em>only 27 models are used in [10] and getting access to hundreds of state-of-the-art VLMs would be very difficult (if not impossible)</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwbS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fwbS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 424w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 848w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fwbS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png" width="1456" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:282416,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fwbS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 424w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 848w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fwbS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F674078bf-fab9-43bb-a2d0-bb4a62a51339_2036x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Point-biserial correlation</figcaption></figure></div><p>Instead of directly using IRT, data in [10] is selected based on information density, as captured by the point-biserial correlation (<code>r_pb</code>); see above. Computed per evaluation example, <code>r_pb</code> captures the relationship between scores on a single data point and global performance. As explained in [10]: <em>&#8220;An item with high </em><code>r_pb</code><em> is one that strong models consistently answer correctly and weak models consistently miss; conversely, a low or negative </em><code>r_pb</code><em> indicates a noisy item.&#8221;</em> The left term in the above equation captures the relative difference in global performance of models that get a given data point correct or incorrect, while the right term captures the ratio of models that get the data point correct or incorrect. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yXHp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yXHp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 424w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 848w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 1272w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yXHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png" width="1456" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288633,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yXHp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 424w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 848w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 1272w, https://substackcdn.com/image/fetch/$s_!yXHp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F442dfc48-fb21-4a61-8fae-34a5fed87a56_1736x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>We select evaluation data in [10] by prioritizing examples with high <code>r_pb</code> per domain. To measure the total discriminative power of an evaluation subset, we can divide the total sum of selected <code>r_pb</code> score by the sum of <code>r_pb</code> scores across all data. As shown above, selecting data based upon <code>r_pb</code> allows us to preserve 90% of total discriminability with only 40% of the data, whereas rank correlation metrics saturate almost immediately. Interestingly, we also see that selecting all data is not optimal from the perspective of discriminative power. Noisy data (i.e., with low or negative <code>r_pb</code>) is left until the end of the selection process in [10].</p><p>The IRT-inspired approach is used to select 80% of evaluation data in [10], while the final 20% is manually reserved for frontier examples with low discriminative power. Namely, there exists a subset of evaluation data that has been validated by the LLM judge but is not answered correctly by any model. Any example in this subset will receive a low <code>r_pb</code> score because of the low ratio of correct model responses. However, such data captures legitimate frontier evaluation scenarios that should not be completely ignored within our evaluation dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUWV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUWV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 424w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 848w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 1272w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png" width="1358" height="1574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1574,&quot;width&quot;:1358,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:550493,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gUWV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 424w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 848w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 1272w, https://substackcdn.com/image/fetch/$s_!gUWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafea70c0-9cc9-4a8a-ae86-8d5d4e18bc91_1358x1574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p><strong>Key findings.</strong> Evaluation results on both DatBench and the original benchmarks are plotted above. Results on DatBench have a larger performance spread relative to those of the original benchmarks. For example, scores on general benchmarks range from 10-65% for DatBench versus 65-80% for original benchmarks, showing that DatBench mitigates benchmark saturation. In fact, just converting multiple choice questions to a generative format causes as much as a 35% performance drop. DatBench is found to yield a 13&#215; speedup in the evaluation process while roughly matching the discriminative power of the original benchmarks. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ho0i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ho0i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 424w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 848w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ho0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png" width="1456" height="691" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:691,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224750,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/190515363?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ho0i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 424w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 848w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 1272w, https://substackcdn.com/image/fetch/$s_!Ho0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f5d9fde-8d7e-4300-a026-4b107524369a_1820x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>We can also repurpose the evaluation artifacts created by the DatBench pipeline to diagnose common failure modes of VLMs. Specifically, authors in [10] make the following observations:</p><ul><li><p>A tradeoff between perception and reasoning exists in VLMs. Models that perform well on higher-level semantic processing tasks have degraded low-level perceptual fidelity. Finding a model that balances performance on both semantic and perceptual tasks is rare. </p></li><li><p>An &#8220;overthinking&#8221; problem exists within current VLMs, meaning that significantly fewer tokens are used when answering questions correctly versus incorrectly; see above. This problem is especially pronounced in reasoning models, where we see that an average length of a correct and incorrect response is 425.2 and 1,196.9 tokens, respectively. </p></li><li><p>The dependence of VLMs upon language priors, which can be measured via the performance difference between normal and blind evaluation, varies per capability; see below. For example, counting and grounding rely heavily upon vision information, but math and spatial reasoning are found to rely more upon language priors to guess a correct answer. </p></li></ul><p>Although many VLM benchmarks are shown to be noisy and inflated in [10], we can learn a lot about current state-of-the-art by addressing these problems and selecting evaluation data that accurately captures model performance. Once we can identify shortcomings in performance (e.g., overthinking and perceptual gaps), improving model capabilities in these specific areas is much easier. </p><h2>Keys to Creating a Useful Benchmark</h2><p>We have studied a wide variety of LLM benchmarks and evaluation techniques in this overview. Given the many practical details peppered throughout the papers we have seen, we can gain a lot by considering the common concepts that continually arise across disparate benchmarks. By identifying these trends, we can (hopefully) identify key design principles for making a useful benchmark.</p><p><strong>Domain Taxonomy.</strong> Most popular LLM benchmarks categorize their data into a fixed set of domains and sub-domains. By doing this, we make it easier to debug an LLM&#8217;s performance, as we can compute domain-level metrics within the benchmark. Additionally, organizing a benchmark into such a taxonomy naturally ensures that data is diverse and covers a decent breadth of topics. Leveraging a taxonomy can also make the evolution of a benchmark simpler over time by granularly measuring saturation at a domain level and enabling researchers to individually evolve each domain (e.g., as in BIG-Bench Extra Hard). </p><p><strong>Human annotation.</strong> Despite the prevalence of synthetic data within LLM research, nearly all successful evaluation benchmarks rely on human experts to annotate data in some way. Some benchmarks begin with questions written by human experts (e.g., FrontierMath), while others leverage human opinions to measure question difficulty or accuracy (e.g., GPQA). Even when synthetic data is being used, human verification of data quality is usually helpful (e.g., IFEval and IFBench). In fact, review by human experts is even used in some cases to improve the quality of large-scale data obtained from noisy sources (e.g., crowdsourcing). Even today, <em>manual inspection is one of the most effective tools for LLM evaluation.</em></p><p><strong>Model-in-the-loop.</strong> Although humans play a massive role in the evaluation process, augmenting human efforts with an LLM can be beneficial. For example, LLMs are often used for difficulty filtering by simply identifying the questions that they get wrong. Additionally, trends in model performance allow us to fit IRT models and even identify less informative subsets of data (e.g., blind-answerable data in DatBench). Model-based approaches are helpful for identifying areas of a benchmark that may contain mistakes that can be routed to human review. We can also use LLMs to efficiently generate or reformat evaluation data that is later verified by a human annotator (e.g., MMLU-Pro adopts such a strategy). </p><p><strong>Data quality.</strong> The best evaluation benchmarks tend to pull from high-quality data sources. For example, popular math benchmarks include questions that are taken directly from recognized math competitions, and reasoning benchmarks like BIG-Bench are sourced from vetted sources such as other proven datasets (as in BIG-Bench Extra Hard) or questions that have been extensively verified with human review (as in the original BIG-Bench). In fact, manually written questions from human experts are another commonly-used source of evaluation data, but we must implement measures to ensure data quality. The GPQA curation pipeline is a great example of an effective system for ensuring data quality and difficulty. </p><p><strong>Realistic.</strong> Benchmarks are an imperfect proxy for measuring what we actually care about: <em>the capabilities of an LLM</em>. Depending on the questions that it tests, a benchmark may or may not accurately reflect the true performance of an LLM in the real world. Ideally, we want our benchmark to accurately capture an LLM&#8217;s true capabilities on a given task. To achieve this, we should make sure that evaluation data is as realistic as possible. One great example of how to achieve this goal is <a href="https://arxiv.org/abs/2603.24477">CursorBench</a>, a coding benchmark that directly sources evaluation data from real coding agent sessions in Cursor and constantly releases new benchmark versions to better capture recent trends in agent usage.</p><p><strong>Evolution.</strong> The capabilities of frontier-level LLMs are advancing rapidly, which can lead to benchmark saturation. In order to remain relevant, a good benchmark must evolve (and improve) over time. One of the best examples of this trend is BIG-Bench, which was already saturated less than a year after its initial release. Instead of simply allowing the benchmark to become irrelevant, improved versions were consistently released, such as BIG-Bench Hard and BIG-Bench Extra Hard. Many datasets can remain relevant and useful if we are willing to adjust the difficulty and scope of the benchmark as LLMs improve.</p><h4><strong>New to the newsletter?</strong></h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Staff Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Hendrycks, Dan, et al. &#8220;Measuring massive multitask language understanding.&#8221; <em>arXiv preprint arXiv:2009.03300</em> (2020).</p><p>[2] Wang, Yubo, et al. &#8220;Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.&#8221; <em>Advances in Neural Information Processing Systems</em> 37 (2024): 95266-95290.</p><p>[3] Gema, Aryo Pradipta, et al. &#8220;Are we done with mmlu?.&#8221; <em>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</em>. 2025.</p><p>[4] Rein, David, et al. &#8220;Gpqa: A graduate-level google-proof q&amp;a benchmark.&#8221; <em>First conference on language modeling</em>. 2024.</p><p>[5] Srivastava, Aarohi, et al. &#8220;Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.&#8221; <em>Transactions on machine learning research</em> (2023).</p><p>[6] Suzgun, Mirac, et al. &#8220;Challenging big-bench tasks and whether chain-of-thought can solve them.&#8221; <em>Findings of the Association for Computational Linguistics: ACL 2023</em>. 2023.</p><p>[7] Kazemi, Mehran, et al. &#8220;Big-bench extra hard.&#8221; <em>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>. 2025.</p><p>[8] Zhou, Jeffrey, et al. &#8220;Instruction-following evaluation for large language models.&#8221; <em>arXiv preprint arXiv:2311.07911</em> (2023).</p><p>[9] Pyatkin, Valentina, et al. &#8220;Generalizing verifiable instruction following.&#8221; <em>arXiv preprint arXiv:2507.02833</em> (2025).</p><p>[10] Joshi, Siddharth, et al. &#8220;DatBench: Discriminative, Faithful, and Efficient VLM Evaluations.&#8221; <em>arXiv preprint arXiv:2601.02316</em> (2026).</p><p>[11] Polo, Felipe Maia, et al. &#8220;tinyBenchmarks: evaluating LLMs with fewer examples.&#8221; <em>arXiv preprint arXiv:2402.14992</em> (2024).</p><p>[12] Hofmann, Valentin, et al. &#8220;Fluid language model benchmarking.&#8221; <em>arXiv preprint arXiv:2509.11106</em> (2025).</p><p>[13] Dubois, Yann, et al. &#8220;Length-controlled alpacaeval: A simple way to debias automatic evaluators.&#8221; <em>arXiv preprint arXiv:2404.04475</em> (2024).</p><p>[14] Vivek, Rajan, et al. &#8220;Anchor points: Benchmarking models with much fewer examples.&#8221; <em>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</em>. 2024.</p><p>[15] Xu, Cong, et al. &#8220;Data efficient evaluation of large language models and text-to-image models via adaptive sampling.&#8221; <em>arXiv preprint arXiv:2406.15527</em> (2024).</p><p>[16] Perlitz, Yotam, et al. &#8220;Efficient benchmarking (of language models).&#8221; <em>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</em>. 2024.</p><p>[17] Kipnis, Alex, et al. &#8220;metabench--A Sparse Benchmark of Reasoning and Knowledge in Large Language Models.&#8221; <em>arXiv preprint arXiv:2407.12844</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See page 15 of <a href="https://arxiv.org/abs/2009.03300">the MMLU paper</a> [1] for an itemized list of all 57 tasks. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See pages 5-6 of <a href="https://arxiv.org/abs/2311.12022">the GPQA paper</a> [5] for a list of all sub-domains. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here, we interpret the probability score assigned by the model to a certain multiple choice answer option as the model&#8217;s confidence in that option. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A full list of filtering criteria and associated rationales can be found in Appendix D on Page 48 of the <a href="https://arxiv.org/abs/2210.09261">BIG-Bench Hard paper</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The paper does not explicitly state how the base prompts are sourced. Authors just mention that they <em>&#8220;generate a set of base prompts&#8221;</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The level of overfitting on IFEval might also be caused by the simple fact that this benchmark is constantly tested by model developers as new models are being created. Therefore, new models are naturally selected based on their performance on this benchmark (and other popular benchmarks). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>An item that is more discriminative creates a separation between stronger and weaker models. This is an item that, if answered correctly, indicates that a model is capable.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Applying Statistics to LLM Evaluations]]></title><description><![CDATA[Most LLM evaluations are conducted without a deep consideration of statistics.]]></description><link>https://cameronrwolfe.substack.com/p/stats-llm-evals</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/stats-llm-evals</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 09 Mar 2026 09:33:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a61c723-9423-49ea-af11-85b7ed56b342_2498x1404.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qOla!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qOla!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 424w, https://substackcdn.com/image/fetch/$s_!qOla!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 848w, https://substackcdn.com/image/fetch/$s_!qOla!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 1272w, https://substackcdn.com/image/fetch/$s_!qOla!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qOla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:957003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qOla!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 424w, https://substackcdn.com/image/fetch/$s_!qOla!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 848w, https://substackcdn.com/image/fetch/$s_!qOla!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 1272w, https://substackcdn.com/image/fetch/$s_!qOla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76e07c50-74fc-40a7-93ba-eafaf798c8b7_2487x1397.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 3])</figcaption></figure></div><p>Research on large language models (LLMs) is empirically driven. For this reason, model evaluations play a pivotal role in the field&#8217;s progress. We improve models by making changes, evaluating them, and iterating. Despite their foundational role, however, evaluations are usually handled in a naive manner. In most cases, we just test a model&#8217;s performance over a finite evaluation dataset and directly compare performance metrics to those of other models with no consideration for whether these results are statistically significant or not. Such an approach leads to incorrect or misleading interpretations of evaluation results. As researchers, <em>we want to avoid mistaking noise for progress and instead equip ourselves with the statistical tools needed to run informative model evaluations.</em></p><blockquote><p><em>&#8220;Language models are measured in the literature by evaluations, or evals. Evals are commonly run and reported with a highest number is best mentality; industry practice is to highlight a state-of-the-art result in bold, but not necessarily to test that result for any kind of statistical significance.&#8221; </em>- from [1]</p></blockquote><p>In this overview, we will build a statistical foundation for LLM evaluations from the ground up. To begin, we will review basic statistical ideas with a practical focus on the topics that are most useful for model evaluations. We will then take a deeper look at how these ideas can be directly used to interpret LLM evaluation results in an uncertainty-aware manner. Specifically, we will cover a set of statistical best practices for model evaluation and implement each of them to show how they can be concretely applied. Although it may seem daunting, taking a statistically grounded approach to model evaluation is not especially difficult and can help us make faster progress by avoiding spurious results.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 65,000 others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Basic Statistics for LLM Evaluations</h2><p>In order to develop a statistical framework for LLM evaluations, we need to first learn about the fundamental tools from statistics that can be used to create such a framework. This section will cover a selection of topics related to the properties of random variables, such as computing the mean or variance and constructing a confidence interval. After covering the fundamentals, we learn how these ideas can be applied to properly analyze LLM evaluation results in the next section. </p><h4>Random Variables and Estimators</h4><p>A random variable <strong>X</strong> is defined as a quantity that has a value dependent upon chance. We can take several independent samples from the distribution <code>{x_1, x_2, &#8230;, x_n}</code>, and the values of these observations will be sampled from the distribution of <code>X</code> (i.e., <code>x_i ~ X</code>). We define the mean (or average) of this random variable via the <a href="https://en.wikipedia.org/wiki/Expected_value">expectation</a>, which can be computed in a continuous or discrete fashion as shown in the figure below. Additionally, we can compute a sample mean by averaging the values of <code>n</code> observations sampled from the distribution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Vx9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Vx9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 424w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 848w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 1272w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_Vx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png" width="602" height="341.9326923076923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:602,&quot;bytes&quot;:247250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Vx9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 424w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 848w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 1272w, https://substackcdn.com/image/fetch/$s_!_Vx9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9637041-f4cb-4f82-b6b9-8c1ce654c096_1680x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Mean and sample mean</figcaption></figure></div><p>Formally, the lower case letters <code>x_i</code> represent concrete values sampled from a distribution, while upper case letter <code>X_i</code> denotes the <code>i</code>-th random variable in our sample&#8212;<em>this is a notational detail, but it&#8217;s worth covering to avoid confusion</em>. For example, if we evaluate our LLM on <code>n</code> questions, <code>X_i</code> is a random variable that represents the distribution of possible scores for question <code>i<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></code>, while <code>x_i</code> is an actual evaluation score observed for a single evaluation run. We can also define the sample mean in terms of random variables as shown in the equation below. We use an uppercase <code>X&#772;</code> in this case because we are defining a random variable.  </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R5WK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R5WK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 424w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 848w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 1272w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R5WK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png" width="201" height="99.3956043956044" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1274,&quot;resizeWidth&quot;:201,&quot;bytes&quot;:129288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R5WK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 424w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 848w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 1272w, https://substackcdn.com/image/fetch/$s_!R5WK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F470b04f5-c0cb-422a-b973-b7070547ded0_1274x630.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Sample mean with random variables</figcaption></figure></div><p>The distribution of our random variable <code>X</code> also has variance <code>Var(X)</code>, which describes how &#8220;spread out&#8221; the distribution is around the mean. In this overview, we will assume that this variance is finite (i.e., less than infinity). If we have a distribution with high variance, then samples taken from this distribution will be more spread out around the mean and vice versa; see below for an illustration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9L9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9L9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 424w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 848w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 1272w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9L9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png" width="989" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f61bf6c4-b793-4403-818b-497765b377dc_989x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:989,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h9L9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 424w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 848w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 1272w, https://substackcdn.com/image/fetch/$s_!h9L9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61bf6c4-b793-4403-818b-497765b377dc_989x590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The expression for <code>Var(X)</code> is provided below. Similarly to the sample mean, we can also estimate variance using a fixed set of samples from our distribution <code>X</code>&#8212;<em>this is how the variance is usually computed in practical settings</em>. We can also compute the standard deviation <code>&#963;</code> by taking the square root of the variance. The variance and standard deviation describe the variability of individual samples from <code>X</code>.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Amtk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Amtk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 424w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 848w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 1272w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Amtk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png" width="1456" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131478,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Amtk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 424w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 848w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 1272w, https://substackcdn.com/image/fetch/$s_!Amtk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124abb6d-407c-41a0-82fe-c721da2953c0_1797x587.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Variance and standard deviation</figcaption></figure></div><p>While variance measures the variability of a single random variable <code>X</code>, <strong>covariance</strong> measures how two random variables <code>X</code> and <code>Y</code> vary together. Intuitively, if these variables vary in the same direction (e.g., they are both above or below their means at the same time), then their covariance will be positive and vice versa. A covariance near zero indicates there is no clear relationship between <code>X</code> and <code>Y</code>. We can also compute a sample covariance similarly to the sample variance shown above. Expressions for covariance and sample covariance are provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pWt0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pWt0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 424w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 848w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 1272w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pWt0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png" width="689" height="251.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:1456,&quot;resizeWidth&quot;:689,&quot;bytes&quot;:142290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pWt0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 424w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 848w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 1272w, https://substackcdn.com/image/fetch/$s_!pWt0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faca05d5a-a8a6-4679-8e0c-db181a3e970d_1750x640.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Covariance and sample covariance</figcaption></figure></div><p><strong>The law of total variance</strong> is a useful identity that decomposes the variance of a random variable <code>X</code> with respect to another random variable <code>Y</code>; see below. For the purposes of this overview, this law is useful because it lets us separate multiple sources of randomness in an evaluation result. Later, we will use it to decompose the variance of an evaluation score into two key components:</p><ol><li><p>Variability due to the question sampled for evaluation.</p></li><li><p>Within-question variability arising from stochastic generation by the LLM or an LLM judge.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tm_G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tm_G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 424w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 848w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 1272w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tm_G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png" width="520" height="34.285714285714285" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:96,&quot;width&quot;:1456,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:125768,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tm_G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 424w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 848w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 1272w, https://substackcdn.com/image/fetch/$s_!tm_G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29f7533-7f57-4859-8ba3-f4d43b88a3ef_2459x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The law of total variance</figcaption></figure></div><h4>Standard Error and Sample Means</h4><p>If we repeatedly draw samples from <code>X</code> and compute the sample mean, we will get a different result every time. The resulting sample means form a sampling distribution (i.e., basically a list of sample means we have drawn). The standard deviation of this sampling distribution is called the standard error of the sample mean. Put simply, the standard error is just the standard deviation over sample means. While the standard deviation captures variability in individual data points <code>x_i</code> sampled from <code>X</code>, the standard error captures variability in the sample mean estimator (i.e.,<em> </em>the spread of sample means after computing it multiple times with different samples). A formal definition of the standard error is provided below, as well as an estimator for the standard error that uses sample standard deviation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> because the true value of <code>&#963;</code> is rarely known in practice and must be estimated. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1BUW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1BUW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 424w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 848w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 1272w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1BUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png" width="1456" height="287" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:287,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1BUW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 424w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 848w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 1272w, https://substackcdn.com/image/fetch/$s_!1BUW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe10133-aa64-4c67-a71b-891cda9d2785_1841x363.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard error</figcaption></figure></div><p>This standard error equation makes the assumption that samples drawn from <code>X</code> are <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">independent and identically distributed (IID)</a>. Independence implies that <code>Cov&#8289;(X_i,X_j) = 0</code> for <code>i&#8800;j</code>, and identical distribution implies that each <code>X_i</code>&#8203; has the same variance <code>Var(X)</code>. From this assumption and a few other properties of the variance, we can derive the above expression for the standard error as shown below. The assumption of IID samples is not always satisfied&#8212;<em>we should only use this expression when the samples being drawn are truly independent</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0uaj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0uaj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 424w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 848w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0uaj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png" width="1456" height="953" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:953,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0uaj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 424w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 848w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!0uaj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffac4926a-e9a9-4b85-a3eb-47d7a327515c_2148x1406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Full derivation of standard error (SE) expression</figcaption></figure></div><p>Within this derivation, we use the variance of a sum identity, which can be generally expressed as shown in the equation below. This identity allows us to capture the (non-zero) covariance terms within our variance expression.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ca53!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ca53!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 424w, https://substackcdn.com/image/fetch/$s_!ca53!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 848w, https://substackcdn.com/image/fetch/$s_!ca53!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 1272w, https://substackcdn.com/image/fetch/$s_!ca53!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ca53!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png" width="518" height="74" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:1456,&quot;resizeWidth&quot;:518,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ca53!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 424w, https://substackcdn.com/image/fetch/$s_!ca53!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 848w, https://substackcdn.com/image/fetch/$s_!ca53!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 1272w, https://substackcdn.com/image/fetch/$s_!ca53!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44425a1d-bf2e-4b12-a01f-e1746e98b362_1902x272.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Variance of a sum identity</figcaption></figure></div><p><strong>Bernoulli variables.</strong> Let&#8217;s assume <code>X</code> is a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli random variable</a>, meaning that our scores are binary <code>x_i &#8712; {0, 1}</code>. In this case, our standard error expression can be simplified even further. To begin, we know that  <code>E[X] = 1&#215;P(X=1) + 0&#215;P(X=0) = P(X=1)</code>. Given that the values of X are either zero or one, it is also true that <code>E[X^2] = E[X]</code> because <code>x^2 = x</code> when <code>x = 0</code> or <code>x = 1</code>. </p><p>We can easily plug these two identities into our prior expression for the variance <code>Var(X) = E[X^2] - (E[X])^2 = Pr(X=1) - (Pr(X=1))^2 = &#956;(1 - &#956;)</code>, where &#956; is the mean of <code>X</code>. Practically, we can estimate <code>&#956;</code> by taking a sample mean X&#772;. Then, we can plug this simplified <code>Var(X)</code> into our previous formula for the standard error, yielding the simplified expression shown below. Therefore, we can use this simpler standard error expression when the values of <code>X</code> are binary.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RtcD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RtcD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 424w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 848w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 1272w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RtcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png" width="1456" height="417" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:417,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RtcD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 424w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 848w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 1272w, https://substackcdn.com/image/fetch/$s_!RtcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd555e407-8939-4b7d-b57a-565eadeaea9d_1779x510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard error of Bernoulli variable</figcaption></figure></div><h4>Law of Large Numbers and the Central Limit Theorem (CLT)</h4><p>The law of large numbers is a fundamental concept in statistics that builds upon our prior definition of the sample mean. Given a random variable <code>X</code>, we are often interested in its true mean &#956;. This mean can be estimated with the sample mean over <code>n</code> samples, but this is a random estimate that can differ from &#956;. The law of large numbers tells us that as the value of <code>n</code> increases, the sample mean will approach (i.e., <a href="https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_probability">converge in probability</a>) the true mean &#956;; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MNQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MNQH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 424w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 848w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 1272w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MNQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png" width="438" height="179.20457142857143" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3364743b-192e-4002-ac48-8de363a860e0_1750x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:716,&quot;width&quot;:1750,&quot;resizeWidth&quot;:438,&quot;bytes&quot;:177594,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8eba5d-5e54-48f1-b664-7f8b6478dc19_1750x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MNQH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 424w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 848w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 1272w, https://substackcdn.com/image/fetch/$s_!MNQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3364743b-192e-4002-ac48-8de363a860e0_1750x716.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Expression for the law of large numbers</figcaption></figure></div><p>The law of large numbers only tells us that the sample mean will eventually settle around &#956; with sufficiently large <code>n</code>. It does not tell us how much the sample mean differs from the true mean at finite <code>n</code> or how quickly we converge to &#956; as <code>n</code> increases. We can express the intuition for the law of large numbers as follows: <em>with enough data, our estimator (i.e., the sample mean) approaches the true mean.</em></p><p><strong>Standardization and z-score.</strong> Given a random variable <code>X</code> (or a realized value <code>x</code>), we can <a href="https://en.wikipedia.org/wiki/Standard_score">standardize</a> by subtracting the mean <code>&#956;</code> and dividing by the standard deviation <code>&#963;</code>; see below. This process produces a standardized random variable <code>Z</code> (or a realized value <code>z</code>). The z-score <code>z</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> indicates how many standard deviations&#8212;<em>in units of &#963;</em>&#8212;the value <code>x</code> lies above (<code>z &gt; 0</code>) or below (<code>z &lt; 0</code>) the mean &#956;.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DXuE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DXuE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 424w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 848w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 1272w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DXuE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png" width="435" height="146.99175824175825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:435,&quot;bytes&quot;:130724,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DXuE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 424w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 848w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 1272w, https://substackcdn.com/image/fetch/$s_!DXuE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F387f8bc3-f2e2-4856-8441-a5fc340afcf6_1492x504.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Any variable or value can be standardized in this way. For example, we will next standardize the sample mean while formulating the Central Limit Theorem.</p><p>The <strong>Central Limit Theorem (CLT) </strong>goes beyond the law of large numbers by describing how our sample mean estimates will be distributed around the true mean &#956;. Our random variable <code>X</code> has a mean of &#956;, and we estimate this mean with a sample mean. We know from our prior derivation that this sample mean has a variance of <code>&#963;^2 / n</code> (assuming IID random variables and finite variance <code>&#963;^2</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>). </p><p>Using this mean and variance, we can standardize the sample mean to obtain <code>Z_n</code> by subtracting the mean and dividing by the standard error; see below. The denominator of <code>Z_n</code> is our previous equation for standard error&#8212;<em>this is just the standard deviation of our sample mean</em>! We rarely know the actual value of <code>&#963;</code>, so we can estimate the true value with the sample standard deviation <code>s</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pI2E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pI2E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 424w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 848w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 1272w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pI2E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png" width="508" height="328.66483516483515" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:942,&quot;width&quot;:1456,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:266092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pI2E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 424w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 848w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 1272w, https://substackcdn.com/image/fetch/$s_!pI2E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde20240-4ffd-46c4-be46-4fd84e8e576e_1618x1047.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Central Limit Theorem (CLT)</figcaption></figure></div><p>The CLT tells us that the distribution of <code>Z_n</code> will converge to a standard <a href="https://en.wikipedia.org/wiki/Normal_distribution">normal distribution</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>&#8212;<em>meaning a normal distribution with a mean of zero and variance of one</em>&#8212;as the value of <code>n</code> increases. Stated differently, this means that the distribution of our sample mean becomes approximately normal with sufficiently large <code>n</code>, as shown in the orange distribution above. From this information, we know that the standard deviation of the sample mean distribution will decrease proportionally to <code>1 / sqrt(n)</code> and the error of our sample mean estimate is on the order of <code>&#963; / sqrt(n)</code>&#8212;<em>the standard deviation of the above distribution</em>. </p><h4>Confidence Intervals</h4><p>Consider a random variable <code>X</code> with a true mean &#956; that we estimate with the sample mean <code>X&#772;_n</code> computed from <code>n</code> samples. To quantify the uncertainty of this estimate, we can next compute a 95% <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence interval</a> that has the following form: <code>x&#772;_n &#177; y</code>. This confidence interval indicates that if we repeated the sampling procedure many times and recomputed this confidence interval each time, 95% of the resulting confidence intervals would contain the true mean <code>&#956;</code>. Our goal is to find the value of <code>y</code> that statistically yields such a 95% confidence interval. To find a formula that allows us to compute this confidence interval, we actually need to combine all of the ideas we have learned so far.</p><p>First, let&#8217;s consider our sample mean estimator <code>X&#772;_n</code>. Assuming IID samples with finite variance, we know from the CLT that this estimator has an approximately normal distribution <code>N(&#956;, &#963;^2 / n)</code> assuming that the value of <code>n</code> is sufficiently large, as well as a standard error given by <code>SE(X&#772;_n) = &#963; / sqrt(n)</code>. When computing a 95% confidence interval, we consider a normal distribution <code>N(0, 1)</code> and try to find a bound that includes 95% of the probability mass for this distribution; see below for an illustration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!id3b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!id3b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 424w, https://substackcdn.com/image/fetch/$s_!id3b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 848w, https://substackcdn.com/image/fetch/$s_!id3b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 1272w, https://substackcdn.com/image/fetch/$s_!id3b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!id3b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png" width="1456" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:598532,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!id3b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 424w, https://substackcdn.com/image/fetch/$s_!id3b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 848w, https://substackcdn.com/image/fetch/$s_!id3b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 1272w, https://substackcdn.com/image/fetch/$s_!id3b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d275107-1fc1-443b-93ed-0b24e895aa71_2962x1464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">95% CI for a standard normal distribution</figcaption></figure></div><p>Given a standard normal distribution, we have <code>P(|Z| &lt; 1.96) = 0.95</code>. This is a two-sided confidence interval, meaning 2.5% of the total 5% of probability mass outside our confidence interval is allocated to each side of the distribution. In most cases, however, we will want to compute a confidence interval for a non-standard normal distribution. To do this, we can just standardize the distribution as discussed previously. For example, given our distribution <code>N(&#956;, &#963;^2 / n)</code> from the CLT, we can derive a standardized variable <code>Z</code> that follows a standard normal distribution. From here, we can just transform the confidence interval with the same standardization process; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zAwi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zAwi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 424w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 848w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 1272w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zAwi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png" width="1456" height="405" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:405,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zAwi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 424w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 848w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 1272w, https://substackcdn.com/image/fetch/$s_!zAwi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5595202-4a2b-4ddb-999a-9e3db7855ac3_2353x654.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing 95% CI for a normal distribution</figcaption></figure></div><p>This approach yields a formula&#8212;<em>based upon our sample size and the standard error of our sample mean</em>&#8212;that can be used to compute a 95% confidence interval. </p><h2><a href="https://arxiv.org/abs/2411.00640">A Statistical Approach to LLM Evaluations</a> [1]</h2><p>Now that we have built a solid statistical foundation, we can use these ideas to create a framework for LLM evaluations that better quantifies uncertainty. In doing this, we can be more confident in our model evaluations and understand whether certain evaluation results are legitimate or just caused by noise. Our discussions will be based on a seminal paper from Anthropic [1] that provides several key recommendations for performing LLM evaluations in a way that is grounded in statistics, rather than just comparing raw performance metrics. </p><blockquote><p><em>&#8220;Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations.&#8221;</em> - from [1]</p></blockquote><p><strong>Statistical framing for LLM evaluations.</strong> In theory, when evaluating an LLM, there exists a super-population of questions (illustrated below) that exhaustively covers all the ways in which the LLM can be evaluated. Practically speaking, any evaluation dataset represents only a finite subset of questions from this super-population, as represented by the red shaded region in the figure below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bgei!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bgei!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 424w, https://substackcdn.com/image/fetch/$s_!bgei!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 848w, https://substackcdn.com/image/fetch/$s_!bgei!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 1272w, https://substackcdn.com/image/fetch/$s_!bgei!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bgei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png" width="326" height="236.81132075471697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:848,&quot;resizeWidth&quot;:326,&quot;bytes&quot;:58825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bgei!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 424w, https://substackcdn.com/image/fetch/$s_!bgei!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 848w, https://substackcdn.com/image/fetch/$s_!bgei!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 1272w, https://substackcdn.com/image/fetch/$s_!bgei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f66323-e52f-4aeb-adcc-4189318d419b_848x616.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Sampling from a super-population</figcaption></figure></div><p>This framing can be used to rethink our perspective on model evaluations. Instead of trying to maximize the performance of our model on a finite benchmark, we should be trying to improve an underlying skill of the model. Any evaluation dataset captures a corresponding skill imperfectly, as it is only a finite sample from the super-population that is associated with that skill.</p><p><strong>Key recommendations.</strong> There are a set of concrete recommendations proposed in [1] that outline how one can approach LLM evaluations in a rigorous manner. We first outline these recommendations here, then spend the rest of this section explaining each of them in more depth: </p><ol><li><p>When questions are IID, LLM evaluation results should be accompanied by standard errors that are computed using the CLT.</p></li><li><p>If questions are not IID (e.g., drawn from related clusters or groups), then our CLT standard error formula is no longer valid and we should instead compute a clustered standard error. </p></li><li><p>To reduce the variance of evaluation results, we can re-sample outputs from the LLM multiple times&#8212;<em>or even analyze next token probabilities</em>&#8212;to better account for the variance of each individual evaluation result.</p></li><li><p>When comparing two models, we can perform analysis of their paired difference (i.e., rather than just providing separate, aggregated evaluation scores over the dataset) to yield a more confident result. </p></li></ol><p><strong>Preliminaries.</strong> The evaluation dataset in [1] is assumed to contain <code>n</code> questions, and each question receives an evaluation score <code>s_i</code>; e.g., a <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">binary correctness signal</a> or an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM-as-a-Judge</a> score. A score can be decomposed as <code>s_i = x_i + &#1013;_i</code>, where <code>x_i</code> is the expected score (i.e., <code>E[s_i] = x_i</code>) and <code>&#1013;_i</code> adds randomness to the score. We assume zero-mean randomness (i.e., <code>E[&#1013;_i|i] = 0</code>) that does not change the expected score. Put simply, this setup models a non-deterministic evaluation setting. Notably, LLM evaluation is fundamentally non-deterministic, as it involves sampling from the <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token distribution</a> of one or more LLMs (i.e., the model being evaluated and possibly an LLM judge).</p><h4>Standard Errors and the CLT</h4><p>The simplest case when analyzing evaluation results is when each question <code>i</code> is independent. Our goal in analyzing an evaluation result is to understand the true performance of our model, represented by the mean score <code>&#956; = E[s] = E[x]</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> from our super-population. We only have access to a finite set of scores from our evaluation dataset. However, we know from the law of large numbers that we can estimate the true mean by taking a sample mean <code>s&#773;</code> over a finite set of evaluation scores. This estimator approaches <code>&#956;</code> as the value of <code>n</code> becomes larger.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5xxW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5xxW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 424w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 848w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5xxW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png" width="572" height="425.9047619047619" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1032,&quot;width&quot;:1386,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:247099,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5xxW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 424w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 848w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!5xxW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F922cfd8e-bdc5-4d5f-b6ef-ce75f47266ce_1386x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard error and confidence interval for LLM evaluations (from [1])</figcaption></figure></div><p>In other words, <em>taking an average score over a large number of independently-sampled questions generally provides a good estimate of a model&#8217;s true performance. </em>However, &#8220;good&#8221; is difficult to quantify, and how do we know if <code>n</code> is sufficiently large? To quantify uncertainty, we can use the CLT to compute the standard error for our sample mean; see above. As we can see, this expression is identical&#8212;<em>other than switching </em><code>x</code><em> with </em><code>s</code>&#8212;to our previously-derived standard error expression. We can also derive a confidence interval from the standard error similarly to before.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SvuV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SvuV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 424w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 848w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 1272w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SvuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png" width="572" height="173.25" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:1456,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:102246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SvuV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 424w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 848w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 1272w, https://substackcdn.com/image/fetch/$s_!SvuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ad266f-3487-44e1-99cc-08057fbf79e4_1542x467.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard error with a Bernoulli variable (from [1])</figcaption></figure></div><p>If we assume a Bernoulli distribution&#8212;<em>meaning that for all </em><code>i</code><em> we have </em><code>s_i &#1013; {0, 1}</code>&#8212;this expression can be simplified even further; see above. However, the Bernoulli formula requires that scores are truly binary (i.e., not fractional<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>). </p><blockquote><p><em>&#8220;We suggest reporting the standard error of the mean alongside (beneath) the mean when reporting eval scores.&#8221;</em> - from [1]</p></blockquote><p>Now that we know how to compute these quantities for an LLM evaluation, the recommendation in [1] is simple: <em>just report this standard error and the number of samples </em><code>n</code><em> alongside the actual evaluation result</em>. Computing this standard error is not difficult&#8212;<em>it requires forming a sample estimate of the standard deviation of </em><code>s</code>. A toy example of the proposed reporting structure for two models evaluated over three evaluation datasets is provided in the table below for reference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sKjZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sKjZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 424w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 848w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 1272w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sKjZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png" width="1456" height="439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sKjZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 424w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 848w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 1272w, https://substackcdn.com/image/fetch/$s_!sKjZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b751cf9-3002-4cf8-b777-80b065b27fcd_1704x514.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>From the standard error, we can compute a confidence interval for each model&#8217;s evaluation metric. These intervals summarize uncertainty in the estimated mean performance. When comparing models, non-overlapping confidence intervals suggest a real performance difference, but overlapping intervals do not by themselves rule one out. A precise comparison requires directly analyzing the difference between the models, which we will handle in a future section. </p><p>As an example, confidence intervals for the table above have been computed below for all model and dataset combinations. We see here that all models have overlapping confidence intervals. In future sections, we will learn methods that can be used to compare models with a greater level of precision.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dnha!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dnha!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 424w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 848w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dnha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png" width="1456" height="341" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:341,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88761,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dnha!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 424w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 848w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 1272w, https://substackcdn.com/image/fetch/$s_!Dnha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff77e8f45-d6d9-4588-a068-70a5616bec1f_1854x434.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Confidence intervals for model evaluation scores</figcaption></figure></div><p><strong>Bootstrapping</strong> is another common approach to use for evaluating machine learning models (including LLMs) that proceeds as follows:</p><ol><li><p>Sample <code>n</code> question scores with replacement. </p></li><li><p>Compute the sample mean <code>s&#773;</code>.</p></li><li><p>Repeat steps 1-2 multiple times. </p></li><li><p>Measure the standard deviation of these sample means.</p></li><li><p>Use this standard deviation as an estimate of the standard error. </p></li></ol><p>While this approach is valid and <a href="https://github.com/openai/evals">commonly used</a> in LLM evaluations, authors in [1] argue that bootstrapping is unnecessary when the CLT is valid. Therefore, we can just use the CLT when questions are sampled independently, <code>n</code> is sufficiently large, and the variance of our scores is finite. However, the CLT does fall short when <code>n</code> is small&#8212;<em>the handling of this evaluation regime is discussed extensively in [2]</em>. </p><h4>Clustered Errors</h4><blockquote><p><em>&#8220;We show how to use clustered standard errors, a technique developed in the social sciences, to account for the dependence and correlation structure present in question clusters.&#8221;</em> - from [1]</p></blockquote><p>If questions are not sampled independently, the standard error expression from the CLT is no longer valid. In this case, the CLT underestimates uncertainty&#8212;<em>our confidence intervals are too narrow</em>. We are evaluating on <code>n</code> questions, but some of the questions are actually related to each other. As a result, the &#8220;effective&#8221; number of evaluation questions is smaller than <code>n</code>, thus increasing the standard error. Some practical examples of non-independent questions include:</p><ul><li><p>The same prompt in different languages. </p></li><li><p>Prompts that reference the same document or source.</p></li><li><p>Questions that are generally related in format or topic.</p></li></ul><p>To avoid underestimating uncertainty, authors in [1] recommend using a <a href="https://arxiv.org/abs/1710.02926">clustered standard error</a>. We use <code>s_{i, c}</code> to denote the score for question <code>i</code> in cluster <code>c</code>. The cluster-adjusted standard error assumes that clusters are independent: <em>questions in a cluster can be correlated, but questions across clusters cannot</em>.</p><p>To evaluate an LLM on these clusters, we still compute the sample mean across all question scores <code>S&#773;</code>, but we modify our standard error expression. Before, we assumed that scores <code>S_i</code> were IID, which implies that <code>Cov(S_i, S_j) = 0</code> when <code>i &#8800; j</code>. When questions are clustered, we no longer have zero covariance, so we need to adjust our derivation of the standard error; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bKWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bKWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 424w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 848w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bKWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png" width="1456" height="805" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:307477,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bKWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 424w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 848w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!bKWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff97d7c72-6434-4015-b8ce-71b44913be18_1849x1022.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The above clustered standard error equation interpolates between two cases:</p><ol><li><p>Scores within a cluster are perfectly correlated and each cluster is treated as if it were a single question <code>i</code>. </p></li><li><p>Scores within a cluster have no correlation, so our expression reduces to the original standard error expression from the CLT.</p></li></ol><div class="pullquote"><p><em>&#8220;The clustered standard error acts as a kind of sliding scale between cases where scores within a cluster are perfectly correlated (in which case each cluster acts as a single independent observation) and perfectly uncorrelated (in which case the clustered standard error is equivalent to the unclustered case). The intra-cluster correlations&#8230; are captured by the triple summation (over clusters and cross-terms within clusters).&#8221; - from [1]</em></p></div><p>When questions are not sampled independently, authors in [1] recommend reporting cluster-adjusted standard errors, as well as the number of questions <code>n</code> and the number of clusters <code>C</code>; see below. Similarly to before, the cluster-adjusted standard error can be used to compute a confidence interval. In practice, the clustered standard error may be drastically larger than the CLT standard error. For example, authors provide a concrete example in [1] where the standard error increases by 3&#215; when accounting for clusters. <em>Failing to consider whether questions are actually independent can drastically impact the interpretation of evaluation results</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O2wi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O2wi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 424w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 848w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 1272w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O2wi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141909,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O2wi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 424w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 848w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 1272w, https://substackcdn.com/image/fetch/$s_!O2wi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3400c02-a7e0-4d19-bcb1-604881039c3b_1958x578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>We assume that questions are sampled independently in future sections unless stated otherwise. However, we can use similar steps as outlined above to derive most results in a cluster-adjusted fashion. Many of the derivations extend to the clustered setting once the covariance structure is accounted for appropriately.</p><h4>Reducing Variance</h4><p>We now understand how to compute standard errors and confidence intervals for our evaluation results. The next reasonable question to ask is: <em>What can we do to reduce the standard error?</em> First, recall that our evaluation score is defined as <code>s_i = x_i + &#1013;_i</code>, where we have <code>E[s_i] = x_i</code> and <code>Var(&#1013;_i) = &#963;_i^2</code>. To answer this question, we begin with our expression for the standard error and perform a decomposition with the law of total variance; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1q7L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1q7L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 424w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 848w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 1272w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1q7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png" width="1456" height="669" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1q7L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 424w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 848w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 1272w, https://substackcdn.com/image/fetch/$s_!1q7L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b0e5de0-c9d2-4840-a0bc-3b5c19d7f956_1936x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To apply the law of total variance, we use the following two random variables:</p><ul><li><p>A random variable over evaluation scores <code>S</code>.</p></li><li><p>A random variable over the question that gets sampled <code>I</code>.</p></li></ul><p>We apply the law of total variance by conditioning <code>S</code> on <code>I</code>, where <code>X_I = E[S|I]</code> is the expected score for the sampled question <code>I</code>. We can then further simplify the equation using known properties of the mean and variance of a score.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!33L1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!33L1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 424w, https://substackcdn.com/image/fetch/$s_!33L1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 848w, https://substackcdn.com/image/fetch/$s_!33L1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 1272w, https://substackcdn.com/image/fetch/$s_!33L1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!33L1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png" width="442" height="177.05024769992923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1413,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:117269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!33L1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 424w, https://substackcdn.com/image/fetch/$s_!33L1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 848w, https://substackcdn.com/image/fetch/$s_!33L1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 1272w, https://substackcdn.com/image/fetch/$s_!33L1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffec984b4-30fc-4a0b-9f41-420b04304b18_1413x566.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>This derivation yields the variance expression shown above, which provides some actionable insights. First, we see that the simplest method for reducing variance is simply increasing <code>n</code>&#8212;<em>evaluating over a larger set of questions naturally improves reliability</em>. Additionally, <code>Var(x)</code> captures the variability in the mean score across our evaluation dataset&#8212;<em>this is a fundamental property of our super-population that cannot be easily changed</em>. In simple terms, this quantity captures the spread in question difficulty across all possible evaluation questions. However, there are several approaches we can explore for decreasing the value of <code>E[&#963;_i^2]</code>.</p><p><strong>Resampling</strong> can be used to reduce score variance when evaluating any model. Instead of generating and scoring a single output per question, we generate and score <code>K</code> outputs for the same question <code>i</code> (i.e., by sampling multiple completions from the LLM). In [1], authors assume that resampled scores for a fixed question <code>i</code> are IID. After sampling <code>K</code> scores, we can take an average of the scores <code>S&#773;_i</code>, which decreases the score variance by a factor of <code>K</code>; see below for a full derivation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NHsE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NHsE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 424w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 848w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 1272w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NHsE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png" width="1456" height="763" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:763,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NHsE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 424w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 848w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 1272w, https://substackcdn.com/image/fetch/$s_!NHsE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2840ce4-9ac2-4d2f-a6c6-367670d2d240_1728x906.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Therefore, resampling&#8212;<em>or producing </em><code>K</code><em> scores for question </em><code>i</code><em> to yield a mean score </em><code>S&#773;_i</code>&#8212;provides a linear reduction of the within-question variance <code>&#963;_i^2</code> compared to using a single score. The variance for our sample mean has two key terms&#8212;<code>Var(x)</code> and <code>E[&#963;_i^2]</code>&#8212;that are summed together in the numerator. As mentioned before, <code>Var(x)</code> is not mutable, so to reduce variance we can&#8212;<em>in addition to increasing </em><code>n</code>&#8212;increase the value of <code>K</code> until <code>E[&#963;_i^2 / K] &#8810; Var(X)</code>. By doing this, the within-question variance term shrinks toward zero and the variance of our sample mean approaches <code>Var(x) / n</code>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LrJw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LrJw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 424w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 848w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 1272w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LrJw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png" width="380" height="69.35361216730038" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:240,&quot;width&quot;:1315,&quot;resizeWidth&quot;:380,&quot;bytes&quot;:77942,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LrJw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 424w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 848w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 1272w, https://substackcdn.com/image/fetch/$s_!LrJw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F567df377-9676-4be6-9ae5-20bb6af6d5d2_1315x240.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Token probabilities. </strong>If an evaluation metric can be computed from the model&#8217;s next token probabilities, we can replace a sampled score with its conditional expectation&#8212;<em>basically just the probability of the correct response</em>&#8212;and remove the within-question variance (i.e., meaning that <code>&#963;_i^2 = 0</code>). Using output token probabilities, we can easily compute the probability of a response from our LLM. For example, if our response is just a single token (e.g., a multiple choice answer), then we know that the probability for this score is the probability of that token within the LLM&#8217;s next token distribution. If our response is more complex (i.e., multiple tokens), then we can also compute the probability of the entire response via the product of probabilities for each individual token; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_IeO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_IeO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 424w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 848w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 1272w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_IeO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png" width="352" height="178.41758241758242" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f52c27e7-c025-4498-b68a-b493045dc957_1710x867.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1456,&quot;resizeWidth&quot;:352,&quot;bytes&quot;:191224,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_IeO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 424w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 848w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 1272w, https://substackcdn.com/image/fetch/$s_!_IeO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff52c27e7-c025-4498-b68a-b493045dc957_1710x867.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Probability of a multi-token response</figcaption></figure></div><p>For a question <code>i</code>, we will refer to the probability of a response to this question as <code>p_i</code>. If we have access to this probability, then we can use <code>s_i = x_i = p_i</code>, and the variance term for our score goes away (i.e., <code>&#963;_i^2 = 0</code>). As a result, directly using token probabilities is an effective variance reduction technique. In [1], authors also recommend against changing the sampling temperature&#8212;<em>both for resampling and with token probabilities</em>&#8212;because this alters the underlying response distribution and, in turn, the evaluation target. For this reason, these results study a different model configuration that is not fully comparable to our original LLM. </p><div class="pullquote"><p>&#8220;We recommend a two-pronged variance-reduction strategy. When next-token probabilities are available, and the LLM eval can be conducted using next-token probabilities (i.e. without token generation), compute the expected score for each question, and compute the standard error of expected scores across questions. When next-token probabilities are not available, or the answer requires a chain of thought or other complex interaction, choose a <code>K</code> such that <code>E[&#963;_i^2] / K &#8810; Var(x) </code>and compute the standard error across question-level mean scores. In neither case should the sampling temperature be adjusted for the sake of reducing variance in the scores.&#8221; - from [1]</p></div><p>Going further, we should note that this approach cannot be used in all cases. First of all, many closed LLMs do not provide direct access to token probabilities. Even if these probabilities are available, using them to compute <code>p_i</code> can be complex depending on the evaluation setup. For example, long-form responses with many tokens&#8212;<em>though their probability can be computed</em>&#8212;will usually be evaluated with an LLM judge, which uses a sampling procedure of its own and, therefore, adds variability into the resulting score. Additionally, recent reasoning models output a <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning trajectory</a> alongside their final response, which makes computing the output probability more complicated. In these cases, correctly computing <code>p_i</code> is not straightforward, and we cannot assume zero variance by setting <code>x_i = p_i</code>. In these cases, the resampling strategy described above is a better approach.</p><h4>Model Comparisons</h4><p>Now that we deeply understand how to analyze an evaluation score for a single model, we need to focus more on properly comparing the evaluation results of multiple different models. Usually, the goal of evaluation is to understand the performance of a model with respect to other models; e.g., determining if a new model version is better than the current or creating a leaderboard of the best models for a certain evaluation task. Although the techniques we have learned about so far can be applied to comparing evaluation results, we can usually make comparisons more statistically efficient by performing a pairwise analysis.</p><p><strong>Difference of means.</strong> As we saw when learning about standard errors and confidence intervals, a common comparison heuristic is to compute separate confidence intervals for multiple models and check whether they overlap. If two 95% confidence intervals do not overlap, then there is a statistically significant difference between the evaluation results. As we will see, however, this test is actually overly conservative for detecting performance differences<em>&#8212;intervals can overlap even when there is a statistically significant difference in mean scores</em>. Instead, we can analyze the difference in mean between two models; see below. We will refer to the two models being compared as model <code>A</code> and model <code>B</code> for simplicity. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QVGb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QVGb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 424w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 848w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 1272w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QVGb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png" width="363" height="196.625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1104,&quot;resizeWidth&quot;:363,&quot;bytes&quot;:110231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QVGb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 424w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 848w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 1272w, https://substackcdn.com/image/fetch/$s_!QVGb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cab4f39-7213-4e86-ba6e-0a9bcf7515cb_1104x598.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We can compute the standard error of the estimated difference in mean scores; see below. The standard error of the estimated difference in means is the square root of the sum of the variances of the mean estimators for models <code>A</code> and <code>B</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9nF_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9nF_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 424w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 848w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 1272w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9nF_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9nF_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 424w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 848w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 1272w, https://substackcdn.com/image/fetch/$s_!9nF_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea185ec-e795-4b1c-93db-c4b5e4b7764a_1902x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this derivation, we use the variance of a difference identity, as expressed below. This identity is a special case of the variance of a sum identity we saw previously. In [1], authors consider an unpaired comparison where <code>S&#773;_A</code> and <code>S&#773;_B</code>&#8203; are treated as estimates from independent evaluation runs (e.g., computed on independent question samples) such that <code>Cov(S_A, S_B) = 0</code>. This unpaired assumption could be violated (e.g., if models are evaluated over the same set of questions)&#8212;<em>we should use the paired analysis from the next section in this case</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BVns!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BVns!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 424w, https://substackcdn.com/image/fetch/$s_!BVns!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 848w, https://substackcdn.com/image/fetch/$s_!BVns!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 1272w, https://substackcdn.com/image/fetch/$s_!BVns!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BVns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png" width="516" height="30.123626373626372" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:85,&quot;width&quot;:1456,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:110193,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BVns!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 424w, https://substackcdn.com/image/fetch/$s_!BVns!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 848w, https://substackcdn.com/image/fetch/$s_!BVns!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 1272w, https://substackcdn.com/image/fetch/$s_!BVns!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e458138-3f4d-4920-b7f2-af4a474acdf8_2462x143.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Variance of a difference identity</figcaption></figure></div><p>We can easily compute a 95% confidence interval using this standard error. To determine if one model is better than the other, we can check whether this confidence interval overlaps with a value of zero. If the 95% confidence interval does not include zero, then&#8212;<em>assuming the true difference is zero</em>&#8212;there is less than a 5% chance that we would observe a difference this extreme. Our expression for computing a 95% confidence interval has been copied below for convenience. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B2Uq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B2Uq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 424w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 848w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 1272w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B2Uq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png" width="366" height="49.74757281553398" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:112,&quot;width&quot;:824,&quot;resizeWidth&quot;:366,&quot;bytes&quot;:33119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B2Uq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 424w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 848w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 1272w, https://substackcdn.com/image/fetch/$s_!B2Uq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52437c2e-ff67-4f58-ac8a-a2e3239edbfb_824x112.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For model <code>A</code> to outperform model <code>B</code> according to this confidence interval, the difference in the mean score of models <code>A</code> and <code>B</code> must be greater than <code>1.96 &#215; sqrt(SE_A^2 + SE_B^2)</code>. If we compute separate confidence intervals for each model, then this same difference must be greater than <code>1.96 &#215; (SE_A + SE_B)</code>, which is stricter. In this way, checking overlap of separate confidence intervals is conservative, while constructing a confidence interval using the difference&#8212;<em>and checking whether it excludes zero</em>&#8212;is a better test.</p><p><strong>Paired difference.</strong> If models <code>A</code> and <code>B</code> evaluate on the same set of questions, we can further reduce variance by analyzing the question-level differences in scores. To begin, we can define question-level paired score differences as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8E2g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8E2g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 424w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 848w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 1272w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8E2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png" width="408" height="124.97802197802197" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61291549-8209-4019-b215-9574a92b4959_1628x499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:446,&quot;width&quot;:1456,&quot;resizeWidth&quot;:408,&quot;bytes&quot;:119036,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8E2g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 424w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 848w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 1272w, https://substackcdn.com/image/fetch/$s_!8E2g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61291549-8209-4019-b215-9574a92b4959_1628x499.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We can then estimate the standard error of question-level score differences by drawing upon our same standard error expression used previously; see below. We can then use this standard error to compute confidence intervals like before. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!30yN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!30yN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 424w, https://substackcdn.com/image/fetch/$s_!30yN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 848w, https://substackcdn.com/image/fetch/$s_!30yN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 1272w, https://substackcdn.com/image/fetch/$s_!30yN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!30yN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png" width="1456" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157446,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!30yN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 424w, https://substackcdn.com/image/fetch/$s_!30yN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 848w, https://substackcdn.com/image/fetch/$s_!30yN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 1272w, https://substackcdn.com/image/fetch/$s_!30yN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41356879-e701-4fd0-8b51-cade237bfb7e_1914x542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard error of the mean score difference (from [1])</figcaption></figure></div><p>We can compute this standard error as shown above, but we are mostly interested in understanding whether this expression provides a meaningful reduction in variance. Ideally, we want the above paired standard error to be smaller than that of the difference of means so that we can better detect statistically significant model differences. To determine if this is the case, we can expand the above variance expression using the variance of a difference identity; see below. Unlike the prior unpaired analysis, we no longer assume that this covariance is zero.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G2K1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G2K1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 424w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 848w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 1272w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G2K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png" width="1456" height="68" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:68,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89278,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G2K1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 424w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 848w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 1272w, https://substackcdn.com/image/fetch/$s_!G2K1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29a3f13c-0fb0-453e-9fac-62ac424b2106_2256x106.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The variance reduction for the above expression depends upon whether the mean scores of models <code>A</code> and <code>B</code> are correlated or not. If so, then this covariance term will be positive and we will see a corresponding reduction in variance. Intuitively stated, a positive correlation indicates that models <code>A</code> and <code>B</code> agree on whether certain prompts are easy or hard (i.e., per-item scores are directionally similar). </p><blockquote><p><em>&#8220;Because eval question scores are likely to be positively correlated, even across unrelated models, paired differences represent a &#8220;free&#8221; reduction in estimator variance when comparing two models. We therefore recommend using the paired version of the standard error estimate wherever practicable.&#8221;</em> - from [1]</p></blockquote><p>In practice, most LLMs tend to agree on per-prompt difficulty, so analyzing paired differences is a useful approach that can offer meaningful reductions in variance. In [1], authors recommend reporting pairwise differences, standard errors, confidence intervals, and score correlations between models; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1tI4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1tI4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 424w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 848w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 1272w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1tI4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png" width="1456" height="392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1tI4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 424w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 848w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 1272w, https://substackcdn.com/image/fetch/$s_!1tI4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8718f45-c03c-4964-a194-dbac17b95ec8_1678x452.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h4>Practical Implementation</h4><p>Although we have learned a lot of statistics throughout this discussion, actually implementing these ideas&#8212;<em>once we understand them</em>&#8212;does not add much extra complexity to the evaluation process. Computing standard errors and confidence intervals is straightforward and, once an implementation is available, can be readily adopted as a standard practice for model evaluations. However, we must be wary of the key assumptions being made when computing the standard error to avoid overconfidence; e.g., questions that are not independent require a cluster-adjusted standard error. A reference implementation of the techniques we have learned so far is provided below. This implementation outlines how all of the recommendations proposed in [1] can be applied when evaluating an LLM.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1d08f26e-197b-476a-8f54-78525b788653&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math
import numpy as np


#################################
# Evaluation settings and scores
#################################

# model names
model_a_name = "Galleon"
model_b_name = "Dreadnought"

# example scores for the models (n = 10)
model_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=float)
model_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1], dtype=float)

# form toy clusters
# two clusters, assignment based on even / odd index
clusters = np.array([i % 2 for i in range(len(model_a))])

# z score for 95% confidence interval
Z_95 = 1.96


####################
# Utility functions
####################

def mean_score(scores):
    return float(np.mean(scores))

def sample_sd(scores):
    return float(np.std(scores, ddof=1))

def ci_95(mean, se):
    return (mean - Z_95 * se, mean + Z_95 * se)

def fmt_pct(x):
    return f"{100 * x:.1f}%"

def fmt_pct_paren(x):
    return f"({100 * x:.2f}%)"

def fmt_ci(ci):
    return f"({100 * ci[0]:.2f}%, {100 * ci[1]:.2f}%)"


########################################################
# CLT SE
#
# Standard CLT SE for the sample mean: SE = s / sqrt(n)
# where s is the sample standard deviation.
########################################################

def clt_standard_error(scores):
    n = len(scores)
    return sample_sd(scores) / math.sqrt(n)


#####################################################
# Clustered SE
#
# Cluster-adjusted standard error:
# \sqrt{
#     SE_{CLT}^{2}
#     +
#     \frac{1}{n^{2}}
#     \sum_{c}\sum_{i}\sum_{j\neq i}
#     (s_{i,c} - \overline{s})(s_{j,c}-\overline{s})
# }
#####################################################

def clustered_standard_error(scores, clusters):
    n = len(scores)
    s_bar = np.mean(scores)

    # clt variance
    se_clt_sq = clt_standard_error(scores) ** 2

    # within-cluster cross terms
    cross_terms = 0.0
    for c in np.unique(clusters):
        idx = (clusters == c)
        residuals = scores[idx] - s_bar
        cross_terms += residuals.sum()**2 - np.sum(residuals**2)

    var_hat = se_clt_sq + (cross_terms / (n**2))
    return math.sqrt(var_hat)


######################################
# Summary statistics for single model
######################################

def summarize_model(scores, clusters):
    mean = mean_score(scores)

    se_clt = clt_standard_error(scores)
    ci_clt = ci_95(mean, se_clt)

    se_cluster = clustered_standard_error(scores, clusters)
    ci_cluster = ci_95(mean, se_cluster)

    return {
        "mean": mean,
        "se_clt": se_clt,
        "ci_clt": ci_clt,
        "se_cluster": se_cluster,
        "ci_cluster": ci_cluster,
        "n": len(scores),
        "num_clusters": len(np.unique(clusters)),
    }


###########################################################
# Comparing models w/ difference in means and separate SEs
###########################################################

def difference_in_means(scores_a, scores_b, clusters_a=None, clusters_b=None):
    mean_a = mean_score(scores_a)
    mean_b = mean_score(scores_b)
    diff = mean_a - mean_b

    # CLT version
    se_a_clt = clt_standard_error(scores_a)
    se_b_clt = clt_standard_error(scores_b)
    se_diff_clt = math.sqrt(se_a_clt ** 2 + se_b_clt ** 2)
    ci_diff_clt = ci_95(diff, se_diff_clt)

    # Clustered version
    if clusters_a is not None and clusters_b is not None:
        se_a_cluster = clustered_standard_error(scores_a, clusters_a)
        se_b_cluster = clustered_standard_error(scores_b, clusters_b)
        se_diff_cluster = math.sqrt(se_a_cluster ** 2 + se_b_cluster ** 2)
        ci_diff_cluster = ci_95(diff, se_diff_cluster)
    else:
        se_diff_cluster = None
        ci_diff_cluster = None

    return {
        "diff": diff,
        "se_diff_clt": se_diff_clt,
        "ci_diff_clt": ci_diff_clt,
        "se_diff_cluster": se_diff_cluster,
        "ci_diff_cluster": ci_diff_cluster,
    }


##########################################################
# Comparing models with paired instance-level differences
##########################################################

def paired_differences(scores_a, scores_b, clusters=None):
    diffs = scores_a - scores_b
    mean_diff = mean_score(diffs)

    se_paired_clt = clt_standard_error(diffs)
    ci_paired_clt = ci_95(mean_diff, se_paired_clt)

    if clusters is not None:
        se_paired_cluster = clustered_standard_error(diffs, clusters)
        ci_paired_cluster = ci_95(mean_diff, se_paired_cluster)
    else:
        se_paired_cluster = None
        ci_paired_cluster = None

    # Correlation between model scores across questions
    corr = float(np.corrcoef(scores_a, scores_b)[0, 1])

    return {
        "diffs": diffs,
        "mean_diff": mean_diff,
        "se_paired_clt": se_paired_clt,
        "ci_paired_clt": ci_paired_clt,
        "se_paired_cluster": se_paired_cluster,
        "ci_paired_cluster": ci_paired_cluster,
        "correlation": corr,
    }


############################################
# Functions to recreate tables from paper
#
# Numbers will not match exactly because we
# hard-code model scores at top of script
############################################

def print_table_2_style(model_name, summary):
    print(f"| {model_name:12s} | "
          f"mean = {fmt_pct(summary['mean']):&gt;6s}  "
          f"SE = {fmt_pct_paren(summary['se_clt']):&gt;8s}  "
          f"95% CI = {fmt_ci(summary['ci_clt'])}  "
          f"n = {summary['n']}")

def print_table_3_style(model_name, summary):
    print(f"| {model_name:12s} | "
          f"mean = {fmt_pct(summary['mean']):&gt;6s}  "
          f"clustered SE = {fmt_pct_paren(summary['se_cluster']):&gt;8s}  "
          f"95% CI = {fmt_ci(summary['ci_cluster'])}  "
          f"n = {summary['n']}, clusters = {summary['num_clusters']}")

def print_table_5_style(model_name, baseline_name, paired_results, clustered=False):
    if clustered:
        se = paired_results["se_paired_cluster"]
        ci = paired_results["ci_paired_cluster"]
        label = "paired clustered"
    else:
        se = paired_results["se_paired_clt"]
        ci = paired_results["ci_paired_clt"]
        label = "paired CLT"

    print(f"| {model_name:12s} | baseline = {baseline_name:12s} | "
          f"diff = {fmt_pct(paired_results['mean_diff']):&gt;6s}  "
          f"SE = {fmt_pct_paren(se):&gt;8s}  "
          f"95% CI = {fmt_ci(ci):&gt;18s}  "
          f"corr = {paired_results['correlation']:.3f}  "
          f"[{label}]")


################################
# Run all evaluation statistics
################################

def main():
    summary_a = summarize_model(model_a, clusters)
    summary_b = summarize_model(model_b, clusters)

    print("=" * 90)
    print("Raw toy data")
    print("=" * 90)
    print(f"{model_a_name}: {model_a}")
    print(f"{model_b_name}: {model_b}")
    print(f"clusters:  {clusters}")
    print()

    print("=" * 90)
    print("Table 2 style: CLT standard errors / confidence intervals")
    print("=" * 90)
    print_table_2_style(model_a_name, summary_a)
    print_table_2_style(model_b_name, summary_b)
    print()

    print("=" * 90)
    print("Table 3 style: Clustered standard errors / confidence intervals")
    print("=" * 90)
    print_table_3_style(model_a_name, summary_a)
    print_table_3_style(model_b_name, summary_b)
    print()

    diff_means = difference_in_means(model_a, model_b, clusters, clusters)
    paired = paired_differences(model_a, model_b, clusters)

    print("=" * 90)
    print("Model comparison method 1: difference in means")
    print("=" * 90)
    print(f"Naive / CLT version:")
    print(f"  diff = {fmt_pct(diff_means['diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(diff_means['se_diff_clt'])}")
    print(f"  95% CI = {fmt_ci(diff_means['ci_diff_clt'])}")
    print()

    print(f"Clustered version:")
    print(f"  diff = {fmt_pct(diff_means['diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(diff_means['se_diff_cluster'])}")
    print(f"  95% CI = {fmt_ci(diff_means['ci_diff_cluster'])}")
    print()

    print("=" * 90)
    print("Model comparison method 2: paired instance-level differences")
    print("=" * 90)
    print(f"Question-level differences (A - B): {paired['diffs']}")
    print(f"Correlation(A, B) = {paired['correlation']:.3f}")
    print()

    print(f"Paired CLT version:")
    print(f"  mean diff = {fmt_pct(paired['mean_diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(paired['se_paired_clt'])}")
    print(f"  95% CI = {fmt_ci(paired['ci_paired_clt'])}")
    print()

    print(f"Paired clustered version:")
    print(f"  mean diff = {fmt_pct(paired['mean_diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(paired['se_paired_cluster'])}")
    print(f"  95% CI = {fmt_ci(paired['ci_paired_cluster'])}")
    print()

    print("=" * 90)
    print("Table 5 style: pairwise reporting")
    print("=" * 90)
    print_table_5_style(model_a_name, model_b_name, paired, clustered=False)
    print_table_5_style(model_a_name, model_b_name, paired, clustered=True)

main()</code></pre></div><h2>More Topics to Explore in Statistics</h2><p>The above section covers most of the key information needed to start taking a statistically-oriented approach to evaluating LLMs. However, once we adopt the mindset of applying statistics to LLM evaluations, we open a new realm of possibilities! In this section, we provide a brief look into other areas of statistics&#8212;<em>both from [1] and beyond</em>&#8212;that can be applied to LLM evaluations, as well as highlight a few extra papers on the topic for future reading and motivation. </p><h4><a href="https://arxiv.org/abs/2411.00640">Power Analysis for LLM Evals</a> [1]</h4><p>For most of the overview so far, we have focused upon measuring uncertainty and reducing variance so that we can have more confidence when evaluating and comparing models. The techniques we have learned about are primarily focused on post-hoc analysis, and we have not spent much time considering the validity of the actual evaluation process itself. In [1], authors go beyond their discussion of standard errors, confidence intervals, and model comparisons by closing with a practical explanation of how <a href="https://stats.oarc.ucla.edu/other/mult-pkg/seminars/intro-power/">power analysis</a> can be applied to LLM evaluations.</p><p><strong>What is power?</strong> The idea of power in statistics refers to the ability of some statistical experiment to make a valid measurement in the presence of noise. For example, we want to know whether one model actually improves over the performance of another in the LLM evaluation setting. Moving in this direction, power analysis allows us to answer the following question: <em>Is the evaluation we are using capable of detecting the kind of improvement for which we are aiming?</em></p><p>Standard errors and confidence intervals allow us to quantify the uncertainty of an evaluation result. Power analysis focuses on the complementary concept of determining the number of questions <code>n</code> needed in order to reliably detect a difference in performance of a certain size. In [1], a sample size formula is derived that allows us to compute the necessary value of <code>n</code> under different settings. By using this formula, we can do things like:</p><ul><li><p>Check whether a certain evaluation is even worth running given the number of available samples.</p></li><li><p>Determine a sufficient sample size when curating a new evaluation dataset.</p></li></ul><p><strong>Defining power.</strong> The discussion of power analysis in [1] uses the same exact setup used for paired model evaluations. We are comparing two models <code>A</code> and <code>B</code>, both models are evaluated on the same questions, and we analyze question-level score differences. Similarly to before, the true difference in means in this setting <code>&#956;_{A-B} = &#956;_A - &#956;_B</code> can be estimated with a sample mean difference <code>s&#773;_{A-B}</code>. This sample mean may or may not be near the true value due to the noise from sampling evaluation questions and the conditional randomness of each score.</p><p>Power refers to the ability to detect a real improvement when it actually exists. We define this based on a few different quantities:</p><ul><li><p><em>Significance level (</em><code>&#945;</code><em>)</em>: the desired false positive rate (i.e., probability of detecting a difference in mean when it does not actually exist). </p></li><li><p><em>Power (</em><code>1 - &#946;</code><em>):</em> the probability of detecting an effect (e.g., a true difference in mean) when it actually exists.</p></li><li><p><em>Minimum detectable effect (</em><code>&#948;</code><em>):</em> the smallest true difference in mean that we want to detect. </p></li></ul><p>The significance level and power are used to capture Type I&#8212;<em>or concluding there is a true difference when one does not actually exist</em>&#8212;and Type II&#8212;<em>or failing to detect a true difference when one does exist</em>&#8212;errors, respectively. Intuitively, we can change these values to control the probability of false alarms or missed detections.</p><div class="pullquote"><p>&#8220;The sample-size formula&#8230; ought to prove useful in several ways. Consumers of existing evals may use the formula to determine the number of questions to subsample from a large eval, or to determine an appropriate value of <code>K</code>... If the number of questions in the eval is fixed, consumers can calculate the Minimum Detectable Effect and decide whether the eval is worth running. The authors of new evals may use the formula to decide how many questions should be commissioned.&#8221; - from [1]</p></div><p>A <strong>sample size formula</strong> is provided in [1] for applying power analysis to LLM evaluations; see below. <code>z_p</code> represents the (<code>1 &#8722; p)</code>-th percentile of a standard normal distribution and is computed with the same approach we used to find the value of <code>1.96</code> in our prior confidence interval formulas. We will not go through the full details of this derivation, but the terms in this expression follow the same pattern as our prior discussion on variance reduction. We compute the question-level average variance for each model and use the variance of a difference identity to capture the variance of the question-level mean difference. We can derive a sample estimate of these variances from historical evaluation data. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lAOG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lAOG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 424w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 848w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 1272w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lAOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png" width="624" height="278.57142857142856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:650,&quot;width&quot;:1456,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:284857,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lAOG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 424w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 848w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 1272w, https://substackcdn.com/image/fetch/$s_!lAOG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3018feb7-f383-4df1-af60-1d0c70f41309_2026x904.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sample size formula (from [1])</figcaption></figure></div><p>This sample size formula provides useful intuition for statistical significance in LLM evaluations. Our evaluation requires a larger sample size when:</p><ul><li><p>The amount of variance is large.</p></li><li><p>The size of the effect being detected is small.</p></li><li><p>A stricter confidence or higher level of power is desired.</p></li></ul><p>Additionally, we can decrease the necessary value of <code>n</code> by performing resampling, revealing that our previously-outlined techniques for variance reduction are still applicable. Notably, the sample size also grows quadratically with the inverse of the minimum detectable effect: <em>detecting a gap in performance that is half the size requires 4&#215; the number of samples</em>. We can also rearrange this sample size equation to solve for the minimum detectable effect <code>&#948;</code>, allowing us to determine the smallest gap in performance that can be detected with some benchmark. </p><p><strong>Sample size implementation.</strong> To compute the above sample size formula, we must solve for the correct <code>z_p</code> values given a specified significance level <code>&#945;</code> and power <code>1 - &#946;</code>, as well as estimate the three variance terms from actual evaluation data. An example implementation is provided below for reference, which adopts the same patterns as our evaluation statistics code from the prior section.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;2a795548-d4ff-4263-9de1-5f9a2280a0ac&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math
import numpy as np
from statistics import NormalDist


##########################
# Power analysis settings
##########################

alpha = 0.05
beta = 0.20
delta = 0.10


#################################
# Model scores (three resamples)
#################################

model_a_scores = np.array([
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
], dtype=float)

model_b_scores = np.array([
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)


##########################################
# Compute z_p values from alpha and beta
#
# The paper uses:
# - z_{alpha/2}
# - z_beta, where power = 1 - beta
##########################################

def z_alpha_over_2(alpha: float) -&gt; float:
    return NormalDist().inv_cdf(1 - alpha / 2)

def z_beta(beta: float) -&gt; float:
    return NormalDist().inv_cdf(1 - beta)


#####################################
# Sample estimates of variance terms
#####################################

def question_means(score_matrix: np.ndarray) -&gt; np.ndarray:
    """
    Estimate x_i for each question i by averaging over K samples.
    """
    return np.mean(score_matrix, axis=1)

def question_conditional_variances(score_matrix: np.ndarray) -&gt; np.ndarray:
    """
    Estimate sigma_i^2 for each question i using the within-question
    sample variance across repeated samples. We cannot estimate this
    if we only have 1 resample (i.e., K = 1).
    """
    n, k = score_matrix.shape
    if k == 1:
        return np.zeros(n)
    return np.var(score_matrix, axis=1, ddof=1)

def estimate_sigma_squared(score_matrix: np.ndarray) -&gt; float:
    """
    Estimate E[sigma_i^2] by averaging the within-question variances.
    """
    return float(np.mean(question_conditional_variances(score_matrix)))

def estimate_omega_squared(
    model_a_score_matrix: np.ndarray,
    model_b_score_matrix: np.ndarray,
) -&gt; float:
    """
    Estimate:

        omega^2 = Var(x_A) + Var(x_B) - 2 Cov(x_A, x_B) = Var(x_A - x_B)

    where x_A and x_B are the question-level conditional means using sample
    variance and covariance across questions.
    """
    x_a = question_means(model_a_score_matrix)
    x_b = question_means(model_b_score_matrix)

    # compact form is Var(x_A - x_B)
    # equivalent to Var(x_A) + Var(x_B) - 2 Cov(x_A, x_B)
    """
    Expanded version:
    var_a = np.var(x_a, ddof=1)
    var_b = np.var(x_b, ddof=1)
    cov_ab = np.cov(x_a, x_b, ddof=1)[0, 1]
    return float(var_a + var_b - 2 * cov_ab)
    """
    diffs = x_a - x_b
    return float(np.var(diffs, ddof=1))

def estimate_power_analysis_variance_terms(
    model_a_score_matrix: np.ndarray,
    model_b_score_matrix: np.ndarray,
) -&gt; dict:
    """
    Estimate all variance terms needed for sample-size formula in [1].
    """
    sigma_a_sq = estimate_sigma_squared(model_a_score_matrix)
    sigma_b_sq = estimate_sigma_squared(model_b_score_matrix)
    omega_sq = estimate_omega_squared(model_a_score_matrix, model_b_score_matrix)

    n_questions, k_a = model_a_score_matrix.shape
    n_questions_b, k_b = model_b_score_matrix.shape

    return {
        "n_questions": n_questions,
        "K_A": k_a,
        "K_B": k_b,
        "omega_sq": omega_sq,
        "sigma_a_sq": sigma_a_sq,
        "sigma_b_sq": sigma_b_sq,
    }


###############################
# Sample size formula from [1]
###############################

def required_sample_size(
    delta: float,
    alpha: float,
    beta: float,
    omega_sq: float,
    sigma_a_sq: float,
    sigma_b_sq: float,
    K_A: int = 1,
    K_B: int = 1,
) -&gt; float:
    z_a2 = z_alpha_over_2(alpha)
    z_b = z_beta(beta)

    variance_term = omega_sq + sigma_a_sq / K_A + sigma_b_sq / K_B
    n = ((z_a2 + z_b) ** 2 * variance_term) / (delta ** 2)
    return float(n)


################################
# Compute Sample Size
################################

def main():
    terms = estimate_power_analysis_variance_terms(model_a_scores, model_b_scores)

    print("=" * 80)
    print("Estimated variance terms from fixed evaluation results")
    print("=" * 80)
    print(f"n_questions = {terms['n_questions']}")
    print(f"K_A = {terms['K_A']}")
    print(f"K_B = {terms['K_B']}")
    print(f"omega^2 = {terms['omega_sq']:.6f}")
    print(f"sigma_A^2 = {terms['sigma_a_sq']:.6f}")
    print(f"sigma_B^2 = {terms['sigma_b_sq']:.6f}")
    print()

    print("=" * 80)
    print("Critical z-values")
    print("=" * 80)
    print(f"alpha = {alpha:.3f}")
    print(f"beta  = {beta:.3f}")
    print(f"power = {1 - beta:.3f}")
    print(f"z_(alpha/2) = {z_alpha_over_2(alpha):.6f}")
    print(f"z_beta      = {z_beta(beta):.6f}")
    print()

    n_required = required_sample_size(
        delta=delta,
        alpha=alpha,
        beta=beta,
        omega_sq=terms["omega_sq"],
        sigma_a_sq=terms["sigma_a_sq"],
        sigma_b_sq=terms["sigma_b_sq"],
        K_A=terms["K_A"],
        K_B=terms["K_B"],
    )

    print("=" * 80)
    print("Required sample size")
    print("=" * 80)
    print(f"Target effect size delta = {delta:.4f}")
    print(f"Required n &#8776; {n_required:.2f}")
    print()

main()</code></pre></div><h4><strong><a href="https://arxiv.org/abs/2503.01747">Don&#8217;t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints</a> [2]</strong></h4><blockquote><p><em>&#8220;Assumptions underlying the asymptotic, CLT-based approaches may not be suitable for LLM evals, at least in smaller data regimes. In that case, we expect to see broader failures of CLT-based confidence intervals.&#8221;</em> - from [2]</p></blockquote><p>In [2], authors extend the proposals from [1] by analyzing the effectiveness of the CLT in the small data regime (i.e., <code>n &lt;= 100</code>). As we have learned, the CLT implies that sample means approach a normal distribution as the sample size <code>n</code> increases, but the value of <code>n</code> may have to be in the hundreds or thousands for this property to hold&#8212;<em>the point at which n becomes &#8220;big enough&#8221; is difficult to determine a priori.</em> The key insight from [2] is that the CLT underestimates uncertainty when there is limited evaluation data. This problem is worsened by the fact that LLM benchmarks are becoming increasingly specialized, leading many to the creation of many smaller benchmarks that capture performance on particular tasks (e.g., the popular <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-Bench Verified</a> benchmark contains only 500 questions).</p><p><strong>CLT simulations.</strong> The shortcomings of the CLT are demonstrated in [2] via extensive simulation experiments with a known ground truth that permit directly verifying a confidence interval. From these simulations, we see that CLT-based methods consistently fail in small-data regimes by producing confidence intervals that are too narrow and overly-confident. Several scenarios are considered in the simulations in [2] that mostly align with evaluation setups from [1]:</p><ul><li><p><em>IID questions</em>: model performance is measured on IID questions (assumed to be binary in [2]).</p></li><li><p><em>Clustered questions</em>: model performance is analyzed on questions that are not IID (i.e., the clustered setting from [1]).</p></li><li><p><em>Unpaired model comparison</em>: model performance is measured over separate question sets and compared between models. </p></li><li><p><em>Paired model comparison</em>: model performance is measured on an identical set of questions and compared between models. </p></li></ul><p>Across all settings, evidence presented against CLT-based methods in the small <code>n</code> regime is clear&#8212;the <em>CLT fails across all scenarios when </em><code>n &lt; 100</code>. Such findings emphasize the fact that the CLT, despite being simple and powerful, makes underlying assumptions (i.e., IID variables, finite variance, and sufficiently large <code>n</code>) that degrade its effectiveness when violated. Authors in [2] do not recommend against using the CLT. Rather, they encourage awareness of these assumptions and limitations so that the CLT can be avoided in situations where it does not apply. Specifically, the key failure case for the CLT in [2] is when <code>n &lt; 100</code>. </p><div class="pullquote"><p>&#8220;It may be argued that CLT-based methods are usually sufficient in practice when their assumptions are satisfied. We do not disagree. However, we argue that it is safer to use the more robust strategies laid out in this paper, which are just as easy to apply, perform no worse for large n and perform substantially better in the small-n setting&#8230; knowing whether a certain n is large enough for the CLT to hold would be extremely context-dependent and difficult to determine a priori.&#8221; - from [2]</p></div><p>One other specific issue highlighted with the CLT in [2] is cases where models begin to achieve either perfect or zero accuracy. On especially small datasets, it is possible that an LLM either answers all questions correctly or&#8212;<em>in the case of a tiny but non-trivial dataset like <a href="https://huggingface.co/datasets/opencompass/AIME2025">AIME</a></em>&#8212;answers no questions correctly. In these cases, confidence intervals produced with the CLT will become overly narrow because all scores are the same, thus worsening issues with overconfidence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lh-G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lh-G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 424w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 848w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 1272w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lh-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png" width="1456" height="447" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:465779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lh-G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 424w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 848w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 1272w, https://substackcdn.com/image/fetch/$s_!Lh-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F037a4e5a-02e0-4fd5-be9a-6ff804fd89e5_2490x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Alternative approaches.</strong> Although the full details of these techniques are beyond the scope of this post, authors in [2] provide several possible alternative methods for computing confidence intervals. Most prominent among these techniques are Bayesian methods, which are less sensitive to the value of <code>n</code> and can provide narrower confidence intervals relative to the CLT. As shown above, Bayesian intervals are still relatively straightforward to compute and can be extended to handle important settings such as clustered questions or model comparisons. A brief overview of several alternative techniques&#8212;<em>including Bayesian methods</em>&#8212;alongside their benefits and drawbacks is provided below for reference. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0kBp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0kBp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 424w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 848w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 1272w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0kBp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png" width="718" height="191.22613065326632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:1194,&quot;resizeWidth&quot;:718,&quot;bytes&quot;:75544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0kBp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 424w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 848w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 1272w, https://substackcdn.com/image/fetch/$s_!0kBp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2fa07e8-69f4-482b-bacd-f62d1270dfa0_1194x318.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2406.10229">Quantifying Variance in Evaluation Benchmarks</a> [3]</strong></h4><blockquote><p><em>&#8220;If we cannot trust our evaluation results or do not understand what improvements are statistically significant, we cannot make sound comparisons, making it more challenging to reliably use benchmarks.&#8221;</em> - from [3]</p></blockquote><p>As we have learned, most LLM evaluations just report a single deterministic score (e.g., an accuracy of 70% on <a href="https://arxiv.org/abs/2009.03300">MMLU</a>) without explicitly accounting for variability. Small score differences are often used to claim superior performance of a model, but it is usually unclear whether a small difference is attributable to noise or actually a meaningful capability improvement. This issue causes misleading or even incorrect results on benchmarks, as well as poor decision making during the model development process. In [3], authors perform a deep dive into variance of LLM evaluation using 13 popular benchmarks and over 280 models.</p><p><strong>Measuring variability.</strong> In order to perform a large-scale analysis of benchmark variance, a broad group of LLMs is curated in [3]. First, a group of seed models&#8212;<em>all based upon <a href="https://huggingface.co/meta-llama/Llama-2-7b">Llama-2-7B</a></em>&#8212;is created by training from scratch with different random initialization seeds on 210 billion tokens of data. Checkpoints are collected throughout training for each seed, resulting in 210 total snapshots that are used for evaluation. These seed models are then supplemented with an additional 41 checkpoints of Llama 1 and 2 from various training stages, as well as 32 other models across a variety of families (e.g., <a href="https://arxiv.org/abs/2503.19786">Gemma</a> and <a href="https://arxiv.org/abs/2310.06825">Mistral</a>). This group of models is then evaluated over a set of 13 benchmarks that cover a wide variety of domains like reasoning, math, general knowledge, and coding. </p><p><strong>Variance metrics.</strong> To study the variability in evaluation results, three metrics are considered in [3]:</p><ol><li><p><em>Seed variance</em> measures the standard deviation of performance across models trained with different random seeds and is reported as an average over all training checkpoints. </p></li><li><p><em>Monotonicity</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> measures the whether the sequence of evaluation scores for a model improve regularly throughout training. </p></li><li><p>The <em><a href="https://en.wikipedia.org/wiki/Signal-to-noise_ratio">signal-to-noise ratio (SNR)</a></em> of seed models is measured by dividing the mean benchmark score of the final model across different seeds by the standard deviation of scores across seeds.</p></li></ol><p><strong>Key findings.</strong> We learn in [1] that different benchmarks have drastically different variance characteristics; see below. For example, smaller benchmarks (e.g., <a href="https://cdn.aaai.org/ocs/2418/2418-10878-1-PB.pdf">COPA</a> and <a href="https://github.com/openai/human-eval">HumanEval</a>) are found to have higher seed variance and larger confidence intervals, <em>emphasizing once again the need for evaluation datasets that are sufficiently large</em>. Additionally, some benchmarks have random performance for smaller models, even after extensive training. These benchmarks may be too difficult for certain models, which reflects findings in <a href="https://cameronrwolfe.substack.com/i/179769076/evaluating-the-base-model">current research</a> showing that certain benchmarks may only be useful at a specific scale. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x_74!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x_74!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 424w, https://substackcdn.com/image/fetch/$s_!x_74!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 848w, https://substackcdn.com/image/fetch/$s_!x_74!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!x_74!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x_74!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png" width="1456" height="943" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:722331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x_74!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 424w, https://substackcdn.com/image/fetch/$s_!x_74!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 848w, https://substackcdn.com/image/fetch/$s_!x_74!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 1272w, https://substackcdn.com/image/fetch/$s_!x_74!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F915e739f-a29d-4e97-8441-d5279ed31e81_2456x1590.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>We also see in [3] that using a continuous evaluation formulation based upon token probabilities yields more reliable evaluation results compared to binary evaluations based upon correctness. Notably, this finding aligns with the variance reduction analysis provided in [1]. Specifically, authors in [3] compute continuous metrics using either:</p><ul><li><p>The probability of the correct answer token for multiple choice questions.</p></li><li><p>The log likelihood of a reference answer&#8212;<em>computed by summing the log probabilities for all tokens in a completion</em>&#8212;for open-ended generations.</p></li></ul><p>By using continuous evaluation metrics based upon these token probabilities, the SNR and monotonicity of evaluation benchmarks noticeably improve; see above. Based upon this observation, authors in [3] also reformulate the popular MMLU question-answering benchmark to be completion-based instead of using multiple choice questions&#8212;<em>this new dataset is called MMLU-Cloze</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. As shown in the figure below, this reformulation drastically reduces the variability of the benchmark, thus highlighting the benefit of using continuous metrics for LLM evaluation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yO6d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yO6d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 424w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 848w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yO6d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png" width="1456" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502499,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yO6d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 424w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 848w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!yO6d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a973728-52f6-48f6-8469-d9ef9edcac89_2038x1234.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2508.13144">Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation</a> [4]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HLiR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HLiR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 424w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 848w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 1272w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HLiR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329511,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/188458832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HLiR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 424w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 848w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 1272w, https://substackcdn.com/image/fetch/$s_!HLiR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d9b4b88-9236-44a8-8c80-844e25b6ab62_1480x708.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>During the LLM development process, we perform small-scale experiments to tune our training settings and rely upon evaluation results to determine the best setting. However, the results of small-scale experiments may not translate well to large-scale training runs, and noise in the evaluation process can lead to incorrect decisions. In [4], an SNR-based framework is proposed for assessing benchmark reliability and improving the predictive accuracy of evaluation across scales.</p><p><strong>Assessing reliability.</strong> Evaluation datasets are analyzed in [4] using an SNR metric, but a specific definition of signal and noise is proposed. Assume <code>scores</code> is an array of scores for a benchmark, where each index stores the evaluation score for a different model. We define <code>signal = [max(scores) - min(scores)] / mean(scores)</code>. In other words, the signal metric captures the relative spread of scores across models for a particular benchmark. </p><p>Noise measures variability in performance due to randomness in the training process. To compute noise, we could train multiple models with different random seeds and measure the variance in their evaluation scores. However, such an approach is computationally expensive. As a solution, noise is measured in [4] by:</p><ul><li><p>Considering the last <code>n</code> model checkpoints from the training process.</p></li><li><p>Obtaining the evaluation result for each of these checkpoints, yielding a list of evaluation scores <code>ckpt_scores</code>.</p></li><li><p>Computing <code>noise = std(ckpt_scores) / mean(ckpt_scores)</code>.</p></li></ul><p>We can then combine these metrics into a single SNR metric by taking the quotient of signal and noise. This SNR metric is helpful for analyzing benchmark reliability, as any evaluation dataset with high SNR is capable of distinguishing between different models and is relatively insensitive to training randomness.</p><p><strong>Practical tips.</strong> The SNR metric is validated in a large-scale evaluation study in [4] that considers 465 LLMs and 30 evaluation benchmarks. Across all evaluation settings, we see that benchmarks with higher SNR provide more reliable model rankings. Specifically, the correlation between SNR and decision accuracy&#8212;<em>meaning that the better model receives a higher evaluation score on a particular benchmark</em>&#8212;is found to be quite high. Several practical tips for LLM evaluations are proposed in [4] based on these observations:</p><ul><li><p>For an evaluation benchmark, we can select specific sub-tasks with the highest SNR to improve reliability. For example, authors in [4] use SNR to select 16 (of 57 total) MMLU tasks for evaluation, which improves decision accuracy and drastically reduces evaluation costs. </p></li><li><p>Instead of only evaluating the final model checkpoint, we can compute an average evaluation score across the last <code>n</code> model checkpoints to improve reliability and mitigate noise due to training randomness.</p></li></ul><p>Authors in [4] also advocate for using continuous&#8212;<em>rather than discrete</em>&#8212;metrics for evaluation. Similarly to findings in [1, 3], we see in [4] that evaluating a model based upon the log likelihood of the correct completion improves the reliability of the evaluation process, as evidenced by a clearly-improved SNR.</p><blockquote><p><em>&#8220;We calculate the bits-per-byte (BPB) using the correct continuations of each test set. The bits-per-byte is the negative log likelihood of the correct answer divided by the number of UTF-8 bytes in the answer string.&#8221;</em> - from [4]</p></blockquote><h2>Key Takeaways</h2><p>In this overview, we have learned a wide variety of tools for evaluating LLMs in an uncertainty-aware manner. To close, we will summarize what we&#8217;ve learned by outlining how each of these tools can be used when evaluating an LLM. In the simplest case, we can draw upon the CLT to derive a standard error and confidence interval along with our evaluation results. However, there are a few cases in which this approach will not yield valid results:</p><ul><li><p>If the value of <code>n</code> is small, then the CLT-based standard error expression is overly confident. We can solve this issue by evaluating over a larger dataset or using another approach (e.g., the Bayesian methods outlined in [2]) that is better equipped to deal with smaller <code>n</code>.</p></li><li><p>If evaluation questions are not independent, then we can derive a cluster-adjusted standard error to account for the relationship between questions in our evaluation dataset. </p></li></ul><p>When comparing models that are evaluated on the same questions (i.e., a paired setup), we can apply the same approaches over their question-level differences to provide a more statistically efficient estimate of which model performs better. </p><p>To reduce evaluation variance, we can use resampling, where <code>K</code> is selected such that <code>E[&#963;_i^2 / K] &#8810; Var(X)</code>. In some settings, token probabilities can be used to compute the expected score&#8212;<em>or the probability of the ground truth answer</em>&#8212; directly, thus reducing within-question variance. Such an approach has been shown in several concurrent works [1, 3, 4] to improve the stability of evaluation results. When creating an evaluation dataset, we can use power analysis&#8212;<em>or just adopt the sample size formula from [1]</em>&#8212;to determine the number of samples needed. We can also rearrange the sample size formula to find the minimum detectable effect &#948; that can be measured with a given dataset, which helps us to determine whether certain evaluations are even worth running at all. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Miller, Evan. &#8220;Adding error bars to evals: A statistical approach to language model evaluations.&#8221; <em>arXiv preprint arXiv:2411.00640</em> (2024).</p><p>[2] Bowyer, Sam, Laurence Aitchison, and Desi R. Ivanova. &#8220;Position: Don&#8217;t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.&#8221; <em>arXiv preprint arXiv:2503.01747</em> (2025).</p><p>[3] Madaan, Lovish, et al. &#8220;Quantifying variance in evaluation benchmarks, 2024.&#8221; <em>URL https://arxiv. org/abs/2406.10229</em> (2024).</p><p>[4] Heineman, David, et al. &#8220;Signal and noise: A framework for reducing uncertainty in language model evaluation.&#8221; <em>arXiv preprint arXiv:2508.13144</em> (2025).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The evaluation process is stochastic, so if we re-run the evaluation on this question multiple times we can observe a different result!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Previously, we introduced the sample variance, denoted as <code>s^2</code>. The sample standard deviation, denoted as <code>s</code>, is simply the square root of this expression. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The z-score refers to the realized value <code>z</code> of the random variable <code>Z</code>.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The reason we must assume variance <code>&#963;^2</code> is finite is so that this expression is well-defined and exists. The standard deviation and standard error are not finite or meaningful when variance is infinite. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>We write the normal distribution as <code>N(x, y)</code>, where <code>x</code> is the mean of the normal distribution and <code>y</code> is the variance. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Here, the unconditional versions of <code>s</code> and <code>x</code> (i.e., without the <code>i</code> subscript) is used, so we are taking this expectation over the entire super-population. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>More specifically, if we are reporting a metric in the range <code>[0, 1]</code> (e.g., an f1 score), then this formula cannot be used. These are fractional scores rather than binary scores with a value of either 0 or 1. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Practically, monotonicity is computed by taking the sequence of scores throughout training and measuring the <a href="https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient">Kendall rank correlation</a> between this sequence and a perfectly monotonic sequence (i.e., a sequence in which the model&#8217;s performance increases at every checkpoint throughout training). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>In the context of LLMs, a Cloze task refers to a fill-in-the-blank test where the LLM is given context (e.g., a paragraph or sentence) with missing tokens and expected to predict the missing information. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Rubric-Based Rewards for RL]]></title><description><![CDATA[Extending the benefits of large-scale RL training to non-verifiable domains...]]></description><link>https://cameronrwolfe.substack.com/p/rubric-rl</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/rubric-rl</guid><pubDate>Mon, 16 Feb 2026 10:33:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97a09d37-0d6b-493e-a68c-300f80550467_2329x1299.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9S-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9S-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 424w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 848w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 1272w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9S-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1631209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9S-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 424w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 848w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 1272w, https://substackcdn.com/image/fetch/$s_!9S-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2a326d-ce08-4c08-b61f-f6729bef3826_2322x1304.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 3, 5, 16])</figcaption></figure></div><p>Many of the recent capability gains in large language models (LLMs) have been a product of advancements in reinforcement learning (RL). In particular, RL with verifiable rewards (RLVR) has drastically improved LLM capabilities by using rules-based, deterministic correctness checks (e.g., passing the test cases for a coding problem) as a reward signal. Deterministic verifiers allow RLVR to provide a reliable reward signal that is more difficult to exploit compared to the neural <a href="https://cameronrwolfe.substack.com/p/reward-models">reward models</a> that were traditionally used for RL with LLMs. Such improved reliability has made stable RL training possible at scale, enabling the creation of powerful <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning models</a> with extensive RL training. Despite these benefits, verifiable rewards also have limitations&#8212;<em>the same properties that make RLVR reliable confine it to domains with clean, automatically-checkable outcomes</em>. </p><blockquote><p><em>&#8220;While lots of efforts have been paid on RLVR, many high-value applications of LLMs, such as long-form question answering, general helpfulness, operate in inherently subjective domains where correctness cannot be sufficiently captured by binary signals.&#8221;</em> - from [3]</p></blockquote><p>Many important applications (e.g., creative writing or scientific reasoning) are not verifiable, making RLVR difficult to apply directly. To address this gap, we need reward signals that preserve RLVR&#8217;s scalability and reliability while still working in non-verifiable settings. Rubric-based rewards are a promising step in this direction: <em>they decompose desired model behavior into structured, interpretable criteria that an LLM judge can evaluate and aggregate into a multi-dimensional reward</em>. By creating prompt-specific rubrics that specify the evaluation process in detail, we can derive a more reliable reward signal from LLM judges and, therefore, use RL training to improve model capabilities even in highly subjective domains. For this reason, rubric-based RL training, which we will cover extensively in this overview, has become one of the most popular topics in current AI research. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 60,000 others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>From LLM-as-a-Judge to Rubrics</strong></h2><p>Before learning about how rubrics can be used for RL training, we need to build a background understanding of LLM-as-a-Judge and the different setups that can be used to evaluate open-ended problems with an LLM. At the end of the section, we will connect these ideas to rubrics and RL training by overviewing existing RL training techniques and how they are being extended to non-verifiable domains. </p><h4>LLM-as-a-Judge</h4><p>Prior to the LLM era, many evaluation metrics used for generative tasks (e.g., <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> or <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>) were quite brittle. These metrics use <a href="https://en.wikipedia.org/wiki/N-gram">n-gram</a> matching (or embedding-based matching as in <a href="https://arxiv.org/abs/1904.09675">BERTScore</a>) to compare a model&#8217;s output to a golden reference answer. Though this approach works relatively well, there are some fundamental problems that arise with reference-based metrics:</p><ul><li><p>We always require a reference answer in order to perform evaluation.</p></li><li><p>Our output must be similar to this reference answer to perform well.</p></li></ul><p>As we know, LLMs are capable of solving many different tasks, and most of these tasks are open-ended in nature. For example, we can use the same LLM to do creative writing or to answer medical questions. Although these problems are quite different, they do have a fundamental similarity: <em>there are many ways to answer a question correctly.</em> Traditional reference-based metrics struggle to handle such nuanced scenarios where divergence from a chosen reference answer does not imply that an output is bad. As a result, we have seen from several papers that reference-based metrics tend to <a href="https://arxiv.org/abs/1707.06875">correlate poorly</a> with human preferences.</p><blockquote><p><em>&#8220;LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.&#8221; </em>- from [7]</p></blockquote><p><strong>LLM-as-a-Judge</strong> is a reference-free metric that prompts a foundation model to perform evaluation based upon specified criteria. Although it has limitations, this technique shows high agreement in many settings with human preferences and is capable of evaluating open-ended tasks in a scalable manner (i.e., minimal implementation changes are required). To evaluate a new task, <em>we simply need to create a new prompt that outlines the evaluation criteria for this task</em>. LLM-as-a-Judge was <a href="https://lmsys.org/blog/2023-03-30-vicuna/">originally proposed</a> after the release of GPT-4. This metric quickly gained popularity due to its utility and simplicity, culminating in the publication of an in-depth technical report [7]. Today, LLM-as-a-Judge is a widely-used technique in LLM evaluation; e.g., <a href="https://tatsu-lab.github.io/alpaca_eval/">AlpacaEval</a>, <a href="https://lmsys.org/blog/2023-05-03-arena/">Chatbot Arena</a>, <a href="https://lmsys.org/blog/2024-04-19-arena-hard/">Arena-Hard</a>, and more. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zyZu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zyZu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 424w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 848w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zyZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png" width="1456" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zyZu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 424w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 848w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!zyZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d136a6-2eb6-4158-8f85-55fa26fa3c8f_1974x1234.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLM-as-a-Judge prompt formats (from [7])</figcaption></figure></div><p><strong>Scoring setups.</strong> When performing evaluation with an LLM, there are a few different scoring setups that are commonly used (shown above):</p><ol><li><p><em>Pairwise (preference) scoring</em>: the judge is presented with a prompt and two model responses and asked to identify the better response.</p></li><li><p><em>Direct assessment (pointwise) scoring</em>: the judge is given a single response to a prompt and asked to assign a score; e.g., using a 1-5 <a href="https://en.wikipedia.org/wiki/Likert_scale">Likert scale</a>.</p></li><li><p><em>Reference-guided scoring</em>: the judge is given a golden reference response in addition to the prompt and candidate response(s) to help with scoring.</p></li></ol><p>This list of scoring setups is not exhaustive, but most scoring setups for LLM-as-a-Judge use some variant or combination of the above techniques. For example, we can derive a pairwise score by scoring two responses independently and comparing their scores. In most cases, we also pair LLM-as-a-Judge with <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain-of-thought prompting</a> by asking the model to explain its evaluation process before providing a final score. Not only do such explanations make the evaluation process more interpretable, but they also improve the scoring accuracy of the LLM. Practically, implementing this change can be as simple as adding <em>&#8220;Please provide a step-by-step explanation prior to your final score&#8221;</em> to your prompt.</p><blockquote><p><em>&#8220;We identify biases and limitations of LLM judges. However, we&#8230; show the agreement between LLM judges and humans is high despite these limitations.&#8221; </em>- from [7]</p></blockquote><p><strong>Biases of LLM-as-a-Judge.</strong> Despite the effectiveness of LLM-as-a-Judge, this technique has several limitations of which we need to be aware. Fundamentally, the LLM judge is an imperfect proxy for human evaluation. By using a model for evaluation, we introduce several sources of bias into the evaluation process:</p><ol><li><p><em>Position bias</em>: the judge may favor outputs based upon their position within the prompt (e.g., the first response in a pairwise prompt).</p></li><li><p><em>Verbosity bias</em>: the judge may assign better scores to outputs based upon their length (i.e., longer responses receive higher scores).</p></li><li><p><em>Self-enhancement bias</em>: the judge tends to favor responses that are generated by itself (e.g., GPT-5 can assign higher scores to its own outputs).</p></li><li><p><em>Capability bias</em>: the judge struggles with evaluating responses to prompts that it cannot itself solve. </p></li><li><p><em>Distribution bias</em>: the judge may be biased towards certain scores in its scoring range (e.g., on a 1-5 Likert scale the judge may output mostly 3&#8217;s). </p></li></ol><p>In addition to these biases, LLM judges are generally sensitive to the details of their prompt. Therefore, we should not simply write a prompt and assume proper evaluation. We must calibrate our evaluation process, collect high-quality human labels, and tune our prompt to align well with human judgment; see <a href="https://hamel.dev/blog/posts/llm-judge/">here</a>.</p><p>There are several techniques we can adopt to combat scoring bias; e.g., in-context learning to better calibrate the judge&#8217;s score distribution, randomizing position and sampling multiple scores (i.e., position switching), providing high-quality reference answers, or using a jury of multiple LLM judges. For further details on LLM-as-a-Judge, a full overview of the topic is available at the link below. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7921ff0f-16d6-46b0-adbc-1c9eb73823a9&quot;,&quot;caption&quot;:&quot;As large language models (LLMs) have become more and more capable, one of the most difficult aspects of working with these models is determining how to properly evaluate them. Many powerful models exist, and they each solve a wide variety of complex, open-ended tasks. As a result, discerning differences in performance between these mo&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Using LLMs for Evaluation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2024-07-22T09:34:01.735Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cca744e-8ad5-4266-9680-7da4fe94f497_1878x1052.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/llm-as-a-judge&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:141159804,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:125,&quot;comment_count&quot;:14,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>LLM Evaluation with Rubrics</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cVp2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cVp2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 424w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 848w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cVp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png" width="1456" height="1071" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1071,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:449141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cVp2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 424w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 848w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 1272w, https://substackcdn.com/image/fetch/$s_!cVp2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e67d161-dacd-48a7-9bb2-eb93052fe583_1724x1268.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [15])</figcaption></figure></div><p>The prompts used for LLM-as-a-Judge in the above section are quite simple. We just describe the evaluation task at a high level and let the LLM judge output a score. However, scoring with a single, general prompt is not always the best approach. Prior work [15] has shown that we can significantly improve the reliability of LLM evaluation by:</p><ul><li><p>Creating several per-criterion scoring prompts.</p></li><li><p>Providing a step-by-step description of the evaluation process.</p></li></ul><p>Put simply, <em>providing a granular scoring prompt is beneficial</em>, and we need not stop here. We can create judge prompts targeted to each domain, task, or instance. Increasing the granularity of LLM-as-a-Judge in this way is where the idea of a rubric arises. A rubric is just a scoring prompt that provides a detailed set of criteria by which a response is evaluated; see below. In many cases, rubrics are prompt (or instance)-specific, meaning that a tailored rubric is created for each prompt-response pair being evaluated. These prompt-specific rubrics are often synthetically generated with an LLM&#8212;<em>potentially with human intervention</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cC5H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cC5H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 424w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 848w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 1272w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cC5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png" width="1450" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1450,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:276482,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cC5H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 424w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 848w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 1272w, https://substackcdn.com/image/fetch/$s_!cC5H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ccd4de0-4969-4935-ba99-dc91e21e43aa_1450x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>As we can see above, rubrics are usually checklist-style and separated into a list of distinct criteria. Each of these criteria captures a single quality dimension that can be evaluated with an LLM judge. Additionally, in many setups, weights are defined for each criterion to simplify the aggregation of criteria-level scores. Given the similarity of rubrics and vanilla LLM-as-a-Judge, the emergence of rubrics is hard to attribute to a single paper. Rather, <em>the use of rubrics was a slow transition that occurred over time as LLM-as-a-Judge prompts became more granular</em>. </p><blockquote><p><em>&#8220;HealthBench is a rubric evaluation. To grade open-ended model responses, we score them against a conversation-specific physician-written rubric composed of self-contained, objective criteria. Criteria capture attributes that a response should be rewarded or penalized for in the context of that conversation and their relative importance.&#8221;</em> - from [16]</p></blockquote><p>In recent work, prompt-specific rubrics have become heavily used for evaluation in expert domains. For example, HealthBench [16] evaluates the quality of medical conversations according to physician-written rubrics that are specific to each conversation; see below. These rubrics focus on detailed and objective criteria&#8212;<em>each associated with a weight</em>&#8212;that can be verified with an LLM to yield a binary (pass or fail) score. MultiChallenge [17]&#8212;<em>a multi-turn chat benchmark focused on tough edge cases like iterative editing, self-coherence, and instruction retention</em>&#8212;develops prompt-specific rubrics to improve benchmark reliability, finding that rubrics improve agreement between expert humans and LLM judges.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6WVC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6WVC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 424w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 848w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6WVC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png" width="1456" height="805" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3299527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6WVC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 424w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 848w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!6WVC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2b494-9f21-46c4-adf5-dbc60bd866ee_2730x1510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [16])</figcaption></figure></div><p>In this overview, we will go beyond the use of rubrics for evaluation and instead focus on the application of rubrics for deriving a reward signal in RL training. One of the biggest risks when using LLM-as-a-Judge-derived rewards for RL training is reward hacking&#8212;<em>LLM judges have known biases that can be exploited</em>. However, we see above that detailed rubrics help to make the evaluation process more reliable, thus reducing risks associated with reward hacking. </p><h4>RL with Verifiable (and Non-Verifiable) Rewards</h4><p>Though RL training has long been used for LLMs, the role of RL in LLM training pipelines has become more central with the recent advent of <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning models</a>. In general, there are two common RL paradigms used for LLMs:</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> trained on human preferences.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.</p></li></ul><p>The main difference between RLHF and RLVR is how we assign rewards&#8212;<em>RLHF uses a reward model, while RLVR uses verifiable rewards</em>. Aside from this difference, both are online RL algorithms with a similar structure; see below. For details on the inner workings of RL optimizers, please see prior posts on <a href="https://cameronrwolfe.substack.com/p/ppo-llm">PPO</a> and <a href="https://cameronrwolfe.substack.com/p/grpo">GRPO</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Impact of RLVR. </strong>Recent progress in reasoning models has been driven largely by reinforcement learning with verifiable rewards (RLVR), which derives a reward signal during RL training from deterministic (or programmatic) rules that can be reliably checked (e.g., passing unit tests for code or matching a known numerical answer in math). Using rules-based rewards lowers our risk of reward hacking because we are using a hard rule to derive our reward rather than an LLM-based reward model. As a result, we can run larger-scale RL runs (i.e., over more data and for a larger number of iterations) with less risk of training instability. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="1456" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Verifying a math problem with exact string matching</figcaption></figure></div><p>On the other hand, the same property that makes RLVR so powerful&#8212;<em>the dependence on reliable, rules-based rewards</em>&#8212;limits its applicability. Practically, we can only use RLVR on tasks with clean ground-truth labels<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> that can be checked automatically. Luckily, several important tasks fall into this category (e.g., math and coding). However, there are many other tasks we would like to solve but are subjective and difficult to verify. Due to this need for verification, we see that LLMs have advanced quickly in certain verifiable capabilities, while gains on non-verifiable tasks have been less uniform. To solve this issue, we need to develop an approach for extending recent advances in RL training to non-verifiable tasks.</p><blockquote><p><em>&#8220;In RLVR, rewards are derived from deterministic, programmatically verifiable signals&#8212;such as passing unit tests in code generation or matching the correct numerical answer in mathematical reasoning. While effective, this requirement for unambiguous correctness largely confines RLVR to domains with clear, automatically checkable outcomes.</em>&#8221; - from [2]</p></blockquote><p><strong>Open-ended domains.</strong> We typically turn to RLHF for training LLMs in open-ended settings. RLHF replaces deterministic verifiers with a learned <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> trained on preference data; see below. Preference data can be collected for any domain by simply sampling multiple completions for each prompt and having a <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">human</a> (or <a href="https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from">model</a>) select the better of the two. We can drastically increase domain coverage by using RLHF. However, relying upon preference data and reward models introduces notable difficulties and failure modes:</p><ul><li><p>A large volume of preference data must be collected.</p></li><li><p>We lose granular control over the alignment criteria&#8212;<em>preferences are expressed in aggregate over a large volume of data rather than via explicit criteria</em>.</p></li><li><p>The reward model can overfit to artifacts (e.g., response length, formatting, etc.) and generally introduces more risk of reward hacking. </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1T_j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1T_j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 424w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 848w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1272w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png" width="466" height="165.78846153846155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:1456,&quot;resizeWidth&quot;:466,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1T_j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 424w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 848w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1272w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Basic structure of preference data</figcaption></figure></div><p>RLHF is a general technique, but it is usually used in practice for improving broad, subjective properties; e.g., helpfulness, harmlessness, or style. For complex, open-ended tasks, the reward signal tends to be multi-dimensional. Traditional reward modeling captures these quality dimensions via a single preference label, which eliminates our ability to specify quality dimensions at a more granular level. One could collect criterion-level preferences to solve this issue, but doing so requires training (and maintaining) separate reward models per criterion and increases the volume of data that must be collected. A natural alternative is to make evaluation dimensions explicit by using a rubric to ground the reward in structured, interpretable criteria rather than a single judgment.</p><p><strong>Rubrics-as-Rewards.</strong> The idea of deriving a reward from a rubric-based LLM judge is one of the current frontiers of RL research&#8212;<em>it presents an opportunity to extend RLVR to arbitrary open-ended tasks</em>. Although this area of research is still nascent and evolving quickly, <em>the idea of using rubrics for RL is not new</em>! Similar ideas have already been proposed for better handling the safety alignment of LLMs. During LLM alignment, we have a detailed list of safety specifications that describe the desired behavior of the model. These specifications are changed frequently as new needs or failure cases arise in practice. The dynamic nature of safety criteria makes applying a standard RLHF approach difficult&#8212;<em>the preference data must be adjusted or re-collected each time that our criteria change</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xplG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xplG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 424w, https://substackcdn.com/image/fetch/$s_!xplG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 848w, https://substackcdn.com/image/fetch/$s_!xplG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!xplG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xplG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png" width="1456" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163631,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xplG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 424w, https://substackcdn.com/image/fetch/$s_!xplG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 848w, https://substackcdn.com/image/fetch/$s_!xplG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 1272w, https://substackcdn.com/image/fetch/$s_!xplG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f0a1e80-d29b-44e2-b4f3-92803f21a455_2042x1212.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [14])</figcaption></figure></div><p>To avoid the need for constant data collection, methods like Constitutional AI [13] and Deliberative Alignment [14] show that a reliable reward signal can be derived directly from the safety specifications themselves. More specifically, we can provide safety criteria as input to a strong reasoning model that is used to generate data or evaluate model outputs according to these criteria. Due to the strong instruction following capabilities of frontier-level reasoning models, this approach is capable of providing a reliable reward signal for safety training. </p><div class="pullquote"><p><em>&#8220;Collecting and maintaining human data for model safety is often costly and time-consuming, and the data can become outdated as safety guidelines evolve with model capability improvements or changes in user behaviors. Even when requirements are relatively stable, they can still be hard to convey to annotators. This is especially the case for safety, where desired model responses are complex, requiring nuance on whether and how to respond to requests.&#8221;</em> - from [9]</p></div><p>This approach avoids the need to re-collect data as criteria change. Rather, we just maintain a clear, itemized list of safety criteria&#8212;<em>basically a safety rubric</em>&#8212;that can be provided as input to the alignment system. Instead of collecting data, we focus on creating a &#8220;constitution&#8221; that dictates the behavior of our model. Once this constitution is available, we rely upon an LLM judge to apply the necessary supervision for achieving this desired behavior. This approach is both dynamic and interpretable, but it can only be applied in domains where the LLM judge is known to perform well. Extending similar techniques to arbitrary domains, which we will explore for the remainder of this post, is a non-trivial research problem.</p><h2>Using Rubrics for RL</h2><p>We now have a detailed understanding of LLM-as-a-Judge, rubrics, and their application to RL training. Next, we will extend these ideas by overviewing a broad collection of recent papers that study the application of rubrics to RL training. Many papers have been written on this topic in quick succession. As we will see, however, much of this work shares a similar flavor. Slowly, rubric-based RL has become more effective across a wider variety of tasks, enabling powerful reasoning models to achieve impressive gains even in non-verifiable domains. </p><h4><a href="https://arxiv.org/abs/2507.17746">Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains</a> [1]</h4><blockquote><p><em>&#8220;Rather than using rubrics only for evaluation, we treat them as checklist-style supervision that produces reward signals for on-policy RL. Each rubric is composed of modular, interpretable subgoals that provide automated feedback aligned with expert intent. By decomposing what makes a good response into tangible, human-interpretable criteria, rubrics offer a middle ground between binary correctness signals and coarse preference rankings.&#8221;</em> - from [1]</p></blockquote><p>RLVR is effective in verifiable domains with a clear correctness signal like math or coding, but there are many domains in the real world that are not strictly verifiable (e.g., science or health). For these domains, we need a more versatile reward mechanism&#8212;<em>such as an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a> or <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a></em>&#8212;that can handle open-ended problems that lack a clear or verifiable answer. Going beyond a <a href="https://cameronrwolfe.substack.com/i/141159804/different-setups-for-llm-as-a-judge">vanilla LLM-as-a-Judge setup</a>, we see in [1] that prompting the LLM judge with a rubric composed of structured, instance-specific&#8212;<em>meaning unique to each prompt</em>&#8212;criteria benefits the model&#8217;s performance in on-policy RL training.</p><p><strong>Creating rubrics.</strong> Rubrics in [1] are checklist-style and cover multiple criteria that are specific to each prompt being scored. The checklist for a rubric contains <code>K</code> total criteria <code>c_i</code>, each with a corresponding weight <code>w_i</code>. A criterion is defined as a binary correctness check that can be validated using an LLM judge. We can also recover an RLVR setup by assuming <code>K = 1</code> and letting <code>c_1</code> be a deterministically verifiable reward signal with weight  <code>w_1 = 1.0</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Gs6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Gs6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 424w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 848w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Gs6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png" width="658" height="323.125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:658,&quot;bytes&quot;:301678,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Gs6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 424w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 848w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 1272w, https://substackcdn.com/image/fetch/$s_!9Gs6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d6997c-f2da-44e6-ac55-509f073f6632_2180x1070.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Explicit versus implicit rubric aggregation</figcaption></figure></div><p>We refer to this approach of using rubrics to generate a reward signal for RL as Rubrics-as-Rewards (RaR). There are two approaches we can use to evaluate a rubric and derive a reward for RL training (shown above):</p><ul><li><p><em>Explicit aggregation</em>: each criterion is independently evaluated using an LLM judge, and the final reward is derived by summing and normalizing the weighted score of each criterion.</p></li><li><p><em>Implicit aggregation</em>: all criteria along with their weights are passed to an LLM judge, which is asked to derive a final reward that considers all information.</p></li></ul><p>Explicit aggregation provides more granular control over the weight of each criterion, which can aid in interpretability but requires tuning and can be fragile. In contrast, the implicit aggregation approach delegates the reward aggregation process&#8212;<em>including handling the weights of each criterion</em>&#8212;to the LLM judge. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hVRt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hVRt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 424w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 848w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 1272w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hVRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png" width="1456" height="564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hVRt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 424w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 848w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 1272w, https://substackcdn.com/image/fetch/$s_!hVRt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b5392d8-c6c8-4c53-889b-b7ac4b6225ed_1460x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Generating rubrics.</strong> All instance-specific rubrics used in [1] are generated by an LLM; see above. When generating rubrics, guiding principles are provided to the model with respect to how rubrics should be created. Namely, rubrics must <em>i)</em> be grounded in guidance from human experts, <em>ii)</em> be comprehensive (i.e., span many dimensions of quality), <em>iii)</em> specify per-criterion importance (e.g., factuality is more important than style), and <em>iv)</em> use self-contained criteria (i.e., criteria should not depend on one another). Given these desiderata and a golden (expert-curated) reference answer for a prompt, the LLM then generates a rubric that includes:</p><ul><li><p>7-20 self-contained criteria. </p></li><li><p>A numeric or categorical (i.e., essential, pitfall, important, or optional<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>) weight for each of these criteria.</p></li></ul><p>Numeric weights provide fine-grained control over criterion importance, but categorical weights, each of which are mapped to a numerical score, are more interpretable&#8212;<em>both for humans and the LLM</em>&#8212;which leads them to be used for experiments in [1]. Once generated, a rubric can be used as a reward function by passing it to an LLM judge and performing explicit or implicit aggregation.</p><blockquote><p><em>&#8220;We generate rubrics using OpenAI&#8217;s o3-mini and GPT-4o, conditioning generation on reference answers from the underlying datasets to approximate expert grounding. The resulting collections&#8212;<a href="https://huggingface.co/datasets/anisha2102/RaR-Medicine">RaR-Medicine</a> and <a href="https://huggingface.co/datasets/anisha2102/RaR-Science">RaR-Science</a>&#8212;are released for public use.&#8221;</em> - from [1]</p></blockquote><p><strong>Experimental settings.</strong> In [1], authors see rubrics as an opportunity to provide flexible, scalable, and interpretable reward signals for RL in real-world domains that go beyond verifiable problems like code and math. Moving in this direction, two non-verifiable domains are considered in [1]: <em>medicine and science</em>. Prompts and rubrics used for RL in [1] are sampled from a mixture of public datasets, such as <a href="https://arxiv.org/abs/2502.13124">NaturalReasoning</a>, <a href="https://arxiv.org/abs/2501.15587">SCP-116K</a>, and <a href="https://huggingface.co/datasets/RJT1990/GeneralThoughtArchive">GeneralThought-430K</a>. This data is further curated to create two datasets for RaR training in [1]:</p><ul><li><p><em><a href="https://huggingface.co/datasets/anisha2102/RaR-Medicine">RaR-Medicine</a>:</em> ~20K prompts focused on medical reasoning with instance-specific rubrics generated with GPT-4o. </p></li><li><p><em><a href="https://huggingface.co/datasets/anisha2102/RaR-Science">RaR-Science</a>:</em> ~20K prompts curated to align with the problem categories from GPQA-Diamond with instance-specific rubrics generated by o3-mini.</p></li></ul><p>All experiments use <a href="https://huggingface.co/Qwen/Qwen2.5-7B">Qwen-2.5-7B</a> as a base model and train with GRPO. Rewards are assigned using GPT-4o-mini with the instance-level rubrics described above. The proposed technique in [1], referred to as RaR-Implicit, uses LLM-generated, instance-specific rubrics with implicit aggregation as a reward signal. Several rubric-free and fixed-rubric baselines are also considered:</p><ul><li><p><em>Base models</em>: Qwen-2.5-7B and <a href="https://huggingface.co/Qwen/Qwen2.5-7B-Instruct">Qwen-2.5-7B-Instruct</a> models are evaluated with no additional training.</p></li><li><p><em>Direct Assessment Judge</em>: an LLM judge provides a direct assessment score for each response on a 10-point <a href="https://en.wikipedia.org/wiki/Likert_scale">Likert scale</a>&#8212;<em>this is a standard LLM-as-a-Judge setup that does not use a granular, instance-specific rubric</em>.</p></li><li><p><em>Reference-Based Judge</em>: same as above, but the LLM judge is given a golden reference answer as context when generating a score.</p></li><li><p><em>RaR-Predefined</em>: a fixed set of generic rubrics are used for all prompts with explicit aggregation and uniform criteria weights. </p></li><li><p><em>RaR-Explicit</em>: instance-specific rubrics are used, but all criteria receive fixed weights based on their categorical importance label.</p></li></ul><p>All models are evaluated on the <a href="https://epoch.ai/benchmarks/gpqa-diamond">GPQA-Diamond</a> (Science) and <a href="https://openai.com/index/healthbench/">HealthBench</a> (medicine) benchmarks. For some smaller ablation experiments, RL training is performed on the training set of HealthBench rather than RaR-Medicine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zcTo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zcTo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 424w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 848w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 1272w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zcTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png" width="1456" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a57ba595-9591-4539-988c-3a267ab59d87_1592x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zcTo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 424w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 848w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 1272w, https://substackcdn.com/image/fetch/$s_!zcTo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57ba595-9591-4539-988c-3a267ab59d87_1592x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Do rubrics provide useful rewards? </strong>Across all experiments in [1], we see that using structured, rubric-based rewards during RL training is beneficial. Rubric-based rewards are especially impactful when using smaller LLM judges for RL training and are found to reduce variance in reward signals across different sizes of LLM judges. As shown above, rubric-based approaches outperform all rubric-free methods aside from the reference-based LLM judge, relative to which we only see marginal gains from rubrics. However, rubrics are found to yield a more notable gain over reference-based LLM judge rewards in later experiments that train on HealthBench; see below. We also see that implicit aggregation tends to outperform explicit aggregation by a small (but consistent) margin. </p><blockquote><p><em>&#8220;Rubric-guided training achieves strong performance across domains, significantly outperforming Likert-based baselines and matching or exceeding the performance of reference-based reward generation.&#8221;</em> - from [1]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z_4o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z_4o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 424w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 848w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 1272w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z_4o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png" width="1456" height="519" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199939,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z_4o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 424w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 848w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 1272w, https://substackcdn.com/image/fetch/$s_!Z_4o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff252947f-8ea2-48ba-bc42-13ed3031c03a_2060x734.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>These experiments also highlight the necessity of expert-curated references for generating rubrics&#8212;<em>performance noticeably deteriorates without references, indicating purely synthetic rubrics are suboptimal. </em>Predefined or generic rubrics are also found to perform quite poorly, indicating that prompt-specific criteria are useful for deriving high-quality rubrics. These best practices for creating better rubrics are also evaluated beyond their impact on RL training. In [1], authors show that rubrics created via their proposed approach have noticeably higher levels of agreement with preference annotations from human experts; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!48A-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!48A-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 424w, https://substackcdn.com/image/fetch/$s_!48A-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 848w, https://substackcdn.com/image/fetch/$s_!48A-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 1272w, https://substackcdn.com/image/fetch/$s_!48A-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!48A-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png" width="1456" height="665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:665,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:277650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!48A-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 424w, https://substackcdn.com/image/fetch/$s_!48A-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 848w, https://substackcdn.com/image/fetch/$s_!48A-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 1272w, https://substackcdn.com/image/fetch/$s_!48A-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F209e88f2-4eed-4cea-a016-e0185bc3779c_2050x936.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2508.12790">Reinforcement Learning with Rubric Anchors</a> [2]</h4><blockquote><p><em>&#8220;The success / failure hinges tightly on the diversity, granularity, and quantity of the rubrics themselves, as well as on a proper training routine and meticulous data curation.&#8221; </em>- from [1]</p></blockquote><p>Authors in [2] continue studying the application of RL to open-ended tasks using rubric-based rewards. They scale the rubric creation process to produce a dataset of ~10K rubrics curated by humans, LLMs, or a combination of both. Building on this dataset, a practical exposition of rubric-based RL is provided, ultimately arriving at a functional RaR training framework called Rubicon. Interestingly, simply increasing the number of rubrics&#8212;<em>whether generated synthetically or with human assistance</em>&#8212;yields only marginal gains. Instead, we must carefully curate high-quality rubrics, suggesting that the success of RaR heavily depends upon both rubric quality and the quality of the underlying training dataset.</p><p><strong>Rubric system.</strong> Instead of using strictly instance-level rubrics, multiple scopes are considered in [2], including instance, task, and dataset-level rubrics. When generating data, the system in [2] (shown below) starts by constructing the rubric first. Data is synthesized only after the rubric is created so that it explicitly matches the rubric. Then, the combination of rubric and data is used for both RL training and evaluation. Tasks in [2] are selected according to the <a href="https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law">asymmetry of verification</a>&#8212;<em>verifying a candidate output should be much easier than generating it</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JmJ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JmJ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 424w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 848w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 1272w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JmJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png" width="1456" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JmJ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 424w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 848w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 1272w, https://substackcdn.com/image/fetch/$s_!JmJ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9d5109e-ebfd-41fd-b80b-623d7182677d_2326x906.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>To ensure rubric quality, authors run dedicated ablation experiments for every set of rubrics that is generated to measure their impact on the training process. Each rubric is comprised of <code>K</code> criteria <code>C = {c_1, c_2, &#8230;, c_K}</code>. An example of a rubric created for evaluating open-ended or creative tasks is provided below. After evaluating each of these criteria, we are left with a multi-dimensional reward vector that can be aggregated to yield a final reward. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tTh5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tTh5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 424w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 848w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tTh5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png" width="1238" height="1394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1394,&quot;width&quot;:1238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:498041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tTh5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 424w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 848w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!tTh5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bef4a7-0f51-4ea2-8c9f-554e9aa0f9d1_1238x1394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>As a baseline, criteria-level rewards can be aggregated via a weighted average, but non-linear dependencies may exist between criteria that make a weighted average suboptimal. For this reason, authors in [2] consider the following advanced strategies for criteria aggregation:</p><ul><li><p><em>Veto Mechanisms</em>: failing on a critical dimension overrides any reward from other dimensions.</p></li><li><p><em>Saturation-Aware Aggregation</em>: over-performing on a single dimension yields diminishing returns relative to a balanced reward across dimensions. </p></li><li><p><em>Pairwise Interaction Modeling</em>: criteria are modeled together to capture inter-criteria relationships (i.e., synergistic or antagonistic effects). </p></li><li><p><em>Targeted Reward Shaping</em>: rewards in high-performance regions are amplified to better capture differentials and avoid scores becoming compressed.</p></li></ul><p><strong>Training strategy.</strong> The data used in [2] is derived from a proprietary post-training corpus with ~900K examples. Prior to any training, <a href="https://cameronrwolfe.substack.com/i/179769076/rlvr-with-grpo">offline difficulty filtering</a> is performed to remove any examples on which the base model performs too poorly or already performs well<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. From here, RL training progresses in two phases, each with a different curriculum:</p><ul><li><p>The first phase focuses on instruction-following and programmatically-verifiable tasks to teach the LLM how to properly handle constraints.</p></li><li><p>The second phase extends the training process to more open-ended and creative tasks with a higher level of subjectivity.</p></li></ul><p>While the first phase primarily relies upon static rubrics and verifiers, we must use reference-based rubrics&#8212;<em>often with instance-specific criteria</em>&#8212;for the second phase. Granular rubrics help to provide a more reliable reward signal on tasks that are highly subjective. This multi-stage training framework aims to progressively cultivate the capabilities of the model. When training jointly on all tasks, authors observe a &#8220;seesaw effect&#8221;&#8212;<em>joint training actually reduces model performance relative to forming a multi-stage curriculum</em>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qXlF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qXlF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 424w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 848w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qXlF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png" width="1278" height="1018" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212079,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qXlF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 424w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 848w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!qXlF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d06eba0-dd0b-4748-b670-f48295e4cc6c_1278x1018.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Reward hacking</strong> is one of the biggest risks in a RaR setup. Whereas verifiable rewards are deterministic, neural reward models can be exploited, and the likelihood of our policy finding such an exploit increases in large-scale RL runs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The Rubicon approach proposed in [2] combats reward hacking by performing an offline analysis of rollout data. After the first phase of RL training, authors examine rollouts that yield abnormally high rewards and create a basic taxonomy of recurring reward hacking patterns that are discovered. From this taxonomy, a specific rubric is created for preventing reward hacking&#8212;<em>this rubric can also be iteratively refined over time</em>. The addition of a reward hacking rubric improves training stability (i.e., avoids collapse into a reward-hacked state) and allows RL training to be conducted for a much larger number of training steps. </p><div class="pullquote"><p>&#8220;Applying RL with rubrics from different task types could create conflicting objectives, leading to performance trade-offs &#8212; a phenomenon we refer to as the seesaw effect&#8230; training exclusively with instruction-following rubrics improves compliance but reduces creativity, while training exclusively with creativity and empathy rubrics enhances open-ended responses but harms strict adherence&#8230; These results suggest that simply combining all rubric types in a single RL run is likely to intensify such conflicts. To overcome this, we adopt a multi-stage RL strategy.&#8221; - from [2]</p></div><p><strong>Rubicon-preview</strong> is a <a href="https://huggingface.co/Qwen/Qwen3-30B-A3B">Qwen-3-30B-A3B</a> base model that is finetuned in [2] using the Rubicon framework. This model excels in open-ended and humanities-related benchmarks. For example, we see below that Rubicon-preview achieves an absolute improvement of 5.2% compared to the base model on various instruction following, emotional intelligence, and writing benchmarks. Notably, Rubicon-preview also outperforms <a href="https://arxiv.org/abs/2412.19437">DeepSeek-V3-671B</a> on most of these tasks, where an especially significant performance boost is observed on writing tasks. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MmTq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MmTq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 424w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 848w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 1272w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MmTq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png" width="1288" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1288,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MmTq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 424w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 848w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 1272w, https://substackcdn.com/image/fetch/$s_!MmTq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a354560-7f5f-4f75-b614-afb34ecc3894_1288x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>The performance benefits of Rubicon-preview are also achieved with shocking sample efficiency&#8212;<em>the model is only trained on ~5K data samples</em>. By using an RaR approach, authors are also able to granularly control the style or voice of the resulting model. More specifically, a few case studies are presented in [2] that demonstrate the use of rubrics to guide the LLM away from the didactic tone that is common of chatbots and towards a human-like tone with more emotion. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CPpf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CPpf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 424w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 848w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 1272w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CPpf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png" width="1270" height="206" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:206,&quot;width&quot;:1270,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CPpf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 424w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 848w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 1272w, https://substackcdn.com/image/fetch/$s_!CPpf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2941e56b-30c0-4ce1-9558-f7d8e92323e5_1270x206.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Going further, creatively-oriented RaR training does not seem to damage the LLM&#8217;s general capabilities. As shown above, Rubicon-preview performs on par with or better than the original base model across a wide scope of benchmarks. Such a result should not come as a surprise given the natural ability of RL to avoid forgetting and retain the prior knowledge or skills of an LLM; see <a href="https://cameronrwolfe.substack.com/p/rl-continual-learning">here</a>.</p><h4><strong><a href="https://arxiv.org/abs/2510.07743">OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment</a> [3]</strong></h4><p>We&#8217;ve seen several papers that study the use of rubrics for RL training, where rubrics are generated&#8212;<em>possibly with human intervention</em>&#8212;and evaluated by an off-the-shelf LLM. Instead of focusing upon the downstream application of rubrics in RL, authors in [3] specifically analyze the rubric generation and evaluation process. To facilitate this study, an open dataset of prompt-rubric pairs, called <a href="https://huggingface.co/datasets/OpenRubrics/OpenRubrics">OpenRubrics</a>, is created for training both rubric generation models and rubric-based reward models. As we learned in [2], RaR training is highly dependent upon rubric quality. Creating better rubrics&#8212;<em>and reducing the amount of human supervision in this process</em>&#8212;makes RaR training more scalable and effective.</p><p>The <strong>rubric structure</strong> used in [3] is consistent with prior work. Namely, each rubric is comprised of <code>K</code> criteria, where each criterion is a rubric description that specifies one aspect of response quality. Two types of criteria are considered:</p><ol><li><p><em>Hard rules</em>: explicit or objective constraints (e.g., length or correctness).</p></li><li><p><em>Principles:</em> higher-level qualitative aspects (e.g., reasoning soundness, factuality, or stylistic coherence).</p></li></ol><p>Unlike prior work, rubrics in [3] do not use per-criterion weights and are used for pairwise comparison of two completions&#8212;<em>as opposed to direct assessment</em>. For a rubric <code>R = {c_1, &#8230;, c_K}</code> and two responses <code>y_1</code> and <code>y_2</code> to the same prompt <code>x</code>, we want our rubric-based reward model to provide a binary preference label (i.e., <code>y_1 &gt; y_2</code> or <code>y_1 &lt; y_2</code>) by reasoning over the rubric criteria.</p><div class="pullquote"><p>&#8220;We prompt the LLM to generate two complementary types of rubrics: hard rules, which capture explicit and objective constraints specified in the prompt, and principles, which summarize implicit and generalizable qualities of strong responses. This design allows the rubrics to capture both surface-level requirements and deeper dimensions of quality. Although hard rules are typically straightforward to extract, the principles are more subtle and require fine-grained reasoning.&#8221; - from [3]</p></div><p><strong>Building OpenRubrics.</strong> The prompts and preference labels used for creating OpenRubrics are sourced from several public datasets (e.g., <a href="https://huggingface.co/datasets/openbmb/UltraFeedback">UltraFeedback</a>, <a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience</a>, <a href="https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT">Medical-o1</a>, instruction following data from <a href="https://arxiv.org/abs/2411.15124">Tulu-3</a>, and more). For each of these datasets, preference data is obtained via domain-specific post-processing of the existing data. For example, the highest and lowest scoring responses form a preference pair for UltraFeedback, while for MegaScience and Medical-o1 completions are generated with a pool of LLMs and scored via a jury of different reward models to obtain preference pairs; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xwPa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xwPa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 424w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 848w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 1272w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xwPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png" width="1456" height="443" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:343481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xwPa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 424w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 848w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 1272w, https://substackcdn.com/image/fetch/$s_!xwPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08c12e30-3151-4d3c-b642-ec4bbba84625_2386x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Once this preference data is available, rubrics are generated using two key strategies proposed in [3] (shown above):</p><ol><li><p><em>Contrastive Rubric Generation (CRG)</em>: an instruction-tuned LLM is provided both a prompt and a preference pair and asked to produce discriminative evaluation criteria by contrasting the chosen and rejected responses.</p></li><li><p><em>Rubric Filtering</em>: rubrics are filtered by prompting an LLM to choose the preferred response given a preference pair and rubric as input and only retaining rubrics that yield agreement with human-provided preference labels (i.e., preference label consistency)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. </p></li></ol><p>CRG and rubric filtering aim to create rubrics that are both prompt-specific and aligned with human preference examples, <em>allowing them to serve as useful anchors for reward modeling</em>. The result of this rubric generation and filtering approach is OpenRubrics, the key statistics of which are summarized in the plots below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U2Ni!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U2Ni!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 424w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 848w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 1272w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U2Ni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af55c52b-1691-4738-b306-c5a019a92acb_1564x895.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:293426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U2Ni!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 424w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 848w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 1272w, https://substackcdn.com/image/fetch/$s_!U2Ni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf55c52b-1691-4738-b306-c5a019a92acb_1564x895.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><blockquote><p><em>&#8220;After collecting the rubrics-based dataset, we proceed to develop a rubric generation model that outputs evaluation rubrics and a reward model Rubric-RM that generates final preference labels.&#8221;</em> - from [3]</p></blockquote><p><strong>Rubric-RM.</strong> OpenRubrics provides a high-quality dataset of preference pairs and rubrics. In [3], this data is used to train two kinds of models (both of which are based upon <a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen-3-4B</a> or <a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen-3-8B</a>):</p><ol><li><p>A rubric generation model&#8212;<em>trained via SFT</em>&#8212;that, given a prompt, can produce a discriminative rubric for predicting preference labels.</p></li><li><p>A reward model&#8212;<em>also trained via SFT</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>&#8212;called Rubric-RM that can predict rubric-guided, pairwise preferences. </p></li></ol><p>At inference time, these two models are used in tandem. Given a prompt, we first use the rubric generation model to produce our rubric. Then, Rubric-RM ingests this rubric, the prompt, and a pair of completions to generate a final preference prediction. We can also use majority voting (i.e., running this pipeline several times and taking the most frequently outputted score) to improve accuracy. Although using a two-stage pipeline increases inference costs, authors mention that costs can be decreased significantly by caching generated rubrics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pSGK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pSGK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 424w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 848w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 1272w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pSGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png" width="1456" height="747" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:747,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:251475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pSGK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 424w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 848w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 1272w, https://substackcdn.com/image/fetch/$s_!pSGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13911e95-b761-4a5f-8695-3e28d00ef417_1582x812.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Comparison to other reward models.</strong> Rubric-RM is compared to a wide variety of other reward models and LLM-as-a-Judge approaches on several key evaluation benchmarks; see above. Rubric-RM tends to outperform similarly-sized baselines; e.g., the 8B variant gets 70.1% average accuracy, whereas the strongest 7B-scale reward model (RM-R1-7B) has an average accuracy of only 61.7%. These results are made even stronger with the use of majority voting. When comparing to the Qwen-3 base models, we see a noticeable uplift in preference scoring accuracy for Rubric-RM, highlighting the effectiveness of the finetuning strategy in [3].</p><blockquote><p><em>&#8220;Rubric-RM excels on benchmarks requiring fine-grained instruction adherence&#8230; This demonstrates that rubrics capture nuanced constraints better than scalar reward models.&#8221;</em> - from [3]</p></blockquote><p>The gains from Rubric-RM are most pronounced on instruction-following tasks, which means that the rubrics in [3] work well for explicit evaluation criteria. On the other hand, this finding indicates less impact for subjective criteria, <em>revealing that improving rubric supervision for open-ended tasks is still an open problem</em>. </p><p><strong>Application to post-training.</strong> Beyond evaluating Rubric-RM on reward modeling benchmarks, we can also measure the model&#8217;s downstream impact by using it as a reward signal in LLM post-training. Downstream evaluations in [3] only consider instruction following tasks (i.e., <a href="https://arxiv.org/abs/2311.07911">IFEval</a>, <a href="https://arxiv.org/abs/2401.03601">InfoBench</a>, and <a href="https://arxiv.org/abs/2507.02833">IFBench</a>)&#8212;<em>likely because this is the domain on which Rubric-RM excels</em>&#8212;and use DPO for preference tuning. Rubric-RM is found to yield a boost over other reward models; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ob_C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ob_C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 424w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 848w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ob_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png" width="1456" height="666" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fba83fba-5066-40a3-9d65-17628483294a_1578x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:666,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221490,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ob_C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 424w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 848w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 1272w, https://substackcdn.com/image/fetch/$s_!ob_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffba83fba-5066-40a3-9d65-17628483294a_1578x722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2511.19399">DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</a> [4]</strong></h4><blockquote><p><em>&#8220;Deep research (DR) models aim to produce in-depth, well-attributed answers to complex research tasks by planning, searching, and synthesizing information from diverse sources&#8221; </em>- from [4]</p></blockquote><p>Rubrics are studied in the context of deep research (DR) agents in [4]. A DR agent is an LLM that is taught to perform multi-step research and produce long-form answers&#8212;<em>or surveys</em>&#8212;that answer a query with detailed information and citations. This idea was popularized by <a href="https://blog.google/products-and-platforms/products/gemini/google-gemini-deep-research/">Gemini DR</a> and followed shortly after by DR agents from <a href="https://openai.com/index/introducing-deep-research/">OpenAI</a>, <a href="https://www.anthropic.com/engineering/multi-agent-research-system">Anthropic</a>, and more. Though many closed models support DR mode, open models are behind in this area: <em>most open DR models are either prompt-based or trained on short-form, search-intensive QA tasks (i.e., not reflective of frontier DR agents) with RLVR</em>. To solve this, authors in [4] train Dr. Tulu-8B&#8212;<em>a fully-open</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><em> LLM agent for long-form, open-ended DR tasks</em>&#8212;using a novel online RL technique that evolves instance-level rubrics alongside the policy throughout training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8OAX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8OAX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 424w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 848w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8OAX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png" width="1456" height="1065" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1065,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:910914,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8OAX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 424w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 848w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 1272w, https://substackcdn.com/image/fetch/$s_!8OAX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36d75c42-fc6d-4268-9678-3f28532f3bef_2018x1476.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>Definition of DR.</strong> Before describing Dr. Tulu, we need to understand the basic mechanics of DR agents. Details of closed DR agents are not publicly disclosed, but we can discern from using these agents that they:</p><ol><li><p>Heavily rely on search tools to ground their answers in external knowledge.</p></li><li><p>Output long answers (i.e., basically survey papers) with many citations.</p></li></ol><p>Authors in [4] use these observations to formalize an action space for DR agents; see below. In this formulation, a DR agent has the ability to <em>i)</em> think, <em>ii)</em> call a set of search tools, <em>iii)</em> provide a final answer, and <em>iv)</em> insert citations into the final answer. For all actions, any context that is output (e.g., thinking traces or tool outputs) is just concatenated to the sequence being processed by the DR agent. The DR agent itself is just an LLM that performs <a href="https://cameronrwolfe.substack.com/p/teaching-language-models-to-use-tools">tool use</a> in this action space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D4L9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D4L9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 424w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 848w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 1272w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D4L9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png" width="1456" height="409" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D4L9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 424w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 848w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 1272w, https://substackcdn.com/image/fetch/$s_!D4L9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2189065-9373-46ac-a9d7-4e9ab566a57f_2266x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>Rubrics for DR.</strong> Evaluating a DR agent is a tough task. These agents generate lengthy outputs with detailed information, so there are many ways that an output could be good or bad&#8212;<em>a static or predefined set of rubrics will not capture the detailed quality dimensions required for this task.</em> Additionally, evaluation varies depending on the query (e.g., asking for a vacation plan versus an AI research survey). </p><p>Given that most DR queries are knowledge-intensive, we must also verify key information against known world knowledge. For this reason, synthetically generating instance-specific rubrics with an LLM&#8212;<em>as in [1, 3]</em>&#8212;is insufficient. This approach relies upon the parametric knowledge of the LLM rather than grounding on external knowledge that can be used to verify correctness. Ideally, we should ground the evaluation process in knowledge retrieved via search tools rather than relying on the (incomplete) parametric knowledge of an LLM. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rj--!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rj--!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 424w, https://substackcdn.com/image/fetch/$s_!rj--!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 848w, https://substackcdn.com/image/fetch/$s_!rj--!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!rj--!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rj--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png" width="1456" height="681" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1010811,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rj--!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 424w, https://substackcdn.com/image/fetch/$s_!rj--!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 848w, https://substackcdn.com/image/fetch/$s_!rj--!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!rj--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F793e6538-cd2e-4e52-8fd6-d41fd544c6af_2394x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>Evolving rubrics.</strong> To address the unique considerations of DR tasks, Dr. Tulu is trained using a modified rubric-based RL technique, called Reinforcement Learning with Evolving Rubrics (RLER), that derives a reward from instance-specific rubrics that <em>i)</em> evolve alongside the policy during training and <em>ii)</em> are grounded in knowledge from the internet; see above. Similarly to prior work, rubrics are defined as a set of weighted criteria. Each of these criteria can be scored with a separate LLM judge to derive a final score as shown below. This formulation matches the explicit aggregation strategy proposed in [1]. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qMID!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qMID!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 424w, https://substackcdn.com/image/fetch/$s_!qMID!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 848w, https://substackcdn.com/image/fetch/$s_!qMID!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 1272w, https://substackcdn.com/image/fetch/$s_!qMID!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qMID!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png" width="475" height="229.67032967032966" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a24602f-f704-4f85-889a-fe212299727f_1966x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:475,&quot;bytes&quot;:229309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qMID!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 424w, https://substackcdn.com/image/fetch/$s_!qMID!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 848w, https://substackcdn.com/image/fetch/$s_!qMID!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 1272w, https://substackcdn.com/image/fetch/$s_!qMID!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a24602f-f704-4f85-889a-fe212299727f_1966x950.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>During training, we have a buffer of rubrics for each prompt that stores a set of evolving rubrics specific to that prompt. Within this buffer, we designate certain rubrics as active, and these active rubrics are used for deriving the reward in the current training iterations. To initialize the buffer, we first create a set of search-based rubrics using an LLM with access to search tools. These initial rubrics are used persistently&#8212;<em>meaning they are always included in the active set of rubrics</em>&#8212;throughout training. At each training step, we prompt an LLM to generate a set of new (or evolving) rubrics given a prompt, a group of corresponding rollouts, and the set of active rubrics for that prompt as context; see below. Specifically, there are two types of rubrics that can be created by the LLM:</p><ol><li><p><em>Positive Rubrics</em>: capture strengths of new relevant knowledge explored by the current policy but not yet present in any rubric.</p></li><li><p><em>Negative Rubrics</em>: address common undesirable behaviors of the current policy (e.g., reward hacking).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iVsF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iVsF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 424w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 848w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iVsF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png" width="2148" height="1406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1406,&quot;width&quot;:2148,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:986716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26588f08-4b55-4287-b3e7-142ba7835ed3_2148x1406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iVsF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 424w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 848w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 1272w, https://substackcdn.com/image/fetch/$s_!iVsF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74b511c4-0ca4-4892-9ab7-4c075a0b52d9_2148x1406.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prompt for generating evolving rubrics (from [4])</figcaption></figure></div><p>During RLER, the number of evolving rubrics can become large. To avoid this, we maintain a subset of active rubrics&#8212;<em>always containing the initial persistent rubrics</em>&#8212;via an explicit management strategy that filters and ranks rubrics based on their discriminative power. To measure a rubric&#8217;s discriminative power, we rely upon the group of completions created for advantage computation in GRPO. During each policy update, the group of completions for a given prompt is scored using all active rubrics for that prompt, and rubrics with zero reward variance (i.e., no discriminative value) are removed. Remaining rubrics are ranked in descending order based on the standard deviation of rewards across the group. Only rubrics with the top <code>K</code> standard deviations&#8212;<em>and persistent rubrics</em>&#8212;remain active. </p><div class="pullquote"><p>&#8220;Instead of trying to exhaustively enumerate all possible desiderata, our method generates rubrics tailored to the current policy model&#8217;s behaviors, offering on-policy feedback the model can effectively learn from. Furthermore, the rubrics are generated with retrieval, ensuring it can cover the needed knowledge to assess the generation.&#8221; - from [4]</p></div><p>The evolving rubrics in [4] are grounded in external knowledge and allow the reward for RL to adapt to the current state of our policy. As the model discovers new behaviors (e.g., a reward hack), these changes can be identified and captured in a new or modified rubric to maintain training fidelity. For this reason, we do not need to create a rubric a priori that exhaustively captures all desiderata for evaluation, <em>which is difficult for DR tasks</em>. Rather, this system can observe policy behavior and automatically incorporate key trends into new rubrics. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sLQ4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sLQ4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 424w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 848w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 1272w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sLQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png" width="1456" height="549" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:399470,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sLQ4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 424w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 848w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 1272w, https://substackcdn.com/image/fetch/$s_!sLQ4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb503197-6f95-40a8-9583-8b18b9891c95_2380x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>The rubric evolution process is found in [4] to have interesting characteristics, such as producing rubrics with measurably higher levels of specificity or even negative rubrics that penalize specific behaviors within the LLM; see above. </p><p><strong>Dr. Tulu-8B</strong> is trained using a two-stage approach that includes a cold start SFT phase and online RL with GRPO. The <a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen-3-8B</a> base model used in [4] does not yet possess the necessary atomic skillset (e.g., proper planning or citations) for solving DR tasks. If we were to begin RL training directly from this model, most rollouts would be of low quality, and the training process would likely struggle to efficiently discover high-reward solutions via exploration. To solve this issue, a cold start SFT phase is performed in [4] prior to RL training by sampling DR trajectories from a strong teacher model&#8212;<em>in this case GPT-5 with a detailed system prompt describing the DR task</em>&#8212;for supervised training. By finetuning the Qwen-3 base model on these trajectories, we allow the model to quickly learn a better initial policy for searching, planning, and citing sources prior to online RL. Given that most open DR agents are trained on short-form QA tasks, these supervised trajectories, which are <a href="https://huggingface.co/datasets/rl-research/dr-tulu-sft-data">openly available</a>, are by themselves a useful artifact.</p><p>After cold start SFT, we perform online RLER using GRPO (with <a href="https://cameronrwolfe.substack.com/i/181791956/dapo-an-open-source-llm-reinforcement-learning-system-at-scale-1">token-level loss aggregation</a>) as the RL optimizer. Efficiently generating rollouts for online RL with a DR agent is a non-trivial systems problem due to output length and the frequency of tool calls. Rollouts are already the largest bottleneck in RL. Adding tool calls into the mix (i.e., &#8220;agentic&#8221; rollouts) makes this problem even worse. To improve efficiency, authors in [4] use one-step asynchronous RL training. Rollout generation and policy updates are performed at the same time, but policy updates are performed on rollouts from the prior training step. Additionally, tool calls are executed immediately to overlap generation and tool calling as much as possible. </p><blockquote><p><em>&#8220;Tool requests are sent the second a given rollout triggers them, as opposed to waiting for the full batch to finish&#8230; Once a tool call is sent, we place that given generation request to sleep, allowing the inference engine to potentially continue to work on generating other responses while waiting for the tool response. This results in the generation and tool calling being overlapped wherever possible.&#8221;</em> - from [4]</p></blockquote><p>One other difficult aspect of RL training with a DR agent is the output lengths&#8212;<em>generating long outputs (obviously) increases the time taken to produce a rollout</em>. Plus, there can be high variance in output lengths. To mitigate this issue, <a href="https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#attention">sample packing</a> is adopted during RL training, which improves efficiency by combining multiple outputs into a single, fixed length sequence. Finally, a few additional sources of heuristic rewards are used on top of RLER to encourage correct formatting and sufficient usage of search and citation tools by the agent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hAdM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hAdM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 424w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 848w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 1272w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hAdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png" width="1456" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:344346,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hAdM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 424w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 848w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 1272w, https://substackcdn.com/image/fetch/$s_!hAdM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2435b96a-7d03-4623-b862-79afaa5862c3_2141x805.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>Performance and efficiency.</strong> Dr. Tulu-8B is evaluated on several DR benchmarks (<a href="https://arxiv.org/abs/2504.10861">ScholarQA</a>, <a href="https://openai.com/index/healthbench/">HealthBench</a>, <a href="https://arxiv.org/abs/2509.00496">ResearchQA</a>, and <a href="https://arxiv.org/abs/2506.11763">DeepResearchBench</a>), where we see that it substantially outperforms other open DR agents&#8212;<em>even those that are larger (e.g., <a href="https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B">Tongyi-DR-30B-A3B</a>)</em>&#8212;and frequently matches the performance of the top proprietary systems. Additionally, Dr. Tulu-8B is smaller and cheaper compared to other systems. Notably, Dr. Tulu-8B is up to three orders of magnitude cheaper than OpenAI DR in some cases; e.g., costs are reduced from $1.80 per query to $0.0019 per query on ScholarQA<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. Much of these savings come from the ability to call the correct tools and avoid excessive tool usage that drastically increases API costs. Not only does Dr. Tulu-8B generally make fewer tool calls, but authors observe in [4] that the model heavily calls free paper search tools for academic benchmarks while only using paid web search tools for more general queries. </p><h4><strong><a href="https://arxiv.org/abs/2602.01511">Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training</a> [5]</strong></h4><p>Rubrics are helpful for performing granular evaluation, assuming that the rubric we are using is of high quality. To curate a high-quality rubric, we rely upon human annotators or synthetic generation. Relying on human oversight would make it difficult to scale rubric curation. On the other hand, synthetic rubrics are scalable, but static models are often used to generate and evaluate these rubrics, which limits adaptation to new domains. To make this process more dynamic, a joint training procedure for rubric generation and evaluation is proposed in [5].</p><blockquote><p><em>&#8220;Rubric-ARM [is] a framework that jointly optimizes a rubric generator and a judge using RL from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates.&#8221;</em> - from [5]</p></blockquote><p><strong>Rubric-ARM.</strong> There are two models being trained in this framework: <em>a rubric generator and an LLM judge. </em>These models are trained with an alternating RL framework that switches between training each model. This approach, called Rubric-ARM, jointly optimizes the generator&#8217;s ability to create a rubric and the judge&#8217;s ability to predict human-aligned preference scores given a rubric as input. By learning these components together (i.e., instead of using separate training pipelines), <em>we allow them to co-evolve and reinforce each other throughout training</em>. </p><p>A rubric is defined in [5] as a set of evaluation criteria that are conditionally generated given a prompt as input&#8212;<em>no explicit per-criterion weights are defined</em>. Given a rubric sampled from the rubric generator, the objective&#8212;<em>for both the rubric generator and the judge</em>&#8212;is to maximize the preference accuracy of scores output by the judge. Notably, Rubric-ARM only considers preference data. The LLM judge is trained to predict a preference label (i.e., instead of performing direct assessment) given a prompt and two possible completions as input. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4zwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4zwp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 424w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 848w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 1272w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4zwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png" width="626" height="351.2651098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:383981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4zwp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 424w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 848w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 1272w, https://substackcdn.com/image/fetch/$s_!4zwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe396bca7-e97e-40e9-8332-ba003b1e2d39_2505x1405.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Training pipeline.</strong> Prior to RL, Rubric-ARM performs a cold-start SFT phase that trains both the rubric generator and the judge over a synthetic dataset curated from a variety of open data sources (e.g., <a href="https://arxiv.org/abs/2310.01377">UltraFeedback</a>, <a href="https://arxiv.org/abs/2506.20737">Magpie</a>, and more). From here, we begin the alternating RL procedure that switches between training the rubric generator or judge while keeping the other fixed. Alternating the learning process gives each component a clear training signal by keeping the other fixed.</p><div class="pullquote"><p><em>&#8220;To ensure stable joint optimization, Rubric-ARM employs an alternating training strategy that decouples the learning dynamics while preserving a shared objective. Training alternates between (i) optimizing the reward model with a fixed rubric generator to align with target preference labels, and (ii) optimizing the rubric generator with a fixed reward model to produce discriminative rubrics that maximize prediction accuracy.&#8221; - from [5]</em></p></div><p>At each training iteration <code>t</code>, we sample a batch of preference data. A rubric is then sampled&#8212;<em>and cached for future use</em>&#8212;with the rubric generator for each prompt in the batch. First, the rubric generator is kept fixed, and we perform RL training (with GRPO) to update the judge. The reward is defined as a sum of:</p><ul><li><p><em>Preference accuracy</em>: a binary score indicating whether the predicted label matches the ground-truth label.</p></li><li><p><em>Correct formatting</em>: a heuristic that checks the judge&#8217;s trajectory for expected components (i.e., addressing each rubric criterion, providing per-criterion explanations, and finishing with an overall justification and decision). </p></li></ul><p>Rubrics are generally sampled once and used for multiple judge optimization steps. After training the judge, we then freeze the judge&#8217;s weights and update the rubric generator. The rubrics used during this phase are cached, as the rubric generator was not trained during the prior phase. To train the rubric generator, we only use a preference accuracy reward based on whether the fixed judge is able to predict a correct preference label given the generated rubric. We learn from experiments in [5] that the optimization order is important. Training the rubric generator before the judge leads to noticeably degraded performance. </p><blockquote><p><em>&#8220;Early-stage exploration by the rubric generator can dominate the learning dynamics. To mitigate this, we first stabilize the reward model under fixed rubrics before optimizing the rubric generator. This alternating schedule reduces variance and ensures robust optimization.&#8221;</em> - from [5]</p></blockquote><p><strong>Application to post-training.</strong> The rubric generator and judge obtained from Rubric-ARM can also be applied to LLM post-training. Beginning with a set of prompts, we do the following:</p><ol><li><p>Sample a rubric for each prompt with the rubric generator.</p></li><li><p>Sample two completions for each prompt using our current policy.</p></li><li><p>Score the completions using the judge with the above rubric<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. </p></li><li><p>Perform <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">DPO</a> using preference data created with the above steps.</p></li></ol><p>We are not restricted to offline training either! The above steps can easily be generalized to a <a href="https://cameronrwolfe.substack.com/i/169926007/direct-alignment-techniques">semi-online DPO setup</a> by regularly sampling new, on-policy completions and performing DPO training in phases to increase the freshness of preference data. We can even perform fully-online RL by modifying the above steps with a pairwise RL approach [6]. More specifically, we do the following for each prompt: </p><ol><li><p>Sample a deterministic (baseline) completion with greedy decoding.</p></li><li><p>Sample a group of rollouts using a normal sampling procedure.</p></li></ol><p>Once we have these completions, we use them to derive a direct assessment reward from the pairwise comparisons predicted by the LLM judge. To do this, Rubric-ARM creates preference pairs between each rollout in the group and the baseline completion. Then, our reward is defined as whether Rubric-ARM correctly predicts the greedy baseline as the rejected completion; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_coW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_coW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 424w, https://substackcdn.com/image/fetch/$s_!_coW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 848w, https://substackcdn.com/image/fetch/$s_!_coW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 1272w, https://substackcdn.com/image/fetch/$s_!_coW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_coW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png" width="430" height="436.2268704746581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1261,&quot;width&quot;:1243,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:238032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_coW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 424w, https://substackcdn.com/image/fetch/$s_!_coW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 848w, https://substackcdn.com/image/fetch/$s_!_coW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 1272w, https://substackcdn.com/image/fetch/$s_!_coW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d0bf86d-09aa-41c3-b88e-79f0368c8d26_1243x1261.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing a reward for online RL from pairwise preferences (from [5])</figcaption></figure></div><blockquote><p><em>&#8220;Rubric-ARM outperforms strong reasoning-based judges and prior rubric-based reward models, achieving a +4.7% average gain on reward-modeling benchmarks, and consistently improves downstream policy post-training when used as the reward signal.&#8221; </em>- from [5]</p></blockquote><p><strong>How does this perform?</strong> Rubric-ARM is trained on the general-domain portion of OpenRubrics [3]. Both the rubric generator and LLM judge use <a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen-3-8B</a> as a base model, and a two-stage rubric judging process&#8212;<em>including generating and evaluating the rubric</em>&#8212;is used at inference time. Rubric-ARM is compared to several open and closed LLM judges, as well as an SFT baseline trained on the same data (i.e., the Rubric-RM model [3]). Metrics on a wide variety of alignment-related reward modeling benchmarks are provided below. As we can see, Rubric-ARM outperforms all other open models and matches or exceeds the performance of most closed judges. Additionally, Rubric-ARM improves the performance of the SFT baseline by 4.8% absolute, indicating that alternating RL is helpful for discovering more discriminative rubrics and improving judge performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dES!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dES!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 424w, https://substackcdn.com/image/fetch/$s_!7dES!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 848w, https://substackcdn.com/image/fetch/$s_!7dES!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!7dES!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dES!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png" width="1456" height="853" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:449757,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dES!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 424w, https://substackcdn.com/image/fetch/$s_!7dES!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 848w, https://substackcdn.com/image/fetch/$s_!7dES!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!7dES!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bd1c7d2-0698-4ba3-9b69-ff7e2c9bcc2b_2280x1336.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>Rubric-ARM is also tested on <a href="https://writingpreferencebench.github.io/">WritingPreferenceBench</a>, an out-of-distribution benchmark, where we see that the system generalizes well to other domains and continues to outperform baselines even on a very open-ended task (i.e., creative writing). Authors also run several ablation experiments, where we learn that:</p><ul><li><p>The optimization order for alternating RL is important; i.e., training the rubric generator first (instead of the judge) degrades preference accuracy by 2.4% with the largest regressions seen on instruction-following tasks.</p></li><li><p>Removing the format reward used for the judge is harmful; i.e., LLM judges trained with only correctness rewards perform 2.2% worse than those trained on a combination of correctness and format rewards. </p></li></ul><p>Similar results hold true when Rubric-ARM is used for LLM post-training. Rubric-ARM yields a boost in policy performance in both online and offline alignment scenarios, and policies trained with Rubric-ARM outperform those trained with other open models. Of the methods that are considered, iterative DPO with Rubric-ARM yields the best results, indicating that Rubric-ARM excels in creating high-quality preference data for LLM post-training; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7wvl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7wvl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 424w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 848w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 1272w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7wvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png" width="1456" height="802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:802,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355784,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7wvl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 424w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 848w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 1272w, https://substackcdn.com/image/fetch/$s_!7wvl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb25d8ac1-08f8-434c-8bf4-d2fb72c92e16_1532x844.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><h4>Further Reading</h4><p>Although we have already covered a variety of papers, RaR<strong> </strong>is a particularly active and popular topic. To give a more comprehensive picture of the current research landscape, we close with high-level summaries of several more related works.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cxW-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cxW-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 424w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 848w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cxW-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png" width="1456" height="689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:422387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cxW-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 424w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 848w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 1272w, https://substackcdn.com/image/fetch/$s_!cxW-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadaa6c51-0537-46cb-8c2f-232f2b30cea5_2472x1170.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>RL from Checklist Feedback (RLCF) [8]</strong> proposes a rubric-based approach for aligning language models to follow complex instructions. Instead of deriving rewards from a reward model trained on a static preference dataset, RLCF uses an LLM to generate instruction-specific checklists that outline the requirements of the instruction as a series of itemized steps. Each component of the checklist is an objective yes or no question that can be evaluated to derive a reward signal. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!93bd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!93bd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 424w, https://substackcdn.com/image/fetch/$s_!93bd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 848w, https://substackcdn.com/image/fetch/$s_!93bd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 1272w, https://substackcdn.com/image/fetch/$s_!93bd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!93bd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png" width="1456" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!93bd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 424w, https://substackcdn.com/image/fetch/$s_!93bd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 848w, https://substackcdn.com/image/fetch/$s_!93bd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 1272w, https://substackcdn.com/image/fetch/$s_!93bd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40f57385-3aef-4efc-8d9b-37f83c89a29c_1710x653.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p><strong>Rule-based rewards [9]</strong> propose an approach to LLM safety alignment that derives a reward signal from an explicit set of rules. Safety alignment is usually handled via RLHF-style preference tuning. However, this process requires collecting preference data, which is expensive, scales poorly as requirements evolve, and offers limited fine-grained control. As an alternative, the authors in [9] explore a hybrid setup in which an LLM evaluates responses against a specified set of safety rules, enabling fine-grained control over refusals and other safety-related behavior. This rule-based reward model is combined with a standard reward model for general helpfulness, allowing the model to undergo a standard alignment procedure with rule-based rewards guiding safety behavior.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JSND!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JSND!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 424w, https://substackcdn.com/image/fetch/$s_!JSND!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 848w, https://substackcdn.com/image/fetch/$s_!JSND!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!JSND!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JSND!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png" width="344" height="491.42857142857144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1520,&quot;width&quot;:1064,&quot;resizeWidth&quot;:344,&quot;bytes&quot;:486986,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JSND!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 424w, https://substackcdn.com/image/fetch/$s_!JSND!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 848w, https://substackcdn.com/image/fetch/$s_!JSND!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 1272w, https://substackcdn.com/image/fetch/$s_!JSND!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0213a4-0890-4026-afa7-6e18be32f74d_1064x1520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p><strong>Context-Aware Reward Modeling (CARMO) [10]</strong> attempts to mitigate problems with reward hacking in human preference alignment with RLHF. Going beyond static evaluation rubrics, an LLM first dynamically generates evaluation criteria for each prompt. Then, these criteria are used by the LLM to score the response, and the score can be directly used as a reward signal for preference alignment. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ktZr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ktZr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 424w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 848w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ktZr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png" width="1456" height="876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:477049,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ktZr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 424w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 848w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!ktZr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c53199e-bb4e-4384-a1ec-0766622dfcf9_1848x1112.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p><strong>Reinforcement Learning with Adversarial Critic (RLAC) [11]</strong> proposes an adversarial approach for training LLMs on open-ended generation tasks. This framework has three components:</p><ul><li><p><em>Generator</em>: the LLM being trained.</p></li><li><p><em>Critic</em>: another LLM that identifies potential failure modes.</p></li><li><p><em>Validator</em>: a domain-specific verification tool. </p></li></ul><p>For each prompt, the generator produces multiple outputs, the critic proposes validation criteria&#8212;<em>or a rubric</em>&#8212;for each output, and the validator provides binary feedback based on correctness. Preference pairs can be formed between outputs that are validated and those that fail, naturally providing data to update the generator with DPO. At the same time, the critic is actively trained to identify criteria that the generator is unable to satisfy. This creates a dynamic in which the generator constantly improves its outputs as the critic finds weaknesses. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fjct!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fjct!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 424w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 848w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 1272w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fjct!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png" width="1456" height="960" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:489458,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/186046978?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fjct!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 424w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 848w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 1272w, https://substackcdn.com/image/fetch/$s_!Fjct!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1fa7955-0932-4f88-861c-f54cf1afe289_2006x1322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p><strong>Auto-Rubric [12]</strong> aims to avoid the need for extensive preference data collection in LLM alignment by extracting generalizable evaluation rubrics from a minimal amount of data with a training-free approach. These rubrics are transparent and interpretable, unlike standard reward models that are trained over large volumes of preference data. To derive these rubrics, authors adopt a two-stage approach:</p><ol><li><p><em>Query-Specific Rubric Generation</em> focuses on creating rubrics that agree with observed preference data. After proposing an initial rubric set, we can check whether these rubrics yield correct preference scores and, if not, propose a set of revisions to derive an improved rubric set. This process repeats until the rubrics correctly predict human preference labels. </p></li><li><p><em>Query-Agnostic Rubric Aggregation</em> eliminates redundancy and unnecessary complexity in the resulting rubric set. With an information-theoretic approach, the rubric set is narrowed to a subset of rubrics that maximize evaluation diversity without introducing redundancy. </p></li></ol><p>Using this approach, Auto-Rubric can extract underlying general principles from preference data, allowing smaller LLMs to outperform large and specialized LLMs on reward modeling benchmarks with minimal training data.</p><h2>Conclusion</h2><p>Rubrics decompose desired LLM behavior into self-contained criteria that an LLM judge can score and then aggregate into an overall evaluation or reward. Put simply, rubrics are a practical middle ground between deterministic verifiers and preference labels that allow us to extend RLVR beyond verifiable domains while retaining granular control over output quality. The work we have studied suggests rubric rewards are most reliable when criteria are specific (often instance-level), grounded (via references or retrieval), and carefully curated (usually with human oversight). In more advanced setups, rubrics can also be updated based on on-policy behavior, <em>allowing the rubric to adapt instead of becoming stale or exploitable</em>. Despite promising results, key challenges remain; e.g., reducing reliance on human supervision and improving robustness in highly subjective domains. As reasoning models and LLM judges become more capable, however, rubric-based RL is becoming a viable and general tool across a wider variety of domains. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Gunjal, Anisha, et al. &#8220;Rubrics as rewards: Reinforcement learning beyond verifiable domains.&#8221; <em>arXiv preprint arXiv:2507.17746</em> (2025).</p><p>[2] Huang, Zenan, et al. &#8220;Reinforcement learning with rubric anchors.&#8221; <em>arXiv preprint arXiv:2508.12790</em> (2025).</p><p>[3] Liu, Tianci, et al. &#8220;Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.&#8221; <em>arXiv preprint arXiv:2510.07743</em> (2025).</p><p>[4] Shao, Rulin, et al. &#8220;Dr tulu: Reinforcement learning with evolving rubrics for deep research.&#8221; <em>arXiv preprint arXiv:2511.19399</em> (2025).</p><p>[5] Xu, Ran, et al. &#8220;Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training.&#8221; <em>arXiv preprint arXiv:2602.01511</em> (2026).</p><p>[6] Xu, Wenyuan, et al. &#8220;A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization.&#8221; <em>arXiv preprint arXiv:2504.04950</em> (2025).</p><p>[7] Zheng, Lianmin, et al. &#8220;Judging llm-as-a-judge with mt-bench and chatbot arena.&#8221; <em>Advances in neural information processing systems</em> 36 (2023): 46595-46623.</p><p>[8] Viswanathan, Vijay, et al. &#8220;Checklists are better than reward models for aligning language models.&#8221; <em>arXiv preprint arXiv:2507.18624</em> (2025).</p><p>[9] Mu, Tong, et al. &#8220;Rule based rewards for language model safety.&#8221; <em>Advances in Neural Information Processing Systems</em> 37 (2024): 108877-108901.</p><p>[10] Gupta, Taneesh, et al. &#8220;CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling.&#8221; <em>Findings of the Association for Computational Linguistics: ACL 2025</em>. 2025.</p><p>[11] Wu, Mian, et al. &#8220;Rlac: Reinforcement learning with adversarial critic for free-form generation tasks.&#8221; <em>arXiv preprint arXiv:2511.01758</em> (2025).</p><p>[12] Xie, Lipeng, et al. &#8220;Auto-rubric: Learning to extract generalizable criteria for reward modeling.&#8221; <em>arXiv preprint arXiv:2510.17314</em> (2025).</p><p>[13] Bai, Yuntao, et al. &#8220;Constitutional ai: Harmlessness from ai feedback.&#8221; <em>arXiv preprint arXiv:2212.08073</em> (2022).</p><p>[14] Guan, Melody Y., et al. &#8220;Deliberative alignment: Reasoning enables safer language models.&#8221; <em>arXiv preprint arXiv:2412.16339</em> (2024).</p><p>[15] Liu, Yang, et al. &#8220;G-eval: NLG evaluation using gpt-4 with better human alignment.&#8221; <em>arXiv preprint arXiv:2303.16634</em> (2023).</p><p>[16] Arora, Rahul K., et al. &#8220;Healthbench: Evaluating large language models towards improved human health.&#8221; <em>arXiv preprint arXiv:2505.08775</em> (2025).</p><p>[17] Deshpande, Kaustubh, et al. &#8220;Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms.&#8221; <em>Findings of the Association for Computational Linguistics: ACL 2025</em>. 2025.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Notably, this need to create ground truth labels for verification means that RLVR is still dependent upon access to validated data!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The numerical weights used for categories of importance in [1] are as follows: <code>{Essential: 1.0, Important: 0.7, Optional: 0.3, Pitfall: 0.9}</code></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Offline difficulty filtering is a popular approach used by papers like <a href="https://cameronrwolfe.substack.com/i/181791956/dapo-an-open-source-llm-reinforcement-learning-system-at-scale-1">DAPO</a> (in the form of dynamic sampling) or <a href="https://cameronrwolfe.substack.com/i/179769076/rlvr-with-grpo">Olmo 3</a>, which uses a nearly identical technique. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>In particular, running RL for a very long time allows the model to continue exploring and (eventually) find an exploit to hack the neural reward model. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is basically a form of <a href="https://rlhfbook.com/c/09-rejection-sampling">rejection sampling</a> that is anchored on human data!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In this case, the preference label is binary, so we can treat this as a next token prediction problem. For example, the reward model can predict a token of <code>0</code> or <code>1</code> to indicate its preference ranking. This is in contrast to the <a href="https://cameronrwolfe.substack.com/i/166169560/how-do-rms-work">standard definition of a reward model</a>, which uses a ranking loss for training. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>All <a href="https://github.com/rlresearch/dr-tulu">code</a>, <a href="https://huggingface.co/collections/rl-research/dr-tulu">data</a>, <a href="https://huggingface.co/collections/rl-research/dr-tulu">models</a>, and technical details are openly released for Dr. Tulu-8B, which is consistent with <a href="https://cameronrwolfe.substack.com/p/olmo-3">other fully-open releases from Ai2</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>These costs consider both hosting costs of the model on OpenRouter and the costs of any API calls made by the DR agent when generating its final answer. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>More specifically, authors in [5] score each example twice, where the order of completions are flipped when generating the two scores. Then, only data that yields the same score for both orderings is retained for training. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Continual Learning with RL for LLMs]]></title><description><![CDATA[Exploring the impressive continual learning capabilities of RL training...]]></description><link>https://cameronrwolfe.substack.com/p/rl-continual-learning</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/rl-continual-learning</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 26 Jan 2026 10:33:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3374fbcb-9fae-40e0-b756-fe0889f4aef4_2116x1183.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SF1W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SF1W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 424w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 848w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 1272w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SF1W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1497628,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SF1W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 424w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 848w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 1272w, https://substackcdn.com/image/fetch/$s_!SF1W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb30a35-5800-4256-b07b-21dce1b0af7e_2489x1391.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 3, 6, 11])</figcaption></figure></div><p>Continual learning, which refers to the ability of an AI model to learn from new tasks and data over time, has become a popular topic in the discussion of <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial General Intelligence (AGI)</a>. Put simply, general intelligence should be adaptable, which has led some to believe that continual learning abilities are a prerequisite for AGI. The reasoning behind this argument is clear&#8212;<em>dynamically adapting to arbitrary tasks (i.e., &#8220;on-the-job&#8221; learning) is a common trait of humans</em>&#8212;but rigorously studying this concept is hard. In the real world, continual learning is unstructured, noisy, and open-ended. In order to make meaningful progress, we must transform this complex process into a more structured empirical setting.</p><blockquote><p><em>&#8220;LLMs don&#8217;t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human&#8217;s. But there&#8217;s no way to give a model high level feedback. You&#8217;re stuck with the abilities you get out of the box.&#8221;</em> - <a href="https://www.dwarkesh.com/p/timelines-june-2025">Dwarkesh Patel</a></p></blockquote><p>To do this, we can pull from decades of prior research on the topic of continual learning for neural networks [10]. Although much of this work predates LLMs, such research provides a foundational understanding of continual learning and addresses key questions that are still relevant in the modern era:</p><ul><li><p>Why is continual learning difficult?</p></li><li><p>How should we structure continual learning experiments?</p></li><li><p>Which techniques are effective in practice?</p></li></ul><p>In this overview, we will bridge decades of continual learning research with more recent work on LLMs to develop a comprehensive perspective on the topic. While core concepts (e.g., catastrophic forgetting, experimental frameworks, method categories, etc.) carry over directly, continual learning for LLMs is unique because of scale. Even simple techniques become complex systems problems when considering the vast data and prior knowledge of modern LLMs. As we will learn, however, continual learning is not disjoint from current LLM research. Rather, existing post-training techniques&#8212;<em>especially on-policy reinforcement learning (RL)</em>&#8212;can naturally mitigate catastrophic forgetting, providing hope that continual learning is within reach given the current trajectory of LLM research.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 60,000 others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Basics of Continual Learning</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_YYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 424w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 848w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 1272w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_YYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png" width="1456" height="595" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:638447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_YYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 424w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 848w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 1272w, https://substackcdn.com/image/fetch/$s_!_YYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872d8f5-5798-4764-aac3-d6558dd69dad_2326x950.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLM training pipeline</figcaption></figure></div><p>The continual learning paradigm is starkly different from how neural networks are typically trained: <em>for several epochs over a large, fixed dataset</em>. Modern LLM training pipelines already include a mix of offline and more iterative components. Some stages (e.g., pretraining) closely resemble classical offline training, while others (e.g., iterative RLHF or <a href="https://cameronrwolfe.substack.com/p/online-rl">online RL</a>) begin to capture aspects of continual learning. In this section, we will develop a foundational understanding of continual learning&#8212;<em>how it is studied, common experimental frameworks, and the major categories of methods proposed for both LLMs and neural networks more broadl</em>y.</p><h4>Catastrophic Forgetting</h4><p>Historically, the difficulty of continual learning does not stem from a model&#8217;s inability to learn new tasks, but rather its tendency to degrade in performance on old tasks when training on new data. For example, running supervised training of an LLM over a new dataset will quickly enhance its in-domain performance. But, the same model may significantly deteriorate in its performance across general benchmarks or tasks that were observed previously in the training process.</p><div class="pullquote"><p>&#8220;Disruption of old knowledge by new learning is a recognized feature of connectionist models with distributed representations. However, the interference is sometimes described [as] mild or readily avoided. Perhaps for this reason, the interference phenomenon has received surprisingly little attention, and its implications for connectionist modeling of human cognition have not been systematically explored.&#8221; - from [10]</p></div><p>In continual learning research, this phenomenon is referred to as &#8220;catastrophic forgetting&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> [11]. Training our model over new data tends to come at the cost of a significant&#8212;<em>or catastrophic</em>&#8212;degradation in performance on other tasks. The goal of research in this area is, therefore, to mitigate catastrophic forgetting. The figure below helps us to better understand this phenomenon. Here, a model is initially trained on task <code>A</code> (grey) before being exposed to a new task (yellow).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OMBQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OMBQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 424w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 848w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 1272w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OMBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png" width="506" height="231.2578125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1024,&quot;resizeWidth&quot;:506,&quot;bytes&quot;:76652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdc3a93d-8c95-4dbd-aa8a-8fc4b1d1bd40_1024x468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OMBQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 424w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 848w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 1272w, https://substackcdn.com/image/fetch/$s_!OMBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8f8919a-4168-4d1f-bb04-0d2e5897d029_1024x468.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p>The three arrows in the figure depict three possible solutions that can emerge when trying to solve this continual learning problem. The red arrow depicts a solution that performs well on both tasks, while the blue and green arrows perform well on only the new task or neither task, respectively. Put simply, <em>the goal of continual learning is to develop techniques that reliably follow the red arrow</em>. More specifically, an effective continual learning system should both:</p><ol><li><p>Perform well on new tasks to which it is exposed.</p></li><li><p>Maintain comparable (or better) levels of performance on prior tasks.</p></li></ol><p>As we will see throughout this overview, these two objectives are usually at odds&#8212;<em>we are constantly balancing general capacities with new tasks</em>. Simply specializing our model to each new incoming task is not a valid approach because new tasks will always continue to emerge in a real-world setting. We must maintain the model&#8217;s generality while maximizing adaptability to arbitrary future tasks.</p><h4>Experimental Frameworks for Continual Learning</h4><p>There are many continual learning variants that have been studied in the literature; e.g., <a href="https://arxiv.org/abs/1706.08840">continual learning</a>, <a href="https://arxiv.org/abs/1611.06194">lifelong learning</a>, <a href="https://arxiv.org/abs/1611.07725">incremental learning</a>, <a href="https://arxiv.org/abs/2211.04624">streaming learning</a>, and more. Despite the many variants of continual learning that exist, all of these variants share the same sequential nature of the training process&#8212;<em>the model is exposed to new data over time and cannot return to data from the past (unless explicitly stored in a buffer) when learning from new data</em>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gfq6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gfq6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 424w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 848w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 1272w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gfq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png" width="484" height="220.60210526315788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf00ae2c-e93c-467a-af25-712826836cc9_950x433.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:950,&quot;resizeWidth&quot;:484,&quot;bytes&quot;:53094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb4451d1-c111-49de-beed-6cc6ed8a0884_950x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gfq6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 424w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 848w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 1272w, https://substackcdn.com/image/fetch/$s_!gfq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf00ae2c-e93c-467a-af25-712826836cc9_950x433.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p><strong>Non-IID data.</strong> First, we must consider the kind of data being exposed to our model. If the incremental data over which the model is trained is sampled from the model&#8217;s training distribution, then training on this data is unlikely to cause forgetting. This setup resembles a continued training approach, which is used frequently for LLM pre and post-training. However, if the incremental data is <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">non-IID</a>&#8212;<em>or sampled from a distribution that is new or different from the training data distribution</em>&#8212;then catastrophic forgetting becomes very likely; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IPQK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IPQK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 424w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 848w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 1272w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IPQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png" width="564" height="314.1510989010989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:564,&quot;bytes&quot;:127384,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IPQK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 424w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 848w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 1272w, https://substackcdn.com/image/fetch/$s_!IPQK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f29f73-d852-4686-a7e5-ba8ea232a3ac_1536x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For this reason, most experimental frameworks for continual learning assume the use of non-IID data. For example, when training an image classification model, we can derive incoming data from previously unseen classes. Similarly, we can continually train an LLM on an unseen task. In both cases, <em>we expose the model to an unseen or different distribution of data that can induce catastrophic forgetting</em>.</p><p><strong>Data increments.</strong> We now need to understand the different approaches for exposing data to the model during continual learning. The most common sequential learning setup is a batch-incremental learning approach, where entire batches of data are passed to the model sequentially. These batches can be arbitrarily large (e.g., an entire new dataset or task) and the model usually trains on each batch of data before moving on to the next batch; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GCGn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GCGn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 424w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 848w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 1272w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GCGn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png" width="1456" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6962d534-c2fa-47d8-98bd-880538464115_2274x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1159338,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GCGn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 424w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 848w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 1272w, https://substackcdn.com/image/fetch/$s_!GCGn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6962d534-c2fa-47d8-98bd-880538464115_2274x744.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Formally, we have a sequence of <code>T</code> tasks, each with an associated dataset or batch <code>{D_1, D_2, &#8230;, D_T}</code>. The model is sequentially trained on each task (i.e., one-by-one and in order), leading to a sequence of <code>T</code> models throughout the continual learning process. When training on a new task, we do not have access to prior tasks&#8217; data. The simplest variant of batch-incremental learning is a domain adaptation setup where <code>T = 1</code>. For this setup, a pretrained model is trained on data from only a single new domain. The goal of continual learning in this scenario is the same, but the model only undergoes one stage of adaptation. </p><p>The batch-incremental framework may not always be realistic, as our model may receive data in much smaller increments. For these cases, a streaming learning setup may be more appropriate. Streaming learning uses brief, online updates (i.e., one or a few forward and backward passes) for each piece of incoming data, forcing learning of new data to happen in real-time; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UxQS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UxQS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 424w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 848w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 1272w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UxQS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png" width="502" height="250.31043956043956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:1456,&quot;resizeWidth&quot;:502,&quot;bytes&quot;:110675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UxQS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 424w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 848w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 1272w, https://substackcdn.com/image/fetch/$s_!UxQS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6490dc56-1f92-4e73-bcf0-53b1cf36cfa1_1524x760.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Basic streaming learning setup</figcaption></figure></div><p>In contrast, batch-incremental learning setups usually perform a full, offline training procedure (i.e., several epochs of training) over each batch of incoming data. Although streaming and incremental learning setups are quite different, we can interpolate between these two approaches by:</p><ul><li><p>Changing the amount of data passed to the model at each phase of sequential learning (e.g., single example, batch of examples, entire dataset, etc.).</p></li><li><p>Restricting the number of model updates at each sequential learning phase (e.g., single update, multi-update, full epoch, multi-epoch, etc.).</p></li></ul><p><strong>Multi-task learning.</strong> In order to determine if a continual learning technique is performing well, we need a baseline to which our models can be compared. A common baseline is joint (multi-task) training, where the model has access to all <code>T</code> tasks and can perform offline training over all of the data. Joint training over all data is the best possible training setup and allows us to understand the ceiling in performance that we are aiming to match via continual learning. </p><p><strong>Which setup is best?</strong> In this overview, we will study a variety of continual learning papers in the LLM domain. Most of these papers adopt some variation of batch-incremental learning, where each batch is a new task that the LLM must learn. The domain-adaptation setup, in which a base LLM is trained over a single new task, is also common. These setups are useful for testing the tendency of LLMs to catastrophically forget, but one could argue that such a task-incremental setup does not reflect how LLMs would continually learn in the real world. For this reason, <em>no one continual learning setup is the best</em>. Rather, we should modify our experimental configuration within the frameworks outlined above such that the practical setting we are trying to test is most accurately reflected.</p><h4>Common Techniques for Continual Learning</h4><p>Now that we have a basic understanding of continual learning, we can overview some of the key categories of techniques for mitigating catastrophic forgetting. We will cover continual learning approaches in general, as well as highlight the methods that have been used in recent continual learning work with LLMs. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gjA9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gjA9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 424w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 848w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 1272w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gjA9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png" width="1456" height="460" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:403579,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gjA9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 424w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 848w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 1272w, https://substackcdn.com/image/fetch/$s_!gjA9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc3c794e-bc45-4ceb-a20f-0f0d33dea592_1948x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p><strong>Replay mechanisms</strong> (depicted above) are a simple and effective technique for continual learning that maintain a buffer of prior data over which to train the model. Before being included in the replay buffer, samples usually undergo a selection process (e.g., based on importance or diversity) [14] to ensure that the buffer contains high-quality, representative samples and is not too large. The entire replay buffer can also be quantized or compressed to reduce memory [15]. In cases where data cannot be explicitly stored inside of a replay buffer, we can also train or maintain a generative model to replay synthetic examples [16, 17].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aYEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aYEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 424w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 848w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 1272w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aYEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png" width="1456" height="647" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:647,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:341402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aYEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 424w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 848w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 1272w, https://substackcdn.com/image/fetch/$s_!aYEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffae3483f-8a62-42b1-8c32-de4bb951731e_1986x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [31])</figcaption></figure></div><p>Although replay buffers are one of the most simple and effective techniques for continual learning, applying them in the LLM domain is less straightforward. Namely, LLMs have a vast amount of prior training data and, in many cases, this data is not openly available. Therefore, constructing a replay buffer that captures the general capabilities of an LLM is non-trivial. However, several works have recently explored the use of replay buffers for continual post-training. For example, instruction tuning data has a more manageable volume, allowing a replay buffer to be constructed by retaining the most important or informative data throughout the continual post-training process [30, 31]; see above.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pUiZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pUiZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 424w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 848w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pUiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png" width="1456" height="989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:989,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pUiZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 424w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 848w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 1272w, https://substackcdn.com/image/fetch/$s_!pUiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e7616ec-350c-4bbf-adad-3a7eae044233_1708x1160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [19])</figcaption></figure></div><p><strong>Knowledge distillation</strong> [18] can be used to mitigate catastrophic forgetting by ensuring that a model&#8217;s representations do not drift during the continual learning process. In their simplest form, distillation-based continual learning techniques just combine the training loss on new data with a distillation loss with respect to prior model outputs [19]; see above. Many variants of this approach have been proposed [12, 20, 22]. We should also note that these techniques are not mutually exclusive; e.g., replay buffers can be combined with a distillation loss [13]. </p><p><strong>Regularization</strong> in various forms can be helpful for continual learning. In fact, <em>knowledge distillation can even be considered a form of regularization</em>. Researchers have explored constraining weight updates for subgroups of parameters&#8212;<em>usually the most important parameters for a task [11, 21]&#8212;</em>or increasing plasticity for select parameters [23]. We can also regularize the output distribution of the model by applying a <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a>&#8212;<em>similar to the use of KL to <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms-1">regularize the RL training objective</a></em>&#8212;and even simple changes like lowering the learning rate have been found to reduce forgetting [2]. <a href="https://cameronrwolfe.substack.com/p/model-merging">Model merging</a> has also been applied in tandem with explicit regularization to reduce catastrophic forgetting in LLMs [29]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5k0p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5k0p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 424w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 848w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5k0p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png" width="1456" height="502" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/407af371-d216-4215-aff0-99f5726f37ac_2258x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:502,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5k0p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 424w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 848w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5k0p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407af371-d216-4215-aff0-99f5726f37ac_2258x778.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [24])</figcaption></figure></div><p><strong>Architectural</strong> approaches have also been explored for continual learning that dynamically adapt the model&#8217;s architecture to handle incoming data. For example, new modules can be added to a neural network to handle new groups of data [24]; see above. Given the popularity of LoRA for LLMs, recent work has explored using LoRA modules as an architectural extension for learning new information during continual learning [26, 27]; see below. <a href="https://cameronrwolfe.substack.com/p/nano-moe">Mixture-of-Experts architectures</a> for LLMs have also been shown to be better at avoiding catastrophic forgetting [28]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ymhf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ymhf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 424w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 848w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ymhf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png" width="1456" height="665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:665,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:543626,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ymhf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 424w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 848w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymhf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63f85ed4-884b-4f96-a2c4-275586fbd7d3_2000x914.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [27])</figcaption></figure></div><p><strong>Further reading.</strong> We have now seen a comprehensive, high-level overview of the continual learning techniques that exist, but the literature is vast and dates all the way back to the 1980s (if not earlier)! The resources linked below will be helpful for developing a deeper understanding of continual learning research:</p><ul><li><p>A <a href="https://cameronrwolfe.substack.com/p/a-broad-and-practical-exposition-of-online-learning-techniques-a4cbc300dcd4">broad overview</a> of the categories of continual learning techniques. </p></li><li><p>A <a href="https://cameronrwolfe.substack.com/p/how-to-train-deep-neural-networks-over-data-streams-fdab15704e66">deep dive</a> on streaming learning techniques. </p></li><li><p>A <a href="https://arxiv.org/abs/2506.13045">survey</a> on continual learning for modern generative models. </p></li></ul><h2>Continual Learning for LLMs</h2><blockquote><p><em>&#8220;Surprisingly, without any data replay, continual post-training with RFT can achieve comparable performance with that of multi-task training, which is not achievable even when equipping SFT with continual learning strategies.&#8221; </em>- from [1]</p></blockquote><p>We will now take a look at several papers that study the topic of continual learning in the context of LLMs. Instead of focusing on continual learning techniques, however, these papers adopt standard LLM training methodologies&#8212;<em>supervised finetuning (SFT) and reinforcement learning (RL) in particular</em>&#8212;and analyze their natural ability to avoid catastrophic forgetting. Although SFT tends to not perform well for continual learning, RL is found to be shockingly robust to forgetting, even without employing explicit continual learning techniques (e.g., replay buffers or regularization). Given the current popularity and impact of RL in training frontier models, this inherent ability to handle continual learning makes RL an important tool for the creation of generally intelligent systems. </p><h4>More on SFT and RL</h4><p>To understand the different behaviors of SFT and RL in the continual learning setting, we need to gain a deeper understanding of the learning mechanisms that underlie these algorithms. For a full overview of each technique, please see the following resources:</p><ul><li><p><a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">Supervised Finetuning (SFT)</a></p></li><li><p><a href="https://cameronrwolfe.substack.com/p/grpo">Group Relative Policy Optimization (GRPO)</a></p></li></ul><p>As we will see, all of the papers in this overview adopt a <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">reinforcement learning with verifiable rewards (RLVR)</a> setup with GRPO as the RL optimizer.</p><p><strong>Training objectives.</strong> In SFT, we have a fixed dataset of supervised examples over which we are training our LLM. The training objective aims to minimize the model&#8217;s negative log-likelihood over this dataset, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QiLP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QiLP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 424w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 848w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 1272w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QiLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png" width="1456" height="350" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192232,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QiLP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 424w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 848w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 1272w, https://substackcdn.com/image/fetch/$s_!QiLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc223b-89b0-4288-a5b9-fad18ada6adb_2446x588.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">SFT training objective</figcaption></figure></div><p>In contrast, RL uses the objective shown below, which focuses on maximizing the reward&#8212;<em>such as a binary correctness signal in RLVR</em>&#8212;of on-policy completions sampled for prompts taken from a fixed dataset. Optionally, we can include a KL divergence regularization term that penalizes the model for producing an output distribution that differs significantly from some reference model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q06U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q06U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 424w, https://substackcdn.com/image/fetch/$s_!q06U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 848w, https://substackcdn.com/image/fetch/$s_!q06U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 1272w, https://substackcdn.com/image/fetch/$s_!q06U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q06U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png" width="646" height="180.1346153846154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:1456,&quot;resizeWidth&quot;:646,&quot;bytes&quot;:180620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q06U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 424w, https://substackcdn.com/image/fetch/$s_!q06U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 848w, https://substackcdn.com/image/fetch/$s_!q06U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 1272w, https://substackcdn.com/image/fetch/$s_!q06U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9266f619-9c28-4e83-98f5-961c6c7a2cf2_2283x637.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Forward and reverse KL.</strong> One possible way to view the SFT and RL training objectives is through their relation to the KL divergence. Formally, the KL divergence is a measure for the divergence between two probability distributions; see <a href="https://huggingface.co/blog/NormalUhr/kl-divergence-estimator-rl-llm">here</a> for full details. For two probability distributions <code>P</code> and <code>Q</code>, we can define the <a href="https://agustinus.kristia.de/blog/forward-reverse-kl/">forward and reverse KL divergences</a> as shown in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4HUl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4HUl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 424w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 848w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 1272w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4HUl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png" width="626" height="97.1023102310231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:1212,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:71707,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4HUl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 424w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 848w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 1272w, https://substackcdn.com/image/fetch/$s_!4HUl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f64f424-3ca8-4257-8b1e-7b805eab05a7_1212x188.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In the LLM domain, these probability distributions are usually the next token distributions outputted by our LLM. A key difference between the forward and reverse KL divergence lies in the sampling&#8212;<em>the distribution from which we sample in the above expectations changes</em>. Specifically, we are either sampling from our dataset (offline) in SFT or from the LLM itself (online or on-policy) in RL.</p><p><strong>SFT &#8776; forward KL.</strong> Using these concepts, we can show that the training objective used by SFT is equal to the forward KL divergence up to a constant. Let&#8217;s call the optimal (or target) distribution for our dataset &#960;<code>_*</code>. We can show the following for the relationship between this objective and the forward KL divergence, where <code>H(&#960;_*)</code> denotes the entropy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> of the optimal distribution over the SFT dataset. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jTKg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jTKg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 424w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 848w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 1272w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jTKg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png" width="644" height="182.62686567164178" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:1340,&quot;resizeWidth&quot;:644,&quot;bytes&quot;:95243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jTKg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 424w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 848w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 1272w, https://substackcdn.com/image/fetch/$s_!jTKg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90a3f7d0-0acc-42df-bc4c-d5cd40464fa0_1340x380.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In the above expression, the entropy of the optimal distribution is a constant, so the forward KL and SFT training objective are equal up to a constant&#8212;<em>minimizing forward KL is equivalent to minimizing the negative log-likelihood objective</em>. </p><p><strong>RL &#8776; reverse KL.</strong> As mentioned previously, RL tries to maximize the reward of on-policy completions while minimizing KL divergence with respect to a reference policy. We can actually derive a closed-form expression for the optimal solution to the RL objective. The expression for the optimal policy is shown below, where <code>Z(x)</code> denotes the partition function. Notably, this optimal policy expression is also the first part of <a href="https://cameronrwolfe.substack.com/i/167254905/deriving-the-dpo-loss">deriving the training loss for DPO</a>!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QEd4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QEd4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 424w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 848w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 1272w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QEd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png" width="494" height="184.4938775510204" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efb26675-91a0-44a5-8919-c732b4acaabb_980x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:980,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:68538,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QEd4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 424w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 848w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 1272w, https://substackcdn.com/image/fetch/$s_!QEd4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb26675-91a0-44a5-8919-c732b4acaabb_980x366.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>If we assume that this optimal policy <code>&#960;_*</code> is our target distribution, then we can show that maximizing the RL objective is equivalent to minimizing the reverse KL divergence between this target distribution and our policy <code>&#960;_&#952;</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yymm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yymm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 424w, https://substackcdn.com/image/fetch/$s_!yymm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 848w, https://substackcdn.com/image/fetch/$s_!yymm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 1272w, https://substackcdn.com/image/fetch/$s_!yymm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yymm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png" width="1456" height="433" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yymm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 424w, https://substackcdn.com/image/fetch/$s_!yymm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 848w, https://substackcdn.com/image/fetch/$s_!yymm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 1272w, https://substackcdn.com/image/fetch/$s_!yymm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e226855-7c37-43ef-95f0-ddc0cf23f769_1494x444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As we can see, the first line of this equation computes the reverse KL divergence relative to the KL divergence used in our derivation for the SFT objective. In the final line, we have the negative of our RL objective (plus a scaling factor of <code>1/&#946; </code>and an additional constant). Therefore, minimizing this reverse KL divergence objective would be the same as maximizing the RL training objective. </p><p><strong>What does this tell us? </strong>Now we understand the relation of SFT and RL to the forward and reverse KL divergence, respectively. But, <em>what do these relationships actually tell us about the objectives? </em>SFT minimizes negative log-likelihood over a dataset, which is equivalent to minimizing the forward KL divergence. This is a <strong>mode-covering</strong> objective. Our model is heavily penalized for assigning low probability to any completion that is found in the data&#8212;<em>the model must &#8220;spread&#8221; its probability mass across all possible completions or modes in the data.</em></p><p>On the other hand, RL maximizes rewards of on-policy completions, which is equivalent to a reverse KL objective and is <strong>mode-seeking</strong>. Put differently, the model prioritizes high-reward outputs, <em>even at the cost of ignoring output modes</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pPRu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pPRu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 424w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 848w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 1272w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pPRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png" width="443" height="305.9748987854251" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1235,&quot;resizeWidth&quot;:443,&quot;bytes&quot;:122187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pPRu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 424w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 848w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 1272w, https://substackcdn.com/image/fetch/$s_!pPRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332da0cf-bfea-427b-933e-0ddcc9bbcae6_1235x853.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In SFT, the model&#8217;s loss increases exponentially if we assign near-zero probability to any completion in the dataset&#8212;<em>this is due to the shape of the negative log-likelihood curve (shown above)!</em> Such a property is not true of RL, as we are simply maximizing the reward of on-policy completions. Assigning near-zero probability to a completion will prevent this particular completion from being sampled during RL, but reward can still be maximized over the completions that are sampled. <em>This is a fundamental property of RL that creates favorable behavior with respect to minimizing catastrophic forgetting during continual learning</em>. </p><h4><a href="https://arxiv.org/abs/2507.05386">Reinforcement Finetuning Naturally Mitigates Forgetting in Continual Post-Training</a> [1]</h4><p>Continual learning can be viewed as a continued post-training process for an LLM. In this setup, the same base LLM undergoes extensive post-training over an evolving and expanding data stream, forcing the model to adapt to new requirements and learn new skills or knowledge without losing existing capabilities. However, avoiding catastrophic forgetting in this scenario is difficult. In [1], authors consider this continual post-training setup and analyze the best learning paradigm&#8212;<em>either supervised finetuning (SFT) or reinforcement learning (RL)</em>&#8212;for maximizing performance and minimizing forgetting.</p><p><strong>Continual post-training.</strong> In the real world, continual learning is messy&#8212;<em>the LLM will be constantly exposed to new data from various sources&#8212;</em>but a more organized proxy setup is needed for research. A common way to simulate continual learning is via a sequential learning (or batch-incremental) setup, where the LLM is sequentially exposed to an ordered group of datasets. In [1], authors choose seven datasets that cover a wide scope of multi-modal (vision) use cases: <a href="https://scienceqa.github.io/">ScienceQA</a>, <a href="https://textvqa.org/">TextVQA</a>, <a href="https://vizwiz.org/tasks-and-datasets/vqa/">VizWiz</a>, <a href="https://huggingface.co/datasets/hiyouga/geometry3k">Geometry3K</a>, <a href="https://cs.stanford.edu/people/dorarad/gqa/about.html">GQA</a>, <a href="https://arxiv.org/abs/2003.10286">PathVQA</a>, and <a href="https://lizw14.github.io/project/2023_SuperCLEVR/">Super-CLEVR</a>. </p><blockquote><p><em>&#8220;A higher AvgAcc indicates better overall performance, while an FM closer to zero signifies less forgetting and better knowledge preservation.&#8221;</em> - from [1]</p></blockquote><p><strong>Evaluation metrics.</strong> Our goal in continual post-training is to <em>i)</em> maximize the LLM&#8217;s performance on each new task and <em>ii)</em> avoid performance degradation&#8212;<em>or catastrophic forgetting</em>&#8212;on prior tasks. Assume that the LLM is evaluated on all tasks after each training round, yielding performance <code>P_{t, j}</code> on task <code>j</code> after learning for task <code>t</code> is complete. We can then capture key performance properties of continual post-training via the following two metrics:</p><ol><li><p><em>Average accuracy (AvgAcc)</em>: the average accuracy of the model across all tasks after training on the final task <code>T</code> has completed.</p></li><li><p><em>Forgetting measure (FM)</em>: the average difference between the model&#8217;s final accuracy for a task and the best accuracy observed for that task throughout all <code>T</code> rounds of the training sequence. </p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJfe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJfe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 424w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 848w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 1272w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png" width="534" height="208.68543956043956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:1456,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:138344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJfe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 424w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 848w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 1272w, https://substackcdn.com/image/fetch/$s_!CJfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feabdbf11-60ef-4629-8ba2-6af7b9d73a9a_1466x573.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Continual post-training metrics (from [1])</figcaption></figure></div><p>After the end of the continual post-training process, the above metrics are computed over the test sets of all previously-encountered tasks. Going further, authors in [1] also measure performance on several general LLM benchmarks (i.e., <a href="https://mmmu-benchmark.github.io/">MMMU</a>, <a href="https://arxiv.org/abs/2406.01574">MMLU-Pro</a>, and <a href="https://arxiv.org/abs/2305.10355">POPE</a>) at the end of the continual post-training process to check for any impact on the model&#8217;s general capabilities.</p><p><strong>SFT versus RL.</strong> Continual post-training experiments are performed in [1] using the <a href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct">Qwen-2.5-VL-7B-Instruct</a> model, which is sequentially trained on data from each of the seven benchmarks. Notably, no replay buffer or data from prior tasks is used when training on new tasks, so the model&#8217;s ability to avoid forgetting is entirely dependent upon mechanics of the learning algorithm. As mentioned before, two types of learning algorithms are used:</p><ol><li><p>Supervised Finetuning</p></li><li><p>Reinforcement Learning (<a href="https://cameronrwolfe.substack.com/p/grpo">GRPO</a>, <a href="https://cameronrwolfe.substack.com/i/173306894/reinforce-leave-one-out-rloo-2">RLOO</a> and <a href="https://arxiv.org/abs/2310.10505">ReMax</a>)</p></li></ol><p>For RL, we derive rewards using a standard reasoning model setup that combines the verifiable reward with a format reward that encourages the model to <em>i)</em> wrap its reasoning trace in <code>&lt;think&gt;</code> tokens and <em>ii)</em> mark its output with a <code>\boxed{}</code> label. Models output a reasoning trace prior to their final output, though tests are performed both with and without reasoning for all training setups. </p><p><strong>RL forgets less.</strong> The results of the continual post-training experiments in [1] are depicted below. SFT clearly leads to catastrophic forgetting of previously-learned tasks, which gets worse as tasks move further into the past&#8212;<em>forgetting is worst on initial tasks in the sequence</em>. More specifically, we see an average accuracy of 54% with SFT, while multi-task training on all tasks reaches an average accuracy of 62.9%. Similarly, a FM of -10.4% is also observed for SFT, indicating that most tasks degrade noticeably in performance throughout continual post-training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U58y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U58y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 424w, https://substackcdn.com/image/fetch/$s_!U58y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 848w, https://substackcdn.com/image/fetch/$s_!U58y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 1272w, https://substackcdn.com/image/fetch/$s_!U58y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U58y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png" width="1392" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U58y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 424w, https://substackcdn.com/image/fetch/$s_!U58y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 848w, https://substackcdn.com/image/fetch/$s_!U58y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 1272w, https://substackcdn.com/image/fetch/$s_!U58y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae0c71f6-dca9-43cf-87aa-bbbe365f115a_1392x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>While SFT struggles to mitigate forgetting, RL naturally adapts well to new tasks. For GRPO, we observe an average accuracy of 60% (i.e., slightly below multi-task learning) and an FM of -2.3%. Additionally, the final accuracy on ScienceQA&#8212;<em>the first task in the sequence</em>&#8212;is 93%, compared to a peak accuracy of 95.6%. These results show that RL strikes a strong balance between learning and remembering. </p><blockquote><p><em>&#8220;Without any data replay, continual post-training with RFT can achieve comparable performance with that of multi-task training, which is not achievable even when equipping SFT with continual learning strategies.&#8221;</em> - from [1]</p></blockquote><p><strong>Influence on general capabilities.</strong> In the same vein, SFT-based continual post-training also degrades general model capabilities; see below. In contrast, we see in [1] that RL maintains&#8212;<em>or even slightly enhances</em>&#8212;performance on general benchmarks. For example, models sequentially trained with GRPO improve from an initial accuracy of 52.1% to a final accuracy of 54.2% on MMMU!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-y6i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-y6i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 424w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 848w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-y6i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png" width="1456" height="830" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267413,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-y6i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 424w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 848w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!-y6i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf8c4337-9448-4614-8e2c-8c924f63098f_1836x1046.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Such an ability to maintain performance on general benchmarks is a desirable aspect of continual learning. Ideally, we want the LLM to adapt to new tasks while maintaining its existing, foundational capabilities as much as possible.</p><p><strong>Why does RL forget less?</strong> Given the above results, we might begin to wonder: <em>Why does RL have the ability to naturally avoid catastrophic forgetting?</em> Of course, it is possible that such continual learning abilities are directly attributable to RL itself. However, authors in [1] also consider two alternative explanations for the lack of catastrophic forgetting:</p><ul><li><p>The use of a KL divergence term in RL regularizes the training process and acts as a form of knowledge distillation that preserves prior knowledge. </p></li><li><p>The use of long CoT reasoning in models trained with RL leads to a more robust knowledge base that is better protected from forgetting. </p></li></ul><p>To test whether these factors help with avoiding catastrophic forgetting, three setups are tested that ablate the use of KL divergence and long CoT reasoning. Interestingly, we learn from testing these setups that removing KL divergence, despite degrading the stability of RL training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, does not lead to any degradation in performance metrics for continual post-training. Additionally, models that do not output a reasoning trace resist catastrophic forgetting similarly to those that do. Using CoT reasoning improves baseline model performance, <em>but continually trained models in either setup see the same amount of catastrophic forgetting</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HuxB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HuxB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 424w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 848w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 1272w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HuxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png" width="1456" height="371" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HuxB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 424w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 848w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 1272w, https://substackcdn.com/image/fetch/$s_!HuxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef7cf1a-38ca-42a7-89d6-7e5b063c1aee_1828x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The results of these ablation experiments are outlined in the table above. The impressive performance of RL in continual post-training experiments does not seem to stem from the use of KL divergence or long CoT reasoning. Rather, <em>the ability to perform continual learning seems to be an inherent property of RL training</em>. Insight as to how RL avoids forgetting is provided by theory in [1] showing that RL naturally scales policy updates according to the variance of the reward signal, leading to more conservative updates for important or sensitive parameters. </p><blockquote><p><em>&#8220;We offer a theoretical perspective suggesting that RFT&#8217;s updates are inherently more conservative in parameter subspaces sensitive to prior tasks. This conservatism is naturally scaled by the variance of the reward signal, creating a data-dependent regularization that dampens updates on uncertain samples, thus protecting established knowledge.&#8221;</em> - from [1]</p></blockquote><h4><a href="https://arxiv.org/abs/2510.18874">Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting</a> [2]</h4><p>Work in [2] shares a very similar focus to the paper above&#8212;<em>trying to compare SFT and RL in the context of continual learning</em>. However, a different experimental setup is used that considers three domains: instruction following (<a href="https://arxiv.org/abs/2311.07911">IFEval</a>), general skills (<a href="https://arxiv.org/abs/2009.03300">MMLU</a>), and arithmetic reasoning (<a href="https://github.com/Jiayi-Pan/TinyZero">Countdown</a>). Beyond these target tasks that are used for training and evaluation, a few non-target tasks (i.e., <a href="https://github.com/hendrycks/math">MATH</a> and two <a href="https://arxiv.org/abs/2406.18510">safety</a> <a href="https://arxiv.org/abs/2406.18495">benchmarks</a>) are included to provide a wider evaluation suite. We do not train the LLM over a sequence of tasks in [2]. Rather, the LLM is trained over one target task&#8212;<em>a domain adaptation setup</em>&#8212;and we measure performance via:</p><ul><li><p>The accuracy gain on that target task.</p></li><li><p>The average accuracy drop across all non-target tasks.</p></li></ul><p>Notably, the lack of multi-step sequential learning makes this setup less realistic. In [1], we see that the impact of catastrophic forgetting is greater after several training rounds. However, the domain adaptation setup in [2] does allow us to efficiently analyze the forgetting mechanics of different learning algorithms. The following <strong>learning algorithms</strong> are considered in [2]:</p><ol><li><p>SFT training on responses from a teacher model (<a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct">Llama-3.3-70B-Instruct</a>).</p></li><li><p>Self-SFT training, which performs SFT-style training over responses from the initial policy (before training) or reference model.</p></li><li><p>RL training using GRPO with verifiable rewards&#8212;<em>a standard RLVR setup</em>. </p></li></ol><p>Both SFT variants filter completions based on correctness, as determined by deterministic verifiers for each domain. Self-SFT is a <a href="https://rlhfbook.com/c/10-rejection-sampling">rejection sampling</a> setup (i.e., incorrect responses are rejected) that is used as a simple baseline, whereas the SFT setup performs offline knowledge distillation from a larger model. Self-SFT is an offline approach as well because completions are sampled from the initial model, rather than on-policy. The same verifiable correctness signal used for filtering completions in SFT variants is also used as the reward signal in RL.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 424w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 848w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 1272w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png" width="1174" height="1180" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1180,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 424w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 848w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 1272w, https://substackcdn.com/image/fetch/$s_!KLXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d87d634-88de-455e-ac16-fac9c0a820b2_1174x1180.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>RL forgets less (again).</strong> Experiments are performed in [2] using Qwen-2.5 and Llama-3 models with up to 8B parameters. As shown above, higher levels of forgetting&#8212;<em>as measured via the average accuracy drop across non-target tasks</em>&#8212;are observed with SFT compared to RL. In fact, Qwen-2.5 models see &lt;1% average accuracy drop across all tasks and model scales for RL training, whereas the average accuracy drop with SFT reaches nearly 30% in some cases. </p><blockquote><p><em>&#8220;RL leads to less forgetting than SFT while achieving comparable or higher target task performance&#8230; SFT suffers from severe forgetting, whereas RL can achieve high target task performance without substantial forgetting.&#8221;</em> - from [2]</p></blockquote><p>Despite the ability of RL to avoid catastrophic forgetting, the results with SFT are not actually bad&#8212;<em>there is just a clear domain tradeoff</em>. We can achieve performance improvements in the target domain via RL training, but models trained via SFT actually perform better. Unfortunately, the superior performance of SFT in the target domain comes at the cost of degraded performance on non-target tasks. For this reason, the comparison is not as simple as <code>RL &gt; SFT</code>. Rather, RL and SFT lie at different points on the <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto frontier</a> of target and non-target task accuracy&#8212;<em>better performance in one domain comes at the expense of the other</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FdWJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FdWJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 424w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 848w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 1272w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FdWJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png" width="1204" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1204,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FdWJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 424w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 848w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 1272w, https://substackcdn.com/image/fetch/$s_!FdWJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dbac20b-abe7-444e-bf79-ba1d0a318a06_1204x582.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Benefits of on-policy data.</strong> Similar to work in [1], authors in [2] show the lack of catastrophic forgetting in RL is not due to the inclusion of a KL divergence term in the objective; see above. Interestingly, the exact advantage formulation used by GRPO is also found to have little impact on continual learning capabilities&#8212;<em>a naive <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a>-based RL setup is shown to mitigate forgetting to a similar extent</em>. It is possible, however, that the continual learning abilities of RL stem from its use of on-policy samples&#8212;<em>unlike the offline dataset used by SFT</em>&#8212;during training. To test this theory, we consider the following training setups:</p><ul><li><p><em>On-policy SFT</em>: running SFT using fully on-policy samples that are directly obtained from the RL training process. </p></li><li><p><em>Iterative SFT</em>: re-generating data for SFT after every epoch using the current policy (i.e., a partially on-policy approach). </p></li></ul><p>Put simply, these approaches adapt SFT to use on-policy data, which allows us to decouple the impact of RL training and on-policy data. The use of iterative SFT also allows us to test a semi-on-policy scenario, which samples fresh on-policy data at the end of each epoch (i.e., instead of generating new samples during each training iteration). This coarse-grained approach to on-policy data has efficiency benefits&#8212;<em>we can adjust the regularity with which we sample fresh on-policy data. </em></p><div class="pullquote"><p><em>&#8220;We find that for SFT, while generating data only from the initial policy is not enough, approximately on-policy data generated at the start of each epoch can suffice for substantially reducing forgetting. This suggests a practical guideline for LM post-training: leveraging on-policy data, potentially sampled asynchronously or at the start of each epoch for improved efficiency, can reduce unintended disruption of the model&#8217;s existing capabilities.&#8221; - from [2]</em></p></div><p>Experiments with these training algorithms provide empirical evidence that on-policy data is a key contributor to the success of RL in the continual learning domain. Specifically, models trained via on-policy SFT mitigate forgetting to a similar extent as those trained via RL. Additionally, the data used does not need to be fully on-policy&#8212;<em>similar trends are observed with iterative SFT</em>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A0Eh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A0Eh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 424w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 848w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A0Eh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png" width="1456" height="1037" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1037,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:364729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A0Eh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 424w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 848w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!A0Eh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d6891fa-b030-4bd6-8aec-6659dc4c5a25_1598x1138.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Mode-seeking versus mode-covering.</strong> Intuitively, we might assume that the mode-covering nature of SFT would allow the model to maintain probability mass across all tasks and, therefore, avoid catastrophic forgetting. As we have seen, however, <em>the opposite is true in practice</em>. Such a finding is due to the fact that we are only training our model over a small subset of the model&#8217;s total data distribution in most of these experiments. Potentially our observations would be different if we were able to retain the LLM&#8217;s entire training dataset within a replay buffer, but implementing such an approach efficiently would be incredibly difficult.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ouym!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ouym!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 424w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 848w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ouym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png" width="1456" height="943" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93215854-0559-4640-a298-a1894b18e851_1594x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:943,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ouym!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 424w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 848w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!Ouym!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93215854-0559-4640-a298-a1894b18e851_1594x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>In the standard LLM post-training setup, the mode-seeking behavior of RL is more robust to catastrophic forgetting. To explain this phenomenon, authors in [2] construct a simplified setting shown above, which illustrates the dependence of forgetting on the modality of the underlying target distribution. If our target distribution is multi-modal, which is likely to be true for an LLM, then the mode-seeking nature of RL actually leads to less forgetting relative to a mode-covering objective like SFT. The simplified distribution that is constructed in [2] has two modalities corresponding to old and new knowledge. For such a distribution, the forward KL objective yields noticeable forgetting while minimizing the reverse KL allows both modes of the target distribution to be properly captured. </p><h4><strong><a href="https://arxiv.org/abs/2509.04259">RL&#8217;s Razor: Why Online RL Forgets Less</a> [3]</strong></h4><p>As we know, SFT and RL achieve comparable performance when training on a new task but have drastically different forgetting dynamics. In most cases, gains on new tasks with SFT come at the cost of erasing prior knowledge, while RL is much better at protecting old capabilities; see below. By studying this gap in performance, authors in [3] identify a metric that reliably predicts the amount of forgetting that occurs for both SFT and RL: the distributional shift&#8212;<em>measured via <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a></em>&#8212;between the base and finetuned models on the target task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ip1n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ip1n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 424w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 848w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 1272w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ip1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png" width="1456" height="672" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ip1n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 424w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 848w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 1272w, https://substackcdn.com/image/fetch/$s_!ip1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f357d29-a717-4073-a60b-cbc53a89d74f_1738x802.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>RL&#8217;s Razor.</strong> In addition to discovering this relationship between the underlying distribution shift and forgetting, we see in [3] that the finetuned models from SFT and RL have unique properties:</p><ul><li><p>RL is biased towards solutions that minimize distributional shift.</p></li><li><p>SFT can converge to solutions arbitrarily far away from the base model. </p></li></ul><p>Such a property naturally implies the improved continual learning abilities of RL. By discovering a solution that minimizes distributional shift, <em>we also minimize the amount of forgetting that occurs</em>; see above<em>.</em> The bias of RL towards nearby solutions that minimize catastrophic forgetting is referred to in [3] as &#8220;RL&#8217;s Razor&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. </p><blockquote><p><em>&#8220;RL&#8217;s Razor: among the many high-reward solutions for a new task, on-policy methods such as RL are inherently biased toward solutions that remain closer to the original policy in KL divergence&#8230; the KL divergence between the fine-tuned model and the base model, measured on the new task, reliably predicts&#8230; forgetting.&#8221;</em> - from [3]</p></blockquote><p><strong>Distribution shift.</strong> In the LLM domain, we often measure the KL divergence between the next token distributions of two models. For example, the RL training objective has a KL divergence term that regularizes drift between the current and reference policy, where the KL divergence is computed using on-policy samples taken from the current policy during RL training. In [3], authors compute the KL divergence over data from the task on which our policy is being finetuned (i.e., the target task). We are restricted to using the target data because we rarely have access to the pretraining data (or any prior tasks) on which an LLM was trained. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I2iJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I2iJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 424w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 848w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 1272w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I2iJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png" width="1456" height="237" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46334012-3aea-46b7-8292-e097e406bb00_2110x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:237,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I2iJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 424w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 848w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 1272w, https://substackcdn.com/image/fetch/$s_!I2iJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46334012-3aea-46b7-8292-e097e406bb00_2110x344.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>This KL divergence between base and finetuned models on the target dataset can be viewed as capturing the distributional shift from training. <em>We are computing the divergence between models before and after training over the training data itself</em>. When measured in this way, the distributional shift is found to be consistently predictive of the amount of forgetting that occurs. Given that no prior data is used to compute this KL divergence, <em>such a finding is highly non-trivial</em>!</p><p><strong>Experiments </strong>in [3] are performed using both vanilla SFT and RL with GRPO. The RL setup uses standard verifiable rewards and no KL divergence regularization. Similarly to [2], the base model (<a href="https://huggingface.co/Qwen/Qwen2.5-3B-Instruct">Qwen-2.5-3B-Instruct</a>) is trained on one target task (i.e., <a href="https://arxiv.org/abs/2503.24290">Open-Reasoner-Zero</a>, <a href="https://arxiv.org/abs/2306.05301">ToolAlpaca</a>, or the Chemistry L-3 subset of <a href="https://arxiv.org/abs/2406.09098">SciKnowEval</a>) and evaluated on both the target task and set of prior tasks (i.e., <a href="https://arxiv.org/abs/1905.07830">HellaSwag</a>, <a href="https://arxiv.org/abs/2109.07958">TruthfulQA</a>, <a href="http://TruthfulQAhttps://arxiv.org/abs/2009.03300">MMLU</a>, <a href="https://arxiv.org/abs/2311.07911">IFEval</a>, <a href="https://arxiv.org/abs/1907.10641">WinoGrande</a>, and <a href="https://arxiv.org/abs/2107.03374">HumanEval</a>). Given that hyperparameter settings can massively impact results in a continual learning setup<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, a wide variety of hyperparameters are tested for each task and results are visualized as a <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto frontier</a> constructed by all possible settings. </p><p><strong>Lower KL leads to less forgetting.</strong> RL training improves target task performance while keeping performance on prior tasks stable. However, improvements in performance obtained via SFT come at the cost of noticeable forgetting. The deterioration in performance is most visible in the math domain; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qbOa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qbOa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 424w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 848w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 1272w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qbOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png" width="1456" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qbOa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 424w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 848w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 1272w, https://substackcdn.com/image/fetch/$s_!qbOa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2278d6ce-a052-48a4-b671-2da4ff50c95a_2246x816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Identifying the cause of such forgetting is difficult due to the high computational cost of RL training&#8212;<em>testing each hypothesis is quite expensive</em>! To make this search more tractable, a toy setting is created based on the <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> and <a href="https://arxiv.org/abs/1708.07747">FashionMNIST</a> datasets for which RL training is much faster. Using this setting, a variety of candidate metrics are tested for a relationship to catastrophic forgetting:</p><ul><li><p>The magnitude of changes to model parameters.</p></li><li><p>The sparsity of weight updates.</p></li><li><p>The rank of policy gradients throughout training.</p></li></ul><p>The only quantity that demonstrates a consistent relationship with the amount of catastrophic forgetting is the KL divergence between base and finetuned models over the target dataset; see below. The fact that the rank or sparsity of policy gradient updates is unrelated to forgetting is notable, as prior research [4] has shown that RL works surprisingly well even when using <a href="https://cameronrwolfe.substack.com/p/easily-train-a-specialized-llm-peft">LoRA</a> with a low rank. Such a finding indicates that the updates being produced by RL are potentially sparse or low rank, which could help to reduce forgetting. However, we see in [3] that the story is not this simple. Rather, the benefits of RL stem from an implicit KL regularization&#8212;<em>or RL&#8217;s Razor</em>&#8212;that minimizes distribution shift in training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vTCx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vTCx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 424w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 848w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 1272w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vTCx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png" width="1456" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381704,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vTCx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 424w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 848w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 1272w, https://substackcdn.com/image/fetch/$s_!vTCx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d6ac79-a6f8-46c4-a002-4a328a8e62bc_2212x1003.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>To further validate the relationship between the KL divergence and forgetting, authors create an &#8220;oracle&#8221; SFT distribution in their toy setting. Put simply, this experiment performs SFT on a dataset that has been analytically constructed to minimize the KL divergence between the base and finetuned models. As shown above, running SFT on this data yields an even better tradeoff than RL&#8212;<em>the model performs better on the target task without sacrificing prior task performance</em>. </p><blockquote><p><em>&#8220;RL performs well because its on-policy updates bias the solution toward low-KL regions, but when SFT is explicitly guided to the KL-minimal distribution, it can surpass RL.&#8221;</em> - from [3]</p></blockquote><p><strong>On-policy data.</strong> Beyond the toy example explained above, authors in [3] also run SFT training over on-policy data obtained during RL. The accuracy-forgetting tradeoff achieved by the resulting model matches that of those trained via RL, which aligns with prior work [2] and provides further evidence that on-policy data plays a key role in mitigating forgetting for RL. To better understand the impact of on-policy data, three different learning algorithms are tested (shown below):</p><ul><li><p><em>Standard GRPO.</em></p></li><li><p><em>Standard SFT</em>. </p></li><li><p><em>1-0 REINFORCE</em>: an on-policy RL algorithm with a very simple advantage function (i.e., 1 if the answer is correct and zero otherwise).</p></li><li><p><em>SimPO</em> [5]: an offline preference tuning algorithm that simplifies DPO by directly using the log probability of a sequence as the implicit reward and, therefore, removing the need for a reference model.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QcLQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QcLQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 424w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 848w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 1272w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QcLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png" width="1456" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295447,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QcLQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 424w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 848w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 1272w, https://substackcdn.com/image/fetch/$s_!QcLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58ffa6a9-a3c3-4d20-b406-bdda17891c86_1840x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As we can see in the left half of the above figure, these experiments ablate the use of negative examples and on-policy data within the training setup. Interestingly, the 0-1 REINFORCE algorithm performs similarly to GRPO, while results with SimPO resemble those of SFT. Such results indicate that the use of on-policy data is the key contributor to RL&#8217;s lack of forgetting. We also see above that the use of on-policy data leads to minimal KL divergence between the base and finetuned models over the target distribution. <em>Such results indicate that the implicit bias of RL towards low KL solutions stems from the online nature of training</em>. This empirical observation is also justified by further theoretical analysis in [3]. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S8ov!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S8ov!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 424w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 848w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 1272w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S8ov!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png" width="1456" height="234" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:234,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S8ov!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 424w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 848w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 1272w, https://substackcdn.com/image/fetch/$s_!S8ov!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F550e03d6-70e2-4636-acdc-1c10fac5e82b_1820x292.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2601.02151">Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</a> [6]</strong></h4><blockquote><p><em>&#8220;While RL aligns with the model&#8217;s internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as Confident Conflicts&#8212;tokens characterized by low probability but low entropy.&#8221;</em> - from [6]</p></blockquote><p>We have learned that RL avoids catastrophic forgetting much better than SFT due to its use of on-policy data, which allows for the discovery of a solution with minimal KL divergence between base and finetuned models on the target data. Although we know that these factors lead to less forgetting, <em>we do not yet understand why this is the case</em>. In [6], authors offer a new perspective on the forgetting properties of SFT and RL by analyzing token probabilities and entropy of models trained with these two approaches. When these two quantities are measured throughout the training process, we see that a clear gap exists:</p><ul><li><p>On-policy RL tends to cluster in regions of highly-confident and correct predictions&#8212;<em>characterized by high probability and low entropy</em>&#8212;or exploratory completions&#8212;<em>characterized by high entropy</em>.  </p></li><li><p>SFT has a significant cluster of tokens with both low entropy and low probability&#8212;<em>these are referred to as &#8220;Confident Conflicts&#8221;</em>.</p></li></ul><p>To discover this distribution mismatch, token probability and predictive entropy is measured over both the SFT dataset and model-generated rollouts. This trend is visualized below, where we see that SFT data has a noticeable cluster of confident conflict tokens that does not exist when using on-policy data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!52UY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!52UY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 424w, https://substackcdn.com/image/fetch/$s_!52UY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 848w, https://substackcdn.com/image/fetch/$s_!52UY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!52UY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!52UY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2636867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!52UY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 424w, https://substackcdn.com/image/fetch/$s_!52UY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 848w, https://substackcdn.com/image/fetch/$s_!52UY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!52UY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1981c7af-d1d0-4667-9632-6835f55d9c43_2608x1094.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p><strong>Why does this occur?</strong> We are using external supervision in SFT (i.e., an offline supervised dataset), whereas RL learns from on-policy or self-generated data. In some cases, training the model on external data forces it to mimic outputs that align poorly with its current next distribution&#8212;<em>confident conflicts occur when external data has a strong conflict with the model&#8217;s prior.</em> As a result, gradient updates can become large and destructive, leading to catastrophic forgetting.</p><div class="pullquote"><p><em>&#8220;Because the model strongly favors another token, fitting the target requires substantial parameter updates, which can overwrite general representations in the base model. By contrast, when the model is uncertain (high entropy), the gradients are smaller and updates are gentler, helping preserve the model&#8217;s original capabilities.&#8221; - from [6]</em></p></div><p><strong>Masking conflicts.</strong> To determine whether confident conflict tokens truly lead to forgetting, authors in [6] test simply masking the loss from such tokens during SFT. Interestingly, catastrophic forgetting is significantly reduced when these tokens are masked from the training loss, <em>indicating that confident conflict tokens play a significant role in the tendency of SFT to damage prior knowledge</em>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y4LD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y4LD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 424w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 848w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y4LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png" width="400" height="330.14925373134326" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1340,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:483854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y4LD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 424w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 848w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!y4LD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f527807-a084-4e13-be45-9206ea761a41_1340x1106.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>Extending this idea, a novel training algorithm, called <strong>Entropy Adaptive Finetuning (EAFT)</strong>, is proposed in [6] that scales the token-level cross-entropy loss by a dynamic entropy factor. The new loss formulation is outlined below, which multiplies the supervised loss by the token&#8217;s normalized entropy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. By using this token-level entropy scaling factor, we can effectively mask the loss of low entropy tokens that lead to destructive gradient updates while maintaining the full update for high entropy tokens that are beneficial for exploration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nydC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nydC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 424w, https://substackcdn.com/image/fetch/$s_!nydC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 848w, https://substackcdn.com/image/fetch/$s_!nydC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 1272w, https://substackcdn.com/image/fetch/$s_!nydC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nydC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png" width="615" height="293.1387362637363" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:615,&quot;bytes&quot;:215643,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nydC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 424w, https://substackcdn.com/image/fetch/$s_!nydC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 848w, https://substackcdn.com/image/fetch/$s_!nydC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 1272w, https://substackcdn.com/image/fetch/$s_!nydC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02b36936-f4e2-4647-97d4-08ce0a0138e9_1916x913.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">EAFT loss formulation (from [6])</figcaption></figure></div><blockquote><p><em>&#8220;EAFT employs a soft gating mechanism that dynamically modulates the training loss based on token-level entropy.&#8221;</em> - from [6]</p></blockquote><p>To improve the efficiency of EAFT, authors in [6] only compute entropy over the Top-<code>K</code> (where <code>K = 20</code>) tokens in the distribution. As shown in the figure below, this setting balances the tradeoff between compute and memory overhead and ensures that added computational overhead relative to vanilla SFT is minimal. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ksfZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ksfZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 424w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 848w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ksfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png" width="400" height="412.01413427561835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1166,&quot;width&quot;:1132,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:272033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ksfZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 424w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 848w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!ksfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f087448-711c-4f84-8a7e-4816673f3220_1132x1166.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p><strong>Results on Math.</strong> EAFT is validated in the Math domain using models across multiple families ranging from 4B to 32B parameters. Training prompts are sourced from <a href="http://faculty.bicmr.pku.edu.cn/~dongbin/Publications/numina_dataset.pdf">NuminaMath</a>, <a href="https://arxiv.org/abs/2502.17387">BigMathVerified</a>, and <a href="https://arxiv.org/abs/2504.13941">Nemotron-CrossThink</a>, while completions are sampled from <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507">Qwen-3-235B-A22B-Instruct</a>. Both in-domain and general benchmarks are used for evaluation. Models trained with EAFT perform well in the target domain while maintaining performance on general benchmarks; see below. Additionally, EAFT is found to effectively filter confident conflict samples during the training process, as demonstrated by the visible reduction in gradient magnitude within the confident conflict zone of the below figure. These results are further validated in experiments in the medical and tool use domains. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LQc2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LQc2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 424w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 848w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LQc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png" width="1456" height="783" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:939874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LQc2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 424w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 848w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 1272w, https://substackcdn.com/image/fetch/$s_!LQc2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7477fc70-c4dd-408d-8dbe-84db2fc1d817_2263x1217.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><h2>Does RL Generalize Well?</h2><p>So far, we have focused on retaining old skills while learning new ones. A closely related question is whether the same mechanisms that reduce forgetting also improve transfer and out-of-distribution generalization. The fact that RL performs well in a continual learning setting has important implications for its generalization properties. Put simply, <em>RL training tends to benefit more than just the target domain</em>. As we will see in the next few papers, there are many examples of RL training yielding cross-domain performance benefits or improving the generalization of an LLM to some other task. Much of this analysis is similar in nature to what we have seen for continual learning, but the emphasis shifts from remembering prior tasks to generalizing beyond the training distribution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ha_n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ha_n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 424w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 848w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ha_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png" width="417" height="351.12774725274727" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1226,&quot;width&quot;:1456,&quot;resizeWidth&quot;:417,&quot;bytes&quot;:630478,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ha_n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 424w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 848w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 1272w, https://substackcdn.com/image/fetch/$s_!ha_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f287cdb-5305-4abd-a2d1-7c081df8cf82_1722x1450.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p><strong>SFT Memorizes, RL Generalizes [7] </strong>performs a comparative post-training analysis between SFT and RL on both language-only and vision-language tasks. The main results of this analysis are depicted above, where we see that:</p><ul><li><p>Both SFT and RL improve in-domain performance.</p></li><li><p>Only RL generalizes well to new tasks or data.</p></li></ul><p>Experiments in [7] use <a href="https://huggingface.co/meta-llama/Llama-3.2-11B-Vision">Llama-3.2-Vision-11B</a> as the base model and train over two synthetic tasks (shown below) that test distinct forms of generalization:</p><ol><li><p><em>GeneralPoints</em>: A card game that requires the model to create equations to reach a target number using four given cards. We can test rule-based generalization by changing the mapping of face cards to numbers.</p></li><li><p><em>V-IRL</em>: A navigation task that has the model reach a destination using visual landmarks and spatial reasoning. We can test generalization by varying the available action space or visual context.</p></li></ol><p>Each task can be setup as both a language-only and vision-language problem. In all experiments, RL tends to promote out-of-distribution generalization while SFT actually damages it. For example, the out-of-distribution performance of models trained with RL improves by 3.5% and 11.0% on language-only GP and V-IRL. For vision-language variants, this performance improvement is slightly less pronounced (i.e., 3.0% and 9.3% on GP and V-IRL) but still present. In stark contrast, SFT degrades out-of-distribution performance by as much as 79.5%. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PgSy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PgSy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 424w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 848w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 1272w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PgSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png" width="1374" height="1416" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1416,&quot;width&quot;:1374,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1677331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PgSy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 424w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 848w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 1272w, https://substackcdn.com/image/fetch/$s_!PgSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4edcbf38-9810-4319-8dcd-70a2b9ae6e23_1374x1416.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>As an interesting side note, authors in [7] also find that RL benefits the model&#8217;s underlying perception capabilities. Namely, the model actually improves in its ability to identify key vision features during training, <em>indicating that RL is not just learning reasoning patterns but also improving fundamental abilities (i.e., perception)</em>. </p><blockquote><p><em>&#8220;Analysis of the GP-VL task showed that RL improved the model's ability to correctly identify card values from images, suggesting that outcome-based rewards can refine perceptual processing beyond what supervised training achieves.&#8221;</em> - from [7]</p></blockquote><p><strong>From Atomic to Composite [8]</strong> tests the generalization impact of RL training on problems that require complementary reasoning&#8212;<em>the ability to integrate external context with the model&#8217;s parametric knowledge</em>. To test this style of reasoning, a controlled synthetic dataset is created; see below. The dataset is based on a knowledge graph of human biographies with fixed relationships. Using this graph, we can construct multi-hop questions that test complementary reasoning by design. More specifically, questions are specifically constructed to test three levels of reasoning with increasing complexity (depicted below):</p><ol><li><p><em>IID reasoning</em> applies known patterns to new entities. </p></li><li><p><em>Compositional reasoning</em> applies known relationships to new relational paths.</p></li><li><p><em>Zero-shot reasoning</em> requires generalizing to unseen relations. </p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1qsp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1qsp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 424w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 848w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1qsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png" width="1456" height="1072" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1072,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:599843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1qsp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 424w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 848w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!1qsp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13f164c-7142-4dac-a7a7-f93b9668fd37_1624x1196.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>The training process in [8] starts with <a href="https://huggingface.co/Qwen/Qwen2.5-1.5B">Qwen-2.5-1.5B</a>, performs an initial SFT stage, then tests several combinations of SFT and RL (using GRPO with binary verifiable rewards) training. The main results of these experiments are below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gp-L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gp-L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 424w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 848w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gp-L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png" width="1456" height="741" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:515895,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gp-L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 424w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 848w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!gp-L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7b1c9b-d8aa-44fe-9f42-2dcb07372352_2700x1374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>As we can see, this analysis shows us that RL is capable of synthesizing multiple atomic reasoning capabilities into higher-level (composite) reasoning patterns. However, this is only possible when the model is trained with SFT prior to RL. In contrast, pure SFT training yields high in-domain performance and poor out-of-domain generalization, which reflects findings in prior work. In other words, <em>SFT tends to memorize reasoning patterns rather than learn them</em>. When a model is first trained via SFT to acquire primitive reasoning capabilities, then RL serves as a &#8220;synthesizer&#8221; through which the model learns how to properly combine these capabilities for solving complex, compositional reasoning problems. </p><blockquote><p><em>&#8220;[We demonstrate] that RL synthesizes novel reasoning strategies and enables robust zero-shot generalization when LLMs are first pre-trained on foundational atomic reasoning skills via Supervised Fine-Tuning.&#8221;</em> - from [8]</p></blockquote><p><strong>Does math reasoning improve general capabilities? </strong>A large-scale empirical analysis is performed in [9] to determine whether math-oriented reasoning training is also helpful in other domains. This analysis includes both a wide audit of existing models across math reasoning, general reasoning, and non-reasoning benchmarks, as well as a comparison of SFT and RL-based finetuning on Math-only data (i.e., ~47K prompts sourced from <a href="https://openreview.net/forum?id=I6GzDCne7U">DeepScaler</a> and <a href="https://arxiv.org/abs/2503.18892">SimpleRL</a>). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qxVz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qxVz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 424w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 848w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 1272w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qxVz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png" width="1456" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/371e172e-4696-496f-ad69-df1be408d98b_1876x808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qxVz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 424w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 848w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 1272w, https://substackcdn.com/image/fetch/$s_!qxVz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F371e172e-4696-496f-ad69-df1be408d98b_1876x808.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>As shown in the above plot, SFT-trained models tend to have poor transferability to non-reasoning tasks, while models trained with RL generalize across both reasoning and non-reasoning tasks&#8212;<em>RL models generalize well beyond math and naturally avoid catastrophic forgetting</em>. Similar trends are observed when analyzing the transferability of other open SFT or reasoning models across reasoning and non-reasoning benchmarks; see below. Further analysis in [9] reveals that on-policy data&#8212;<em>as we might expect from [2, 3]</em>&#8212;and the presence of a <a href="https://cameronrwolfe.substack.com/i/169926007/preference-fine-tuning-of-llms-should-leverage-suboptimal-on-policy-data-7">negative gradient</a> in the RL objective are key contributors to favorable generalization properties.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!crTD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!crTD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 424w, https://substackcdn.com/image/fetch/$s_!crTD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 848w, https://substackcdn.com/image/fetch/$s_!crTD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!crTD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!crTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png" width="1456" height="1184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:532319,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/183759600?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!crTD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 424w, https://substackcdn.com/image/fetch/$s_!crTD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 848w, https://substackcdn.com/image/fetch/$s_!crTD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!crTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F308a6aad-0ce2-4777-a880-c32a7d220fcc_1864x1516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><h2>Conclusion</h2><p>In continual learning, we want the model to learn new tasks quickly while preserving old capabilities. When studying recent work on continual learning for LLMs, a consistent pattern emerges: <em>on-policy RL is naturally more robust to catastrophic forgetting relative to SFT, even without explicit mechanisms to aid the continual learning process</em>. This advantage appears to stem from the online nature of RL, which biases learning toward low distribution shift (or low KL) solutions and avoids destructive updates induced by offline data. The natural continual learning abilities of RL have broader implications for the emergence of AGI, as adaptability is a key prerequisite for generally intelligent systems. The studies seen in this overview use only simple, structured proxies for continual learning in the real world, which will be much messier. However, these results show that RL&#8212;<em>an already impactful training paradigm</em>&#8212;is a promising starting point for building general systems that can adapt to any task. In this way, continuing the existing trajectory of LLM research may yield natural progress for continual learning. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Lai, Song, et al. &#8220;Reinforcement fine-tuning naturally mitigates forgetting in continual post-training.&#8221; <em>arXiv preprint arXiv:2507.05386</em> (2025).</p><p>[2] Chen, Howard, et al. &#8220;Retaining by doing: The role of on-policy data in mitigating forgetting.&#8221; <em>arXiv preprint arXiv:2510.18874</em> (2025).</p><p>[3] Shenfeld, Idan, Jyothish Pari, and Pulkit Agrawal. &#8220;Rl&#8217;s razor: Why online reinforcement learning forgets less.&#8221; <em>arXiv preprint arXiv:2509.04259</em> (2025).</p><p>[4] Lu, Kevin et al. &#8220;On-Policy Distillation.&#8221; <a href="https://thinkingmachines.ai/blog/on-policy-distillation/">https://thinkingmachines.ai/blog/on-policy-distillation/</a> (2025).</p><p>[5] Meng, Yu, Mengzhou Xia, and Danqi Chen. &#8220;Simpo: Simple preference optimization with a reference-free reward.&#8221; <em>Advances in Neural Information Processing Systems</em> 37 (2024): 124198-124235.</p><p>[6] Diao, Muxi, et al. &#8220;Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting.&#8221; <em>arXiv preprint arXiv:2601.02151</em> (2026).</p><p>[7] Chu, Tianzhe, et al. &#8220;Sft memorizes, rl generalizes: A comparative study of foundation model post-training.&#8221; <em>arXiv preprint arXiv:2501.17161</em> (2025).</p><p>[8] Cheng, Sitao, et al. &#8220;From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning.&#8221; <em>arXiv preprint arXiv:2512.01970</em> (2025).</p><p>[9] Huan, Maggie, et al. &#8220;Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning.&#8221; <em>arXiv preprint arXiv:2507.00432</em> (2025).</p><p>[10] McCloskey, Michael, and Neal J. Cohen. &#8220;Catastrophic interference in connectionist networks: The sequential learning problem.&#8221; <em>Psychology of learning and motivation</em>. Vol. 24. Academic Press, 1989. 109-165.</p><p>[11] Kirkpatrick, James, et al. &#8220;Overcoming catastrophic forgetting in neural networks.&#8221; <em>Proceedings of the national academy of sciences</em> 114.13 (2017): 3521-3526.</p><p>[12] Rebuffi, Sylvestre-Alvise, et al. &#8220;icarl: Incremental classifier and representation learning.&#8221; <em>Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</em>. 2017.</p><p>[13] Castro, Francisco M., et al. &#8220;End-to-end incremental learning.&#8221; <em>Proceedings of the European conference on computer vision (ECCV)</em>. 2018.</p><p>[14] Chaudhry, Arslan, et al. &#8220;On tiny episodic memories in continual learning.&#8221; <em>arXiv preprint arXiv:1902.10486</em> (2019).</p><p>[15] Hayes, Tyler L., et al. &#8220;Remind your neural network to prevent catastrophic forgetting.&#8221; <em>European conference on computer vision</em>. Cham: Springer International Publishing, 2020.</p><p>[16] Rannen, Amal, et al. &#8220;Encoder based lifelong learning.&#8221; <em>Proceedings of the IEEE international conference on computer vision</em>. 2017.</p><p>[17] Shin, Hanul, et al. &#8220;Continual learning with deep generative replay.&#8221; <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[18] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. &#8220;Distilling the knowledge in a neural network.&#8221; <em>arXiv preprint arXiv:1503.02531</em> (2015).</p><p>[19] Li, Zhizhong, and Derek Hoiem. &#8220;Learning without forgetting.&#8221; <em>IEEE transactions on pattern analysis and machine intelligence</em> 40.12 (2017): 2935-2947.</p><p>[20] Wu, Yue, et al. &#8220;Large scale incremental learning.&#8221; <em>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em>. 2019.</p><p>[21] Aljundi, Rahaf, et al. &#8220;Memory aware synapses: Learning what (not) to forget.&#8221; <em>Proceedings of the European conference on computer vision (ECCV)</em>. 2018.</p><p>[22] Dhar, Prithviraj, et al. &#8220;Learning without memorizing.&#8221; <em>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em>. 2019.</p><p>[24] Rusu, Andrei A., et al. &#8220;Progressive neural networks.&#8221; <em>arXiv preprint arXiv:1606.04671</em> (2016).</p><p>[25] Draelos, Timothy J., et al. &#8220;Neurogenesis deep learning: Extending deep networks to accommodate new classes.&#8221; <em>2017 international joint conference on neural networks (IJCNN)</em>. IEEE, 2017.</p><p>[26] Guo, Haiyang, et al. &#8220;Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model.&#8221; <em>arXiv preprint arXiv:2503.12941</em> (2025).</p><p>[27] Zhao, Hongbo, et al. &#8220;Mllm-cl: Continual learning for multimodal large language models.&#8221; <em>arXiv preprint arXiv:2506.05453</em> (2025).</p><p>[28] Li, Hongbo, et al. &#8220;Theory on mixture-of-experts in continual learning.&#8221; <em>arXiv preprint arXiv:2406.16437</em> (2024).</p><p>[29] Liu, Wenzhuo, et al. &#8220;LLaVA-c: Continual Improved Visual Instruction Tuning.&#8221; <em>arXiv preprint arXiv:2506.08666</em> (2025).</p><p>[30] Maharana, Adyasha, et al. &#8220;Adapt-$\infty $: Scalable continual multimodal instruction tuning via dynamic data selection.&#8221; <em>arXiv preprint arXiv:2410.10636</em> (2024).</p><p>[31] Lee, Minjae, et al. &#8220;OASIS: Online Sample Selection for Continual Visual Instruction Tuning.&#8221; <em>arXiv preprint arXiv:2506.02011</em> (2025).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Earlier research papers on this topic also commonly use the term &#8220;catastrophic interference&#8221; to refer to the same concept as catastrophic forgetting.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The reference model is usually the initial policy prior to RL training, such as the SFT model or a base model. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>See Section 2.1 of <a href="https://arxiv.org/abs/2505.22617">this paper</a> for an exact explanation of how entropy is computed using token probabilities outputted by an LLM. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>More specifically, authors in [1] mention that, without any KL divergence term, the RL training process has to be resumed after a divergence numerous times for the final model to converge properly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is a play on words related to the concept of <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam&#8217;s Razor</a>, which suggests that the simplest solution (or the solution requiring the fewest assumptions or elements) is usually correct.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>For example, if we want to reduce the amount of forgetting when training with SFT, we can simply lower our learning rate [2]. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>For a system with <code>K</code> outcomes, the maximum entropy is <code>ln(K)</code>, which is the entropy of the uniform distribution; see <a href="https://en.wikipedia.org/wiki/Principle_of_maximum_entropy">here</a> for details. </p></div></div>]]></content:encoded></item><item><title><![CDATA[GRPO++: Tricks for Making RL Actually Work]]></title><description><![CDATA[How to go from the vanilla GRPO algorithm to functional RL training at scale...]]></description><link>https://cameronrwolfe.substack.com/p/grpo-tricks</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/grpo-tricks</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 05 Jan 2026 10:33:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/168ff804-da03-4ce5-84be-4f3f7322ff70_2500x1404.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZsCt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZsCt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 424w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 848w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 1272w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZsCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1672436,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZsCt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 424w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 848w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 1272w, https://substackcdn.com/image/fetch/$s_!ZsCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f464b50-cf2d-4537-992b-c65707832598_2487x1395.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 3, 4])</figcaption></figure></div><p>Recent research on large language models (LLMs) has been heavily focused on reasoning and reinforcement learning (RL). At the center of this research lies <a href="https://cameronrwolfe.substack.com/p/grpo">Group Relative Policy Optimization (GRPO)</a> [13], the RL optimizer used to train most open-source reasoning models. The popularity of GRPO is enhanced by its conceptual simplicity and practical efficiency. However, the simplicity of GRPO can be deceptive&#8212;<em>the vanilla GRPO algorithm has subtle issues that can hinder the RL training process, especially at scale</em>. Solving the shortcomings of GRPO has become a popular research topic, leading to the proposal of many tricks, best practices, and techniques for getting the most out of RL training. In this overview, we will outline all of this work, arriving at a deeper practical understanding of how to modify and use GRPO for training high-quality reasoning models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to understand AI research. Consider a paid subscription if you would like to help support the newsletter.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Background on Reasoning and RL</h2><p>Prior to covering recent work on improving GRPO, we will spend this section building a basic understanding of the GRPO algorithm in its original form. We will also learn about Proximal Policy Optimization (PPO) [11], the predecessor to GRPO, and discuss how RL is used in the context of LLMs and reasoning models more generally. Notably, this discussion will assume basic knowledge of the problem setup and terminology used for RL training with LLMs. Those who are less familiar with RL basics can learn more at the following links:</p><ul><li><p>RL Problem Setup &amp; Terminology [<a href="https://cameronrwolfe.substack.com/i/173306894/problem-setup-and-terminology-for-rl">link</a>]</p></li><li><p>Different RL Formulations for LLMs [<a href="https://cameronrwolfe.substack.com/i/173306894/markov-decision-process-mdp-versus-bandit-formulation">link</a>]</p></li><li><p>Policy Gradient Basics [<a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">link</a>]</p></li></ul><h4>RL for Reasoning</h4><blockquote><p><em>&#8220;Inference scaling empowers LLMs with unprecedented reasoning ability, with RL as the core technique to elicit complex reasoning.&#8221;</em> - from [1]</p></blockquote><p>GRPO is the most common RL optimizer to use for training reasoning models. Before diving deeper into the details of GRPO, we need to build an understanding of how RL is actually used to train LLMs. In particular, there are two key types of RL training that are commonly used:</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> trained on human preferences.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.</p></li></ul><p>The main difference between RLHF and RLVR is how we assign rewards&#8212;<em>RLHF uses a reward model, while RLVR uses verifiable rewards</em>. Aside from this difference, both are online RL algorithms with a similar structure; see below. GRPO is one possible RL optimizer that can be used to derive the policy update in this pipeline, though any RL optimizer (e.g., <a href="https://cameronrwolfe.substack.com/p/ppo-llm">PPO</a> or <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a>) can be used. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">General framework for online RL</figcaption></figure></div><p>Given that RLHF focuses on aligning an LLM to human preferences, it is used more heavily for chat models and is less applicable to reasoning<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Most reasoning models are trained using RL in verifiable domains (e.g., math and coding), so we will primarily focus on the RLVR setup for the remainder of this post.</p><p><strong>More on RLVR.</strong> To train an LLM with RLVR, we must select a domain that is verifiable in nature; e.g., math or coding. In other words, we need to create a dataset that has either <em>i)</em> a known ground truth answer or <em>ii)</em> some rule-based technique that can be used to verify the correctness of responses to the prompts in our dataset. For coding, we can create a sandbox for running LLM-generated code and use test cases to assess correctness. Similarly, we can evaluate math problems by performing basic string matching between the answer predicted by the LLM and a ground-truth answer for a problem; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="1456" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Verifying a problem with exact string matching</figcaption></figure></div><p>Usually, we must instruct the LLM to format its output so the final answer can be easily parsed. As an example, <a href="https://github.com/huggingface/Math-Verify">Math Verify</a> is a popular package that was built for performing robust verification in the math domain. Even then, however, string matching is not always sufficient for evaluating correctness. In many cases, we can benefit from crafting validation logic that is more robust (e.g., asking an LLM to identify equivalent answers) and that captures variations in output.</p><blockquote><p><em>&#8220;Math verification is determined by an LLM judge given the ground truth solution and DeepSeek-R1 solution attempt. We found that using an LLM judge instead of a stricter parsing engine (Math-Verify) for verification during data generation results in a higher yield and leads to higher performing downstream models.&#8221;</em> - <a href="https://www.bespokelabs.ai/blog/scaling-up-open-reasoning-with-openthinker-32b">source</a></p></blockquote><p><strong>Reasoning models </strong>are structurally identical to a standard LLM. The key distinction between reasoning models and LLMs is the ability to &#8220;think&#8221; about a prompt prior to providing a final output. By increasing the length of this thinking process, reasoning models can use <a href="https://cameronrwolfe.substack.com/i/152758713/reasoning-models-and-new-scaling-paradigms">inference-time scaling</a>&#8212;<em>or simply spend more compute on generating a particular completion&#8212;</em>to improve their performance. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Way8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" width="1456" height="1034" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Concrete example of a reasoning model&#8217;s full output</figcaption></figure></div><p>As shown above, this thinking process occurs in the form of a free-text, long chain-of-thought (CoT)&#8212;<em>also called a rationale or reasoning trajectory</em>&#8212;generated by the LLM. Many closed reasoning models (though not all of them!) hide the raw reasoning trace from the user, providing instead only a truncated version or summary of the reasoning process along with the model&#8217;s final answer.</p><p><strong>Learning to reason via RL.</strong> If we look at <a href="https://openai.com/index/learning-to-reason-with-llms/">some examples</a> of reasoning trajectories from open or closed reasoning models, we will notice that these models exhibit some sophisticated reasoning behaviors in their long CoT:</p><ul><li><p>Thinking through each part of a complex problem.</p></li><li><p>Decomposing complex problems into smaller, solvable parts.</p></li><li><p>Critiquing solutions and finding errors.</p></li><li><p>Exploring many alternative solutions.</p></li></ul><p>Such behavior goes beyond any previously-observed behavior with standard LLMs and <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain of thought prompting</a>. However, this behavior is not explicitly injected into the model&#8212;<em>it is naturally developed via large-scale RL training</em>!</p><div class="pullquote"><p>&#8220;One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection&#8212;where the model revisits and reevaluates its previous steps&#8212;and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model&#8217;s interaction with the reinforcement learning environment.&#8221; - from [2]</p></div><p>During RLVR, the model undergoes a self-exploration process in which it learns how to properly use its long CoT to solve reasoning problems. As evidence of this self-evolution process, we commonly observe during RL training that the average length of the model&#8217;s completions increases over time; see below. <em>The model naturally learns how to use more inference-time compute (by generating a longer reasoning trace) in order to solve difficult reasoning problems. </em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Training stages and Aha moments.</strong> As shown below, LLMs undergo training in several stages. However, reasoning models depart from the standard alignment procedure&#8212;<em>including <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> and RLHF</em>&#8212;by adding an extra RLVR training stage. Additionally, it is even common in RL research to use an RL-Zero setup in which we directly train the pretrained base model with RLVR. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zJ6B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zJ6B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 424w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 848w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zJ6B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png" width="534" height="310.2774725274725" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:278033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zJ6B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 424w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 848w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!zJ6B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F957933bb-5d75-4b01-9668-54adf3292637_1724x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The RL-Zero setup was popularized by DeepSeek-R1 [2], which showed that reasoning capabilities can be instilled in an LLM via pure RL (using GRPO) even with no SFT. Most notably, DeepSeek-R1-Zero&#8212;<em>the version of DeepSeek-R1 that is trained with an RL-Zero setup</em>&#8212;is found in [2] to have an &#8220;Aha moment&#8221; in which it learns to invest additional reasoning effort into re-thinking or evaluating its own responses inside of the reasoning trace; see below. This behavior emerges at an intermediate point in RL training and is a classic example of how self-exploration via RL can naturally lead an LLM to develop sophisticated reasoning behavior. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x8lX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x8lX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 424w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 848w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 1272w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x8lX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png" width="625" height="394.4672131147541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1220,&quot;resizeWidth&quot;:625,&quot;bytes&quot;:164095,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x8lX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 424w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 848w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 1272w, https://substackcdn.com/image/fetch/$s_!x8lX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab8e774-1764-47a6-a978-f593a30b1fdc_1220x770.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><h4>Proximal Policy Optimization (PPO)</h4><p>GRPO is based on the Proximal Policy Optimization (PPO) algorithm [11]. PPO was used in <a href="https://cameronrwolfe.substack.com/i/175107358/learning-to-summarize-from-human-feedback">seminal work on RLHF</a> and, as a result, was the default RL optimizer in the LLM domain for some time. Only after the advent of reasoning models did alternative algorithms like GRPO begin to gain traction for training LLMs. A full overview of PPO is linked below, but we will cover the key details in this section.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;433cb285-055c-4b5a-bfd6-69e12bac64ad&quot;,&quot;caption&quot;:&quot;A comprehensive and practical explanation of the Proximal Policy Optimization (PPO) algorithm and how it is used to train LLM with RL.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;PPO for LLMs: A Guide for Normal People&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-10-27T09:33:23.171Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61f107c1-95cb-4438-84b9-8d87c9cdc04f_2502x1408.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/ppo-llm&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:175107358,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:129,&quot;comment_count&quot;:12,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>The structure of training with PPO is outlined below. As we can see, each training iteration of PPO goes through the following sequence of steps:</p><ol><li><p>Sample a diverse batch of prompts.</p></li><li><p>Generate a completion from the policy for each prompt.</p></li><li><p>Compute advantage estimates for each completion.</p></li><li><p>Perform several policy updates over this sampled data.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1nc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" width="652" height="226.1401098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:652,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p><strong>Surrogate objective.</strong> In PPO, we formulate a loss function (also called the surrogate objective) that is optimized with respect to the parameters of our policy. The PPO loss function is based on the policy ratio (also called the importance ratio) between the current and &#8220;old&#8221; (i.e., before the first update in a training step) policies. The importance ratio stabilizes the training process by comparing the new policy&#8217;s token probabilities to the old policy and applying a weight (or importance) to training that helps to avoid drastic changes; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXsZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" width="554" height="219.92582417582418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:554,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Policy or importance ratio</figcaption></figure></div><p>To derive the surrogate objective for PPO, we begin with an unclipped objective that resembles the surrogate objective used in <a href="https://cameronrwolfe.substack.com/i/175107358/trust-region-policy-optimization-trpo">Trust Region Policy Optimization (TRPO)</a>; see below. Additionally, we introduce a clipped version of this objective by applying a clipping mechanism to the policy ratio <code>r_t(&#952;)</code>. Clipping forces the policy ratio to fall in the range <code>[1 - &#949;, 1 + &#949;]</code>. In other words, we avoid the policy ratio becoming too large or too small, ensuring that the token probabilities produced by the current and old policies remain relatively similar.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oHJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" width="1456" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121736,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The PPO surrogate objective</figcaption></figure></div><p>The PPO surrogate objective is simply the minimum of clipped and unclipped objectives, which makes it a pessimistic (lower bound) estimate for the unclipped objective. The behavior of the clipping mechanism in the surrogate loss changes depending on the sign of the advantage. The possible cases are shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ovlv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p>As we can see, taking the minimum of clipped and unclipped terms in the surrogate objective causes clipping to be applied in only one direction. The surrogate objective can be arbitrarily <em>decreased</em> by moving the importance ratio away from one, but clipping prevents the objective from being <em>increased</em> beyond a certain point by limiting the importance ratio. In this way, the clipping in PPO disincentivizes large policy ratios and, in turn, maintains a trust region by preventing large policy updates that could potentially damage our policy.</p><blockquote><p><em>&#8220;We only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.&#8221;</em> - from [11]</p></blockquote><p><strong>KL divergence.</strong> We often incorporate a KL divergence between the current policy and a reference policy&#8212;<em>usually the model from the beginning of training</em>&#8212;into RL. The KL divergence serves as a penalty that encourages similarity between the current and reference policies. We compute the KL divergence by comparing token distributions from the two LLMs for each token in a sequence. The easiest&#8212;<em>and most common</em>&#8212;way to approximate KL divergence [12] is via the difference in log probabilities between the policy and reference; see <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">here</a>.</p><p>After the KL divergence has been computed, there are two primary ways that it can be incorporated into the RL training process:</p><ol><li><p>By directly subtracting the KL divergence from the reward.</p></li><li><p>By adding the KL divergence to the loss function as a penalty term.</p></li></ol><p>PPO adopts the former option by subtracting the KL divergence directly from the reward signal used in RL training, as shown in the equation below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MMrI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" width="587" height="122.9635989010989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:305,&quot;width&quot;:1456,&quot;resizeWidth&quot;:587,&quot;bytes&quot;:188292,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Adding KL divergence to the reward in PPO</figcaption></figure></div><p><strong>Advantage estimation.</strong> The <a href="https://cameronrwolfe.substack.com/p/ppo-llm?open=false#%C2%A7problem-setup-and-terminology">advantage function</a>, a key part of PPO&#8217;s surrogate objective, is the difference between the <a href="https://cameronrwolfe.substack.com/i/173306894/problem-setup-and-terminology-for-rl">action-value and value function</a>: <code>A(s, a) = Q(s, a) - V(s)</code>. The value function in PPO is estimated with a learned model called the value model (or critic). This critic is a separate copy of our policy, or&#8212;<em>for better parameter efficiency</em>&#8212;an added value head that shares weights with the policy. The critic takes a completion as input and predicts expected cumulative reward on a per-token basis using an architecture that is similar to that of a reward model (i.e., transformer with a regression head); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fXOv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" width="682" height="224.36675824175825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:682,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The value function is on-policy&#8212;<em>it depends on the current parameters of our policy</em>. Unlike reward models, which are fixed at the beginning of RL training, the critic is trained alongside the LLM to keep its predictions on-policy&#8212;<em>this is known as an actor-critic setup</em>. To train the critic, we add an extra <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean-squared error (MSE) loss term</a>&#8212;<em>between the rewards predicted by the critic and the actual rewards</em>&#8212;to the PPO loss. Using the critic, we can estimate advantage using Generalized Advantage Estimation (GAE). The details of GAE are beyond the scope of this post, but a full explanation and implementation can be found <a href="https://cameronrwolfe.substack.com/i/175107358/generalized-advantage-estimation-gae">here</a>. </p><h4>Group Relative Policy Optimization (GRPO)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dzfC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:420310,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p>Group Relative Policy Optimization (GRPO) [13] builds upon PPO by proposing a simpler technique for estimating the advantage. In particular, GRPO estimates the advantage by sampling multiple completions&#8212;<em>or a &#8220;group&#8221; of completions</em>&#8212;for each prompt and using the rewards of these completions to form a <a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">baseline</a>. This group-derived baseline replaces the value function, which allows GRPO to forgo training a critic. Avoiding the critic drastically reduces GRPO&#8217;s memory and compute overhead compared to PPO. Additionally, since GRPO is commonly used for reasoning-oriented training, we typically pair it with verifiable rewards, which eliminates the need for a separate reward model.</p><blockquote><p><em>&#8220;We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.&#8221;</em> - from [13]</p></blockquote><p><strong>Advantage estimation in GRPO</strong> is performed by sampling multiple completions for each prompt and using the formulation shown below. This approach is very simple compared to PPO, which uses a learned value model and GAE. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nguf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211136,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Advantage computation in GRPO</figcaption></figure></div><p>In GRPO, completions to the same prompt form a group, and we calculate the advantage relative to other rewards in the group&#8212;<em>hence, the name &#8220;group relative&#8221; policy optimization</em>! More specifically, the advantage for completion <code>i</code> is calculated by first subtracting the mean reward over the group from <code>r_i</code>, then dividing this difference by the standard deviation of rewards over the group. The GRPO loss is assigned on a per-token basis, but we should note that the above formulation assigns the same advantage to every token <code>t</code> in completion <code>i</code>. The per-token loss is therefore dictated by the policy ratio, which varies for each token.</p><blockquote><p><em>&#8220;GRPO is often run with a far higher number of samples per prompt because the advantage is entirely about the relative value of a completion to its peers from that prompt.&#8221;</em> - <a href="http://rlhf%20book/">RLHF book</a></p></blockquote><p>Because we compute the advantage in a relative manner (i.e., based on rewards in the group), the number of completions we sample per prompt must be high to obtain a stable policy gradient estimate. Unlike GRPO, <a href="https://cameronrwolfe.substack.com/p/ppo-llm">PPO</a> and <a href="https://cameronrwolfe.substack.com/i/173306894/reward-increment-nonnegative-factor-x-offset-reinforcement-x-characteristic-eligibility-reinforce">REINFORCE</a> typically sample a single completion per prompt. However, sampling multiple completions per prompt has been explored by prior RL optimizers like <a href="https://cameronrwolfe.substack.com/i/173306894/reinforce-leave-one-out-rloo">RLOO</a>.</p><p><strong>Surrogate loss.</strong> Despite estimating the advantage differently, GRPO uses a loss function that is nearly identical to that of PPO. As shown below, GRPO uses the same clipping mechanism that is used by PPO for the importance ratio. This expression assumes an <a href="http://mdp%20formulation/">MDP formulation</a> and has been modified to explicitly aggregate the loss over multiple completions within a group. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6kXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6kXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 424w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 848w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1272w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png" width="1456" height="449" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192461,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6kXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 424w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 848w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1272w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GRPO surrogate loss</figcaption></figure></div><p><strong>KL divergence.</strong> One key difference between PPO and GRPO is the <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a> term being added as a penalty term to the surrogate loss, rather than subtracted from the reward. However, <em>we should note that the KL divergence is frequently omitted when training reasoning models</em>. In the context of RLHF, KL divergence enables model alignment without diverging significantly from the initial model, but this approach makes less sense when training long CoT reasoning models. The model&#8217;s behavior may diverge significantly from the initial model as it develops the ability to perform long CoT reasoning. All of the work that we will study in this overview omits the KL divergence term during RL training. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;9f13b191-c8f5-40c7-bf33-06795d43e7ad&quot;,&quot;caption&quot;:&quot;An approachable overview of Group Relative Policy Optimization (GRPO) and how it is used for reasoning-oriented RL training for LLMs. &quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Group Relative Policy Optimization (GRPO)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-11-24T10:33:31.743Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f98b75b5-c615-4139-a045-ad9572f3cf9f_2008x1130.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/grpo&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:177823868,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:81,&quot;comment_count&quot;:11,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Limitations of vanilla GRPO.</strong> For a full overview of GRPO, please see the above link. As we have seen, GRPO is a relatively simple algorithm. The popularity of GRPO was catalyzed by its use for training DeepSeek-R1 [2]. The openness of this work led GRPO to be adopted in open replications of reasoning models, as well as countless other research efforts. Despite its popularity, vanilla GRPO has several issues that become especially pronounced in large-scale RL training runs: </p><ul><li><p>Noise and instability during the training process. </p></li><li><p>Excessive response lengths, especially in incorrect answers.</p></li><li><p>Collapse of the LLM&#8217;s entropy (i.e., reduced exploration).</p></li><li><p>Poor sample efficiency and slow learning.</p></li></ul><p>Due to these issues, many open research efforts initially struggled to replicate the results reported by DeepSeek-R1 [1, 3], <em>indicating that some details necessary to achieve peak performance with GRPO may have been omitted from [2]</em>. This overview will study various works that have diagnosed such issues with GRPO, <em>uncovering a set of practical tricks that can be used to train better reasoning models at scale</em>. </p><h4>Assessing the Health of RL Training</h4><p>Despite the recent success of reasoning models, we must remember that training LLMs via RL is a complex process with many moving parts. We are working with multiple disjoint systems to train the model, each of which has unique settings that must be tuned. As described below, even simple changes to the RL training process can yield unexpected results or completely derail the model. When issues occur, it can be hard to know exactly what went wrong, and the high cost of RL training can make debugging these issues slower and more difficult. To quickly identify issues and iterate on our RL training setup, we need intermediate metrics that allow us to efficiently monitor the health of the training process. </p><div class="pullquote"><p>&#8220;Reinforcement learning on large language models is&#8230; an intrinsically complex systems engineering challenge, characterized by the interdependence of its various subsystems. Modifications to any single subsystem can propagate through the system, leading to unforeseen consequences due to the intricate interplay among these components. Even seemingly minor changes&#8230; can amplify through iterative reinforcement learning processes, yielding substantial deviations in outcomes.&#8221; - from [1]</p></div><p><strong>Health checks.</strong> The key training and policy metrics that can be monitored to catch issues with our RL setup are as follows:</p><ol><li><p><em><strong>Response length</strong></em> should increase during reasoning RL as the policy learns how to effectively leverage its long CoT. Average response length is closely related to training stability, but response length does not always monotonically increase&#8212;<em>it may stagnate or even decrease</em>. Excessively long response lengths are also a symptom of a faulty RL setup. </p></li><li><p><em><strong>Training reward</strong></em> should increase in a stable manner throughout training. A noisy or chaotic reward curve is a clear sign of an issue in our RL setup. However, training rewards do not always accurately reflect the model&#8217;s performance on held-out data&#8212;<em>RL tends to overfit to the training set</em>.</p></li><li><p><em><strong>Entropy</strong></em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> of the policy&#8217;s next token prediction distribution serves as a proxy for exploration during RL training. We want entropy to lie in a reasonable range&#8212;<em>not too low and not too high</em>. Low entropy means that the next token distribution is too sharp (i.e., all probability is assigned to a single token), which limits exploration. On the other hand, entropy that is too high may indicate the policy is just outputting gibberish. Similarly to entropy, we can also monitor the model&#8217;s generation probabilities during RL training. </p></li><li><p><em><strong>Held-out evaluation</strong></em> should be performed to track our policy&#8217;s performance (e.g., average reward or accuracy) as training progresses. Performance should be monitored specifically on held-out validation data to ensure that no <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a> is taking place. This validation set can be kept (relatively) small to avoid reducing the efficiency of the training process.</p></li></ol><p>An example plot of these key intermediate metrics throughout the RL training process from DAPO [1] is shown below. To iterate upon our RL training setup, we should <em>i)</em> begin with a reasonable setup known to work well<em>,</em> <em>ii)</em> apply interventions to this setup<em>, </em>and<em> iii) </em>monitor these metrics for positive or negative impact. We will see many examples of such a workflow throughout this overview as we study various tweaks and improvements to the vanilla GRPO algorithm. </p><blockquote><p><em>&#8220;We typically use length in conjunction with validation accuracy as indicators to assess whether an experiment is deteriorating&#8230; the trend of reward increase [should be] relatively stable and does not fluctuate or decline significantly due to adjustments in experimental settings&#8230; we find that maintaining a slow upward trend in entropy is conducive to the improvement of model performance.&#8221;</em> - from [1]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qf3E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qf3E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 424w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 848w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qf3E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png" width="1456" height="1069" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1069,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:364894,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qf3E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 424w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 848w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!Qf3E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29be3a5d-dac3-454b-bb85-e6e72e931db8_2124x1560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>A note on batching and data.</strong> Prior to making algorithmic changes to GRPO, we should focus on the correctness of our data. GRPO needs (relatively) large batch sizes to work well. Using a small batch size in GRPO is one of the most common mistakes in RL training. To avoid this mistake, we should begin with a reasonable batch and group size (e.g., <a href="https://cameronrwolfe.substack.com/p/olmo-3">Olmo 3</a> [5] uses a batch size of 512 with 64 prompts and 8 rollouts per prompt) and test how varying the batch and group sizes impacts the metrics discussed above. For example, if a larger batch size makes our reward curve much more stable, then our initial batch size was <a href="https://x.com/willccbb/status/2000038557428457552">probably too small</a>.</p><p>As shown in recent RL research [9, 10], <em>curating the correct set of prompts is also essential</em>. More specifically, we want our data to be diverse in terms of topic and difficulty. For example, Olmo 3 [5] incorporates several domains&#8212;<em>math, coding, instruction following, and general chat</em>&#8212;into RL training and uses offline difficulty filtering to filter out prompts that are too easy or too difficult. Using another LLM to gauge prompt difficulty by measuring Pass@K performance is also a common filtering approach [9]. We see each data point multiple times during RL training, so data curricula<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> are less relevant. To make the most of our data, we simply want to ensure sufficient quality and diversity!</p><blockquote><p><em>&#8220;Where algorithmic changes can make models more robust to less balanced data, a crucial part of the current RL training is to have a diversity of difficulties in your data. With large batch sizes, the model should have questions that are trivial, somewhat challenging, and nearly impossible in each batch.&#8221;</em> - <a href="https://www.interconnects.ai/i/159577063/kimi-k-scaling-reinforcement-learning-with-llms">Nathan Lambert</a></p></blockquote><p>As a final note, certain categories of questions&#8212;<em>specifically those that are easily guessable without any true reasoning</em>&#8212;can damage the fidelity of RL training. For example, multiple choice questions can easily be reward hacked if the policy randomly guesses an answer to each question. Therefore, removing this style of easily-guessable questions from RL training is a common practice.  </p><h2>Improving upon Vanilla GRPO</h2><p>Now that we understand GRPO, we will learn about recent research that has identified (and solved) problems with the vanilla GRPO algorithm. Given the popularity of GRPO, many papers have been published on this topic. We will aim to review this work in a way that is both comprehensive and of sufficient depth. The section will begin with longer overviews of a few popular papers. After the longer overviews, we will provide a wider outline of the topic via shorter paper summaries and an exhaustive list of recent and notable publications.</p><h4><strong><a href="https://arxiv.org/abs/2503.14476">DAPO: An Open-Source LLM Reinforcement Learning System at Scale</a> [1]</strong></h4><p>Despite the impressive recent results achieved with reasoning models, many details needed to reproduce these results are concealed. In fact, even open models like DeepSeek-R1 [2] do not provide sufficient technical details to fully reproduce their results. A naive application of GRPO with <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen-2.5-32B</a> achieves a score of 30% on <a href="https://huggingface.co/datasets/HuggingFaceH4/aime_2024">AIME 2024</a>, <em>underperforming the score of 47% achieved in the DeepSeek-R1 technical report</em>. This difficulty in reproducing the results of DeepSeek-R1 hints at missing details that are necessary for stable, performant, and scalable RL.</p><blockquote><p><em>&#8220;The broader community has encountered similar challenges in reproducing DeepSeek&#8217;s results suggesting that critical training details may have been omitted in the R1 paper that are required to develop an industry-level, large-scale, and reproducible RL system.&#8221;</em> - from [1]</p></blockquote><p>In [1], authors aim to discover these missing details, arriving at four key changes to the vanilla GRPO algorithm that&#8212;<em>when applied in tandem</em>&#8212;match and surpass results observed in [2]. The modified GRPO algorithm derived in [1] is called the Decoupled Clip and <strong>D</strong>ynamic S<strong>a</strong>mpling <strong>P</strong>olicy <strong>O</strong>ptimization (DAPO) algorithm. All <a href="https://github.com/BytedTsinghua-SIA/DAPO">code</a> (based on <a href="https://github.com/volcengine/verl">verl</a>) and <a href="https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k">data</a> are openly released to support future research.</p><p><strong>Vanilla GRPO.</strong> When running the vanilla GRPO algorithm, authors notice several issues in the training process, including:</p><ul><li><p><em>Entropy collapse</em>: the entropy of the model&#8217;s next token distribution collapses during the training process. Probability mass is primarily assigned to a single token and outputs are more deterministic.</p></li><li><p><em>Reward noise</em>: the training reward is very noisy and does not steadily increase during the RL training process.</p></li><li><p><em>Training instability</em>: the training process is unstable and may diverge. We do not observe a steady increase in response length during training.</p></li></ul><p>To mitigate these issues, authors propose the following four solutions in [1].</p><p><strong>(1) Clip higher.</strong> As mentioned previously, authors in [1] observe entropy collapse when training models with vanilla GRPO; see below. When entropy declines, the next token distribution becomes concentrated on a single token, leading sampled responses in a group to be very similar. As a result, exploration becomes limited and the advantage computation in GRPO becomes less reliable&#8212;<em>each sample in the group will tend to receive the same reward, making group normalization difficult</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SNyV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SNyV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 424w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 848w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1272w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png" width="1456" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SNyV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 424w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 848w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1272w, https://substackcdn.com/image/fetch/$s_!SNyV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb039aeac-b8b3-442f-8efe-7a37c6a1679d_3130x1337.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Interestingly, we see in [1] that this entropy collapse is caused by the clipping operation in PPO and GRPO. To see why this occurs, let us consider two kinds of tokens to which clipping could be applied:</p><ol><li><p><em>Exploitation token</em>: a token that is already highly likely in the current policy.</p></li><li><p><em>Exploration token</em>: a low probability token in the current policy.</p></li></ol><p>Sampling lower probability tokens gives the model a chance to explore alternative tokens when searching for better completions. Clipping is applied to the policy ratio, or the ratio of a token&#8217;s probability after and before the policy update:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_t(\\theta) = \\frac{\\pi_\\theta(a_t | s_t)}{\\pi_{old}(a_t | s_t)}&quot;,&quot;id&quot;:&quot;IWBQMNHTUZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The policy ratio is constrained to a range of <code>[1 - &#949;, 1 + &#949;]</code>. This upper bound allows high probability (exploitation) tokens to become more probable, but it restricts increases in low probability (exploration) tokens. A concrete example of how the upper clipping bound can discourage exploration is explained below.</p><div class="pullquote"><p>&#8220;When <code>&#949; = 0.2</code> and [advantage is positive], consider two actions with probabilities <code>&#960;_old(a_t|s_t) = 0.01</code> and <code>0.9</code>. The upper bounds of the increased probabilities <code>&#960;_&#952;(a_t|s_t)</code> are <code>0.012</code> and <code>1.08</code>, respectively (i.e., <code>&#960;_old&#183;(1 + &#949;)</code>). This implies that exploitation tokens with a higher probability (e.g., <code>0.9</code>) are not constrained to get even extremely larger probabilities like <code>0.999</code>. Conversely, for low-probability exploration tokens, achieving a non-trivial increase in probability is considerably more challenging.&#8221; - from [1]</p></div><p>The &#8220;clip higher&#8221; approach, which decouples the lower and upper bound for clipping, is proposed as a solution to this problem. Specifically, we clip in the range <code>[1 - &#949;_low, 1 + &#949;_high]</code>, where <code>&#949;_low = 0.2</code> (default setting) and <code>&#949;_high = 0.28</code> in [1]. As shown in the figure above, increasing <code>&#949;_high</code> prevents entropy collapse and improves GRPO performance. On the other hand, authors note that <code>&#949;_low</code> should not be increased, as this would suppress some tokens to a probability of zero and collapse the token sampling space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Czeq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Czeq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 424w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 848w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 1272w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Czeq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png" width="401" height="260.9939455095863" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:991,&quot;resizeWidth&quot;:401,&quot;bytes&quot;:85796,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Czeq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 424w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 848w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 1272w, https://substackcdn.com/image/fetch/$s_!Czeq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7452cea9-b66f-474b-a3fd-72a8d156120e_991x645.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Ratio of samples with perfect accuracy throughout RL training (from [1])</figcaption></figure></div><p><strong>(2) Dynamic Sampling.</strong> Throughout the course of RL training, the number of samples for which all completions in a group are correct naturally increases; see above. Although this trend indicates that the model is improving, prompts with perfect accuracy are problematic for GRPO. If all completions in a group are correct (i.e., reward of one), then the advantage for each completion in the group and the corresponding policy gradient are zero. As a result, our batch size effectively becomes smaller because there are many elements in the batch with zero gradient&#8212;<em>leading to a noisier batch gradient and, in turn, degraded sample efficiency</em>. To solve this issue, we can perform dynamic sampling, which simply:</p><ol><li><p>Over-samples prompts for each batch.</p></li><li><p>Filters or removes all prompts with perfect accuracy. </p></li></ol><p>The sampling cost per batch is dynamic&#8212;<em>hence the name &#8220;dynamic sampling&#8221;</em>&#8212;and we simply continue sampling and filtering until we have a full batch. However, this additional sampling cost is typically offset by the improved sample efficiency of the algorithm. Put differently, the model tends to converge much faster when we filter out prompts with perfect accuracy (i.e., dynamic sampling); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!201L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!201L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 424w, https://substackcdn.com/image/fetch/$s_!201L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 848w, https://substackcdn.com/image/fetch/$s_!201L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1272w, https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png" width="1456" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!201L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 424w, https://substackcdn.com/image/fetch/$s_!201L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 848w, https://substackcdn.com/image/fetch/$s_!201L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1272w, https://substackcdn.com/image/fetch/$s_!201L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4484a60-e6ab-477f-89ab-9a68a851954f_1766x756.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>(3) Token-level loss.</strong> The GRPO surrogate objective is computed at a token level, but we must aggregate this objective over the batch before computing the policy update. This aggregation is performed at the sample level, as described below.</p><blockquote><p><em>&#8220;The original GRPO algorithm employs a sample-level loss calculation, which involves first averaging the losses by token within each sample and then aggregating the losses across samples. In this approach, each sample is assigned an equal weight in the final loss computation.&#8221;</em> - from [1]</p></blockquote><p>When aggregating at the sample level, each sample in the batch is assigned an equal weight in the GRPO loss. Although this approach may seem reasonable, it creates a subtle bias in our GRPO implementation&#8212;<em>tokens within long responses have a disproportionately lower contribution to the loss</em>. More specifically, a sample receives equal weight in the GRPO loss no matter its length. The contribution of each individual token is determined by its impact on the average loss for the sequence. Given that longer sequences contain a larger number of tokens, the impact of an individual token is muted when it exists in a longer sequence. </p><p>This length bias makes learning from high-quality, longer samples&#8212;<em>or punishing patterns in low-quality samples</em>&#8212;difficult in vanilla GRPO. As evidence of this bias, we often see that excessively long samples tend to contain noticeable artifacts like repeated words or gibberish. Luckily, this problem has an easy solution: <em>we can just aggregate the loss via an average over all tokens, thus weighting the contribution of each token equally</em>. As shown below, this modification has a clear impact on the health and stability of RL training, where we can observe a stable increase in the model&#8217;s entropy and response length throughout training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5gMs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5gMs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 424w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 848w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 1272w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5gMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png" width="1456" height="599" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5gMs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 424w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 848w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 1272w, https://substackcdn.com/image/fetch/$s_!5gMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F902ca880-8266-4469-a56e-bce9ced3437b_2080x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>(4) Overlong reward shaping.</strong> The final improvement proposed for GRPO in [1] is related to the handling of truncated samples. During RL training, we usually impose a maximum generation length for rollouts to improve efficiency, but the policy does not always adhere to this maximum length. In some cases, the policy will attempt to generate a sample that is too long, and we will have to truncate this sample to the maximum length. The default response to this behavior in RL is punishment&#8212;<em>we simply provide a negative reward for any truncated samples</em>. </p><p>Interestingly, authors in [1] show that how we shape this punitive reward for truncated samples is important and can lead to training instability if handled incorrectly. For example, <em>what if the policy&#8217;s reasoning process was totally valid but just too long?</em> Assigning a negative reward to such a case could confuse the model. To test this theory, authors perform an experiment in which truncated samples are masked&#8212;<em>meaning they have no contribution to the policy update</em>&#8212;in the GRPO loss instead of being negatively reinforced. As shown in the figure below, this overlong filtering strategy improves both performance and training stability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SpHL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SpHL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 424w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 848w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 1272w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SpHL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png" width="1456" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312338,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SpHL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 424w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 848w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 1272w, https://substackcdn.com/image/fetch/$s_!SpHL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff75db185-318f-44f2-8316-13dc8d0dfb4d_2192x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Additionally, a length-aware penalty is proposed that assigns a soft punishment to truncated samples. In particular, we define both a maximum generation length (<code>L_max</code>) and a cache length (<code>L_cache</code>), which together form the punishment interval <code>[L_max - L_cache, L_max]</code>. Any generation that exceeds <code>L_max</code> tokens in length will receive a maximum penalty of <code>-1</code>, while any generation less than <code>L_max - L_cache</code> tokens in length will have no penalty. Within the punishment interval, however, the negative reward is dynamically adjusted based on the length of the sample; see below. This soft overlong punishment is directly added to the verifiable reward in GRPO. A maximum length of 16K tokens and cache length of 4K tokens are used for DAPO experiments in [1]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b6KK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b6KK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 424w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 848w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 1272w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b6KK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png" width="1456" height="431" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:205473,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b6KK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 424w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 848w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 1272w, https://substackcdn.com/image/fetch/$s_!b6KK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6698a0-1b18-4617-a7a6-68fda7eb709b_2201x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Soft overlong punishment formulation (from [1])</figcaption></figure></div><p>The full DAPO algorithm, which combines the four modifications described above, is formulated by the algorithm and objective function provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hW2Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 424w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 848w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png" width="1456" height="919" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:919,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:559217,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hW2Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 424w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 848w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!hW2Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed791d47-2965-48f7-8fef-eca918cb7d93_2122x1339.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Experiments</strong> in [1] are conducted with the <a href="https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k">DAPO-Math-17K</a> dataset<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, which contains 17K prompts. The dataset is purposely curated so that answers are formatted as integers, making parsing and verification simple. Experiments are only performed in the math domain, but this is a <a href="https://cameronrwolfe.substack.com/i/179769076/rlvr-with-grpo">common approach</a> for evaluating algorithmic changes in RL. Due to the high cost of experimentation, researchers frequently use math RL as a testbed and assume that most findings will translate reasonably well to other domains. The Qwen-2.5-32B base model is selected to match the RL-Zero training setup of DeepSeek-R1 [2]. As shown below, accuracy on AIME increases from 0% to 50% after training with DAPO, exceeding the 47% accuracy achieved in [2].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MlGx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MlGx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 424w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 848w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MlGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png" width="1456" height="635" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:346741,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MlGx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 424w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 848w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!MlGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96b64bbe-4035-457e-a32f-95a6fcfe39cc_2524x1100.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>This performance is achieved using only half of the training steps required to train DeepSeek-R1-Zero-Qwen-32B, <em>showcasing the improved sample efficiency of DAPO</em>. In contrast, vanilla GRPO achieves an accuracy of only 30% on this benchmark. All four DAPO modifications are shown to clearly benefit final performance; see below. Although we see the smallest accuracy boost from the token-level loss, this modification makes the training process more stable. The improved health of the RL training process with DAPO is evidenced by stable increases in average response length, entropy, and training reward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AZvw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AZvw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 424w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 848w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 1272w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AZvw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png" width="487" height="256.3157894736842" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:988,&quot;resizeWidth&quot;:487,&quot;bytes&quot;:82694,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AZvw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 424w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 848w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 1272w, https://substackcdn.com/image/fetch/$s_!AZvw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F829d3929-8e3f-4767-b5b0-6e350c0d65e5_988x520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2503.20783">Understanding r1-zero-like training: A critical perspective</a> [3]</h4><p>When performing RL-Zero-style training (i.e., RL training applied directly to a base model), there are two key aspects of our training setup to consider:</p><ol><li><p>The base model.</p></li><li><p>The RL training setup.</p></li></ol><p>In [3], authors perform a deep investigation into these two aspects to better understand <em>i)</em> the impact of pretraining on performance after RL and <em>ii)</em> the dynamics of the RL training process in general. This investigation uncovers several interesting properties of base models that are commonly used in open RL recipes. Additionally, several biases are discovered in the GRPO loss formulation that are shown to degrade training stability and artificially inflate the length of incorrect responses. As a solution, authors propose GRPO done right (or Dr. GRPO), which uses a different advantage formulation and modified loss aggregation strategy to improve stability and address biases in GRPO.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rTJe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rTJe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 424w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 848w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rTJe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png" width="1456" height="822" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:394729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rTJe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 424w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 848w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!rTJe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Base models.</strong> Several pretrained base models are tested in [3]&#8212;<em>with a focus upon Qwen-2.5 (i.e., commonly used in open RL-Zero recipes) and DeepSeek-V3-Base (i.e., the original base model used for DeepSeek-R1-Zero [2])</em>&#8212;by analyzing their responses to a set of 500 questions from <a href="https://huggingface.co/datasets/EleutherAI/hendrycks_math">MATH</a>. The results of this analysis are summarized in the figure above and focus on two major questions:</p><ol><li><p>Can we elicit better reasoning skills by changing the template used for prompting the base model?</p></li><li><p>Do base models already exhibit reasoning and self-reflection behaviors (i.e., the &#8220;Aha moment&#8221; of DeepSeek-R1) prior to RL training?</p></li></ol><p><strong>(1) Templates.</strong> Base models are trained using <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> and have not yet undergone any alignment. As a result, these models struggle with instruction following, making the exact template used for prompting the model important. To better understand how the selected prompt template influences base model performance, three different styles of templates are tested in [3]; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8_-x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8_-x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 424w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 848w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 1272w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8_-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png" width="1456" height="516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8_-x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 424w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 848w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 1272w, https://substackcdn.com/image/fetch/$s_!8_-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F307e7f03-33e8-4536-bfab-a4957ad8e3d4_1840x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>To determine the template that is most suitable for each model, <a href="https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/">GPT-4o-mini</a> is used to assess whether questions are answered with the correct output format<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. As shown in the figure below, the choice of template significantly influences model performance, but the most suitable template varies by model. For example, Qwen-2.5 models perform best with no template, while DeepSeek-V3 base [14] performs very poorly unless the correct chat template is used. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_LC_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_LC_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 424w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 848w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 1272w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_LC_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png" width="362" height="330.71604938271605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:810,&quot;resizeWidth&quot;:362,&quot;bytes&quot;:84041,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_LC_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 424w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 848w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 1272w, https://substackcdn.com/image/fetch/$s_!_LC_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379068d9-ef36-4cf1-b336-1e94a896b781_810x740.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><blockquote><p><em>&#8220;Since Qwen2.5 uses chat model&#8217;s data (question-answer pairs) during the pretraining stage, we hypothesize that they might pretrain on the concatenated text&#8230; If our hypothesis turns out true, we shall be more careful about using Qwen2.5 models to reproduce DeepSeek-R1-Zero, since the base models are already SFT-like without templates.&#8221;</em> - from [3]</p></blockquote><p>Using a concatenated question-answer format with no template for Qwen-2.5 models leads to a 60% performance improvement, demonstrating the importance of understanding the unique properties of each base model used for RL training. In the case of Qwen-2.5, these results indicate that the base model was pretrained on concatenated question-answer data. If true, this hypothesis has significant implications for RL-Zero training&#8212;<em>the base model has already undergone SFT-like training over question-answer pairs and thus cannot truly be considered an unaligned base model for RL-Zero-style training</em>. However, validating this hypothesis is not possible because Qwen models do not openly disclose their training data.</p><p>Using the correct template is beneficial to the base model, but the impact is less pronounced after RL training. Similar performance is reached by most templates after RL training, despite large initial performance differences in the base model; see below. This finding hints that the performance benefits of RL may be more modest than is typically reported&#8212;<em>model performance can be artificially deflated prior to RL training based on the exact template being used</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vMra!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vMra!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 424w, https://substackcdn.com/image/fetch/$s_!vMra!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 848w, https://substackcdn.com/image/fetch/$s_!vMra!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 1272w, https://substackcdn.com/image/fetch/$s_!vMra!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vMra!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png" width="1456" height="525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vMra!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 424w, https://substackcdn.com/image/fetch/$s_!vMra!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 848w, https://substackcdn.com/image/fetch/$s_!vMra!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 1272w, https://substackcdn.com/image/fetch/$s_!vMra!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ac409f5-b8df-4cac-a3a9-ec1ace8e72cd_2448x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Interestingly, the ability of RL to restore full performance may depend on data coverage. More specifically, if the model and prompt template are aligned well (i.e., meaning the base model initially performs well with that prompt template), we can achieve performance benefits from RL training even on very narrow datasets (e.g., <a href="https://huggingface.co/datasets/openai/gsm8k">GSM-8K</a>). However, if there is a mismatch between the base model and prompt template being used, the observed performance after RL training may suffer unless a diverse dataset with wider coverage is used; see above. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-UZ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-UZ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 424w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 848w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 1272w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-UZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png" width="445" height="298.54430379746833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1027,&quot;resizeWidth&quot;:445,&quot;bytes&quot;:123966,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-UZ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 424w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 848w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 1272w, https://substackcdn.com/image/fetch/$s_!-UZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9610d8-8c14-4c69-b163-80cf4ba0f8a4_1027x689.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>(2) Reasoning performance.</strong> After determining the most suitable template per model, authors also measure the Pass@8 performance using various temperature settings to measure base model exploration capabilities. If a base model cannot generate at least one viable solution among several rollouts, then improving reasoning capabilities with RL will be difficult&#8212;<em>the model cannot learn to answer problems correctly via exploration</em>. The results of this test are outlined in the figure shown above, where we see that all models have a non-zero success rate for solving reasoning problems when sampling multiple rollouts. The Qwen-2.5 and DeepSeek-V3 models already demonstrate impressive Pass@8 performance. </p><div class="pullquote"><p><em>&#8220;If a base policy cannot even sample a single trajectory that leads to the correct final answer, it is impossible for reinforcement learning to improve the policy because there is no reward signal.&#8221; - from [3]</em></p></div><p><strong>(2.5) Aha moment.</strong> The presence of an Aha moment in the RL training process of DeepSeek-R1-Zero [2] was a huge discovery in AI research, <em>as it indicates that sophisticated reasoning behaviors can emerge naturally from RL training</em>. However, researchers have struggled to reproduce this behavior with open models, leading many to question whether self-reflection is truly an emergent property of RL. One popular explanation for these difficulties is that base models may already exhibit self-reflection behavior prior to RL training, leading this behavior to just be emphasized&#8212;<em>rather than completely learned</em>&#8212;during the RL training process. </p><blockquote><p><em>&#8220;Although self-reflection behaviors occur more frequently in R1-Zero, we observe that these behaviors are not positively correlated with higher accuracy.&#8221;</em> - from [3]</p></blockquote><p>To test this theory, authors in [3] analyze DeepSeek-V3-Base for patterns of self-reflection on the MATH dataset. This analysis reveals that the base model already uses self-reflection in a large number of queries; see below. We can find from simple keyword searches that the model outputs many &#8220;Aha&#8221; or &#8220;wait&#8221; tokens, revealing that self-reflection behavior may not be purely developed via RL. Interestingly, RL training does increase the frequency of self-reflection in the model&#8217;s output, <em>but this behavior is not found to measurably improve performance</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eRxf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eRxf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 424w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 848w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 1272w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eRxf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png" width="389" height="288.518756698821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:933,&quot;resizeWidth&quot;:389,&quot;bytes&quot;:120309,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b54eec-851f-4ef5-a2ea-74c597534432_1846x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!eRxf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 424w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 848w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 1272w, https://substackcdn.com/image/fetch/$s_!eRxf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5cb55c64-63f1-4a00-bc8b-80ecbf45c825_933x692.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>GRPO biases.</strong> In addition to analyzing properties of base models, authors in [3] point out a few problematic biases in GRPO, as well as recommend a modified algorithm&#8212;<em>called GRPO Done Right (or Dr. GRPO)</em>&#8212;to fix these biases. When an LLM is trained using vanilla GRPO, we usually observe a clear increase in the model&#8217;s average response length throughout training. Such increasing response length is usually attributed to the development of long CoT reasoning abilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_v4J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_v4J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 424w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 848w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_v4J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png" width="1456" height="743" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88919476-c660-43d9-8416-c959312ae751_2226x1136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:743,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:343414,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_v4J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 424w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 848w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!_v4J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88919476-c660-43d9-8416-c959312ae751_2226x1136.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Going against common intuition for reasoning models, however, we see in [3] that this increase in response length is partially attributable to fundamental biases in the GRPO objective function. In fact, we even see in [3] that GRPO continues to increase response length after rewards begin to plateau; see above. Additionally, output lengths become noticeably longer for incorrect responses throughout the course of training, <em>revealing a bias towards artificially inflating response lengths in GRPO</em>. Specifically, there are two key biases that exist in the GRPO objective:</p><ol><li><p><em>Response-level length bias</em>: GRPO normalizes the summed loss of tokens in each sequence by the total number of tokens in that sequence, leading to biased gradient updates based on the length of each response. </p></li><li><p><em>Question-level difficulty biases</em>: the standard deviation term in the denominator of the advantage formulation in GRPO causes the advantage to become very large for questions that are either too easy (i.e., most responses have a reward of one) or too hard (i.e., most responses have a reward of zero)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gilS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gilS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 424w, https://substackcdn.com/image/fetch/$s_!gilS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 848w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1272w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png" width="1456" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gilS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 424w, https://substackcdn.com/image/fetch/$s_!gilS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 848w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1272w, https://substackcdn.com/image/fetch/$s_!gilS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6afc090-8b41-415a-b1eb-5e2436e10662_1702x526.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Response lengths vary during RL training, so the loss is normalized dynamically based on the length of each sequence. The response-level length bias observed in [3] matches findings in [1] that motivated the use of a token-level loss to avoid sequence lengths influencing each token&#8217;s contribution to the loss. Normalizing the GRPO loss on a sequence level leads to larger gradient updates for shorter responses&#8212;<em>or smaller gradient updates for longer responses</em>&#8212;when the advantage is positive. When advantage is negative, however, long responses are penalized less, leading longer responses to be preferred among incorrect outputs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wdeO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wdeO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 424w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 848w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 1272w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wdeO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png" width="1456" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77157229-818d-493e-a494-b24462431a61_2106x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:160134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wdeO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 424w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 848w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 1272w, https://substackcdn.com/image/fetch/$s_!wdeO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77157229-818d-493e-a494-b24462431a61_2106x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Put differently, <em>GRPO biases the model towards overthinking by using more tokens for incorrect answers! </em>To avoid the length bias from sequence-level aggregation, we can divide the sum of losses in each sequence by a fixed constant rather than the total number of tokens in the sequence; see above for an example implementation.</p><p><strong>Dr. GRPO</strong> is a modified version of GRPO proposed in [3] to fix the biases outlined above. Compared to vanilla GRPO, Dr. GRPO makes two key modifications:</p><ol><li><p>Normalizing the summed loss of each sequence by a fixed constant, rather than by the number of tokens in the sequence.</p></li><li><p>Removing the standard deviation term from the denominator of the advantage formulation. </p></li></ol><p>Dr. GRPO is formulated below, where we see that the loss is not normalized by sequence length. The loss is instead divided by the <code>MAX_TOKENS</code> constant, as shown in the above code snippet. Additionally, the advantage is computed by subtracting the group-level mean of rewards from the reward for each completion (i.e., no division by standard deviation). These changes are found to mitigate the aforementioned biases and  yield models that perform better on a per-token basis&#8212;<em>better performance is achieved while outputting fewer tokens on average</em>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wrhA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wrhA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 424w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 848w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1272w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png" width="1456" height="659" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:321680,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wrhA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 424w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 848w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1272w, https://substackcdn.com/image/fetch/$s_!wrhA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdee6e2d-9cfe-4d20-b84c-56831fe0dc43_2007x909.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Experiments.</strong> Dr. GRPO is implemented using the <a href="https://github.com/sail-sg/oat">Oat framework</a> and is <a href="https://github.com/sail-sg/understand-r1-zero">released openly</a>. Models are trained using the <a href="https://huggingface.co/datasets/EleutherAI/hendrycks_math">MATH</a> dataset and evaluated on a variety of benchmarks, including <a href="https://huggingface.co/datasets/Hothan/OlympiadBench">OlympiadBench</a>, <a href="https://huggingface.co/datasets/Maxwell-Jia/AIME_2024">AIME 2024</a>, <a href="https://huggingface.co/datasets/math-ai/amc23">AMC</a>, <a href="https://huggingface.co/datasets/math-ai/minervamath">Minverva Math</a>, and <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">Math500</a>. Rewards are derived based on correctness (i.e., correct responses receive a reward of one, while incorrect responses receive a reward of zero) using <a href="https://github.com/huggingface/Math-Verify">Math Verify</a>. When used to train the <a href="https://huggingface.co/Qwen/Qwen2.5-Math-7B">Qwen-2.5-Math-7B</a> model (with the Qwen-Math prompt template), the simple Dr. GRPO RL-Zero recipe achieves 43.3% accuracy on AIME 2024, which is state-of-the-art for a model of this scale; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEDx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEDx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 424w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 848w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iEDx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png" width="1456" height="610" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309410,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iEDx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 424w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 848w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!iEDx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6109ca52-5f1e-4da9-b250-51aeb05c54b5_2392x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>The training process for this model completes in ~27 hours on only eight A100 GPUs. Such a lightweight training setup is useful for research, as one can quickly iterate upon changes to the RL training process. The key findings from [3] are summarized in the figure below. Beyond the observed properties of base models and reported benefits of Dr. GRPO, authors in [3] find that continued, domain-specific pretraining is helpful for RL. Specifically, continually pretraining the Llama-3.2-3B model on math-specific data prior to RL-Zero training noticeably raises the model&#8217;s performance ceiling during the RL training process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ptG6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ptG6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 424w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 848w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 1272w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ptG6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png" width="1052" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1052,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165292,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ptG6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 424w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 848w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 1272w, https://substackcdn.com/image/fetch/$s_!ptG6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf34c5ee-9594-4b0d-964f-b3b37a2c9788_1052x522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><a href="https://fengyao.notion.site/off-policy-rl">Your Efficient RL Framework Secretly Brings You Off-Policy RL Training</a> [4]</h4><p>During RL training, we alternate between two key operations:</p><ol><li><p><em>Rollouts</em>: given a set of prompts, sample multiple completions for each prompt using the current LLM.</p></li><li><p><em>Policy Updates</em>: compute a weight update for the LLM using the sampled rollouts and the given objective function (e.g., from GRPO).</p></li></ol><p>The cost of the RL training process is notoriously high and typically dominated by rollout generation&#8212;<em>most of the time in RL is spent waiting for inference to finish</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. For example, profiling the <a href="https://cameronrwolfe.substack.com/i/179769076/rlvr-with-grpo">RL training process for Olmo 3</a> [5] reveals that 5-14&#215; more compute is spent on inference compared to policy updates. For this reason, most modern RL training frameworks use separate engines on the backend for generating rollouts and performing policy updates. Specifically, we usually use popular training frameworks like <a href="https://engineering.fb.com/2021/07/15/open-source/fsdp/">FSDP</a> or <a href="https://www.deepspeed.ai/training/">DeepSpeed</a> for policy updates, while optimized inference engines like <a href="https://docs.vllm.ai/en/latest/">vLLM</a> or <a href="https://docs.sglang.io/">SGLang</a>&#8212;<em>often with lower precision inference (e.g., </em><code>int8</code><em> or </em><code>fp8</code><em>) for added efficiency</em>&#8212;are used to generate rollouts.</p><blockquote><p><em>&#8220;In modern RL training frameworks, different implementations are used for rollout generation and model training&#8230; We show the implementation gap implicitly turns the on-policy RL to be off-policy.&#8221;</em> - from [4]</p></blockquote><p>For simplicity, we will refer to the engines used for sampling rollouts and computing policy updates as the sampler and learner engines, respectively.</p><p><strong>Gap between engines.</strong> One may naively assume that engine implementations should be similar, but the use of separate sampler and learner engines creates a mismatch in the code used for rollouts and policy updates. Even when engines share the same exact model parameters, <em>the token probabilities that they predict can differ significantly</em>; see below. In the worst case, token probabilities are completely contradictory between the two engines, meaning that the learner would not have generated the same completion as the sampler. In this case, the RL training process actually becomes <a href="https://cameronrwolfe.substack.com/p/online-rl">off-policy</a>, thus degrading performance. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YoVu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YoVu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 424w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 848w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1272w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png" width="471" height="321.17128874388254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1226,&quot;resizeWidth&quot;:471,&quot;bytes&quot;:240324,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2a193d2-ff35-4dca-92cb-5759188de78e_2444x1185.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YoVu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 424w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 848w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1272w, https://substackcdn.com/image/fetch/$s_!YoVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1f305b8-7ab7-4ea5-ac11-feb3c5a477e9_1226x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Difference in token probabilities created by the mismatch between sampler and learner engines (from [4])</figcaption></figure></div><p>We must address this implementation gap for RL training to be truly on-policy. To accomplish this, we could (obviously) take an engineering-centric approach&#8212;<em>just find and eliminate implementation differences so that the two engines yield identical token probabilities.</em> In [4], authors take this approach by identifying problem areas that contribute to differences in token probabilities, but the implementation gap still exists even after patching several issues in the engine code; see above.</p><p>To fully eliminate this implementation gap, we must chase down an even larger number of subtle issues that exist, such as precision differences throughout parts of the model or deviations in sampling code. Identifying and removing all of these bugs is a tedious engineering process that must be repeated any time a new (or even slightly modified) engine is used for RL. Going further, even if all of these issues are addressed, the LLM inference process is still <a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/">fundamentally non-deterministic</a>. As a result, <em>the engine gap can be minimized but not fully removed. </em>For these reasons, an engineering-centric solution, though conceptually simple, is resource-intensive and difficult to achieve in practice. </p><p><strong>Importance Sampling.</strong> Authors in [4] propose an algorithmic approach based on importance sampling for addressing the engine mismatch in RL. Formally, <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a> is a statistical method used to estimate properties (e.g., an expectation) of a target probability distribution <code>f(x)</code> by sampling from a proposal distribution <code>g(x)</code>. Usually, sampling from <code>g(x)</code> is much cheaper than sampling from <code>f(x)</code>, <em>which is the motivation for importance sampling</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEKF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEKF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 424w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 848w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1272w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png" width="608" height="202.52747252747253" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b69437bf-88b3-4485-b263-f2828f40db17_2288x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:608,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iEKF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 424w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 848w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1272w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://ionides.github.io/pubs/ionides08-jcgs.pdf">source</a>)</figcaption></figure></div><p>In other words, if sampling from <code>f(x)</code> is difficult, we can instead choose to draw samples from <code>g(x)</code> and just correct for the discrepancy between <code>f(x)</code> and <code>g(x)</code> by weighting each sample by the importance ratio <code>f(x) / g(x)</code>; see above. This concept can be directly applied in the context of RL! Namely, we can denote the token probabilities from our learner and sampler as <code>f(x)</code> and <code>g(x)</code>, respectively. From our prior discussion, we know that:</p><ol><li><p>Sampling from <code>g(x)</code> is much more efficient relative to <code>f(x)</code>.</p></li><li><p>There is a discrepancy between these two distributions. </p></li></ol><p>Therefore, importance sampling can be directly used to correct for this mismatch.</p><div class="pullquote"><p>&#8220;When direct Monte Carlo estimation of the expected value under a target distribution is difficult, importance sampling allows us to sample from an alternative distribution instead. In our case, the target distribution is <code>&#960;_learner</code>, but it is extremely slow to sample from. Using a separate backend (e.g., vLLM) for rollout generation means that we are sampling from <code>&#960;_sampler</code> instead. The discrepancy is then corrected by weighting each sample with an importance ratio.&#8221; - from [4]</p></div><p><strong>Truncated Importance Sampling (TIS) for RL.</strong> To understand how importance sampling can be practically implemented in the context of RL training, let&#8217;s begin with the <a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">most basic expression for a policy gradient</a>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fsIR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fsIR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 424w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 848w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 1272w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fsIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png" width="557" height="103.67239010989012" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:271,&quot;width&quot;:1456,&quot;resizeWidth&quot;:557,&quot;bytes&quot;:146351,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fsIR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 424w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 848w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 1272w, https://substackcdn.com/image/fetch/$s_!fsIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa679c7f-f7c6-42b4-aa7c-82c6f032c121_2144x399.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Basic policy gradient expression</figcaption></figure></div><p>In practice, the policy gradient that we can compute looks slightly different from this, as we are not using the same policy for sampling the rollout and computing the policy gradient. Rather, the actual expression we will use is shown below, where separate engines are used for the rollouts and policy gradient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sdvK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sdvK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 424w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 848w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 1272w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sdvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png" width="1456" height="525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sdvK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 424w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 848w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 1272w, https://substackcdn.com/image/fetch/$s_!sdvK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90144a7e-afdd-4e5c-b0a5-e62637ce34d8_2395x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Basic policy gradient expression with different engines and TIS</figcaption></figure></div><p>As shown above, importance sampling operates by weighting the policy gradient by the importance ratio <code>f(x) / g(x)</code>. For RL training, the importance ratio is computed as <code>&#960;_learner / &#960;_sampler</code> (i.e., the ratio of token probabilities from the learner and sampler engines). To make the policy update more stable, authors in [4] adopt truncated importance sampling (TIS), which simply caps the importance ratio at a maximum value of <code>&#961;</code>. The policy gradient is not changed much&#8212;<em>we just scale the gradient expression by the (truncated) importance ratio</em>. </p><blockquote><p><em>&#8220;While there has been extensive study on how to design a stable and effective importance sampling, in practice we find it usually sufficient to use a classical technique, truncated importance sampling.&#8221;</em> - from [4]</p></blockquote><p>We formulate TIS with a basic policy gradient expression above, but extending this idea to other RL optimizers is straightforward. In particular, we can just:</p><ul><li><p>Take the policy gradient expression for our RL optimizer of choice.</p></li><li><p>Scale the new policy gradient expression by the same importance ratio.</p></li></ul><p>For example, we can apply TIS to GRPO or PPO as shown below<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. After computing the policy gradient, we can multiply this gradient by the (truncated) importance ratio. We still scale the policy gradient by the importance ratio, but we substitute our standard policy gradient expression with that of GRPO.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9naH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9naH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 424w, https://substackcdn.com/image/fetch/$s_!9naH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 848w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1272w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png" width="1456" height="158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146433,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9naH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 424w, https://substackcdn.com/image/fetch/$s_!9naH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 848w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1272w, https://substackcdn.com/image/fetch/$s_!9naH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ddd67e1-c51f-4318-8bf5-47cd5f0c2498_3124x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Policy gradient with TIS for GRPO or PPO (from [4])</figcaption></figure></div><p><strong>Does TIS work?</strong> To determine whether TIS solves the mismatch problem, authors in [4] first conduct experiments using <a href="https://huggingface.co/Qwen/Qwen2.5-32B">Qwen-2.5-32B</a> with DAPO [1] on the DAPO-Math-17K dataset. Due to resource limitations, RL training is stopped after 250 iterations, but these initial iterations can be used to analyze the properties of the training process. An early stopping approach is commonly used to efficiently test interventions to the RL training process. As shown below, we see a clear boost in performance when TIS is used in DAPO&#8212;<em>TIS benefits performance significantly</em>. Additionally, we see that similar performance cannot be achieved by addressing implementation gaps between engines (i.e., an engineering-centric approach).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p2a0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p2a0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 424w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 848w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p2a0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png" width="1456" height="706" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:706,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:666787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p2a0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 424w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 848w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 1272w, https://substackcdn.com/image/fetch/$s_!p2a0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0df9cdab-04c4-46c4-a476-32b20af8ec20_2444x1185.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>Quantized rollouts</strong>, which refers to sampling rollouts in a lower numerical precision (e.g., <code>fp8</code> or <code>int8</code> instead of <code>bf16</code>), can be used to study the impact of the distribution gap between sampler and learner engines. We can increase this gap by lowering the precision used for generating rollouts. To test the impact of increasing the mismatch in this way, a <a href="https://verl.readthedocs.io/en/latest/start/quickstart.html">basic GSM8K setup</a> is used in [4], where rollouts are sampled using either <code>bf16</code> or <code>int8</code> precision.</p><p>Using lower precision is shown in [4] to increase the maximum difference in token probabilities from ~0.4 to ~1.0, thus confirming that quantized rollouts do measurably increase the gap between the sampler and learner. As shown below, performing regular PPO training with quantized rollouts results in noticeable performance deterioration. By using TIS, we can mitigate this issue and match the performance of the higher precision (<code>bf16</code>) training setup; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pon3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pon3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 424w, https://substackcdn.com/image/fetch/$s_!pon3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 848w, https://substackcdn.com/image/fetch/$s_!pon3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 1272w, https://substackcdn.com/image/fetch/$s_!pon3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pon3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png" width="1442" height="656" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1442,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:426867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pon3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 424w, https://substackcdn.com/image/fetch/$s_!pon3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 848w, https://substackcdn.com/image/fetch/$s_!pon3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 1272w, https://substackcdn.com/image/fetch/$s_!pon3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7a7871b-a359-448f-ac89-b2a664ff2695_1442x656.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Analyzing the impact of quantized rollouts further, we see in [4] that experiments using <code>int8</code> rollouts <em>i)</em> show clear signs of entropy collapse and <em>ii)</em> produce models with abnormally long average response lengths. <em>Both observations indicate poor health in the RL training process.</em> Entropy collapse is not observed when using <code>bf16</code> rollouts, revealing that the RL training process is negatively impacted by the mismatch introduced by quantized rollouts. However, using TIS is also found to effectively address the mismatch and reverse these observations; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s35R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s35R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 424w, https://substackcdn.com/image/fetch/$s_!s35R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 848w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png" width="1456" height="998" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:998,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1272279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s35R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 424w, https://substackcdn.com/image/fetch/$s_!s35R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 848w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!s35R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00672cb3-fa64-4514-9c6e-728659e36bc4_1908x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Although the <code>bf16</code> training setup is already stable, using TIS even with <code>bf16</code> rollouts is found to further improve entropy values, which can allow the model to explore more during RL; see above. Generally, TIS should not provide much of a benefit when the mismatch between sampler and learner engines is small&#8212;<em>the importance ratio in these cases is ~1.0 and the objective becomes identical to standard PPO or GRPO</em>. However, TIS does not deteriorate performance in these cases and can still yield some benefits, as shown in the case with <code>bf16</code> rollouts. </p><p><strong>What causes the gap?</strong> To conclude their analysis, authors in [4] study practical choices that can worsen the sampler-learner gap in RL. To quantify the size of the gap, token-level probability mismatch is measured per response&#8212;<em>either using the mean or maximum difference across tokens in the response</em>&#8212;over a set of 512 prompts from DAPO-Math-17K. From this analysis, we learn that:</p><ul><li><p>Mean mismatch tends to stay the same between most implementations&#8212;<em>the largest impact is observed in terms of maximum mismatch</em>. In other words, large sampler-learner gaps are characterized by a noticeable increase in the maximum token probability discrepancy across sequences. </p></li><li><p>Differences in parallelism strategies significantly increase the mismatch (e.g., <a href="https://robotchinwag.com/posts/demystifying-tensor-parallelism/https://insujang.github.io/2024-01-11/tensor-parallelism-and-sequence-parallelism-detailed-analysis/">sequence parallelism in the learner and tensor parallelism in the sampler</a>).</p></li><li><p>Using the same parallelism strategy with different settings (e.g., tensor parallelism with 2 versus 4 GPUs) is less problematic compared to using different distribution strategies altogether.</p></li><li><p>Using longer rollouts in RL tends to increase the sampler-learned gap. </p></li><li><p>Using different sampler backends (e.g., vLLM, SGLang, or SGLang with <a href="https://lmsys.org/blog/2025-09-22-sglang-deterministic/">deterministic kernel</a>) does not impact the sampler-learner gap. </p></li></ul><blockquote><p><em>&#8220;Responses capped at 20K tokens exhibit a higher maximum mismatch than those capped at 4K&#8230; the mean mismatch remains similar across both settings&#8230; longer sequences provide more opportunities for a single, large probability divergence, even when the average per-token difference remains stable.&#8221;</em> - from [4]</p></blockquote><p>Beyond the factors mentioned above, there are other choices that may impact the sampler-learner gap but are not deeply analyzed in [4]. For example, dense models exhibit different levels of mismatch compared to <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts (MoE) models</a>, while base models tend to have a smaller mismatch compared to models that have already been post-trained. Additionally, the mismatch can fluctuate depending upon characteristics of our data (e.g., difficulty or domain).</p><h4>More Tweaks: GSPO, GMPO, CISPO and Beyond</h4><p>We have now learned about the most popular GRPO modifications that have been recently proposed, but there are still many other useful papers in this space. This section will provide a wider overview of such work with links to further reading.</p><blockquote><p><em>&#8220;Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.&#8221;</em> - from [6]</p></blockquote><p><strong>Group Sequence Policy Optimization (GSPO) [6]</strong> is a modified version of GRPO that yields benefits in terms of stability and efficiency, <em>especially for MoE models</em>. The GSPO algorithm was used for training <a href="https://arxiv.org/abs/2505.09388">Qwen 3 models</a>, which are (at the time of writing) the most performant and widely used open weight models. The key idea behind GSPO is changing the loss to operate at the sequence level instead of the token level. Most LLMs are trained using outcome rewards, meaning the reward is assigned at the sequence level. Assuming a single outcome reward, GRPO assigns the same advantage to every token in a sequence. Despite using outcome supervision, however, the surrogate loss in GRPO defines a per-token policy (or importance) ratio that scales the gradient of each token; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8mDV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8mDV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 424w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 848w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1272w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png" width="1456" height="579" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:269110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8mDV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 424w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 848w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1272w, https://substackcdn.com/image/fetch/$s_!8mDV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a1d5910-aa2c-491c-b82e-0ea3ca5ca43f_2212x879.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Token-level importance and sequence-level advantage in the GRPO loss</figcaption></figure></div><p>In this standard formulation of the surrogate objective in GRPO, there is a misalignment between how the model is optimized&#8212;<em>on the token level</em>&#8212;and how rewards are assigned&#8212;<em>on the sequence level</em>. Using token-level importance ratios increases the variance of the policy gradient and can lead to training stability issues in large-scale RL runs. To avoid these issues, GSPO instead computes the importance ratio on the sequence-level, which aligns naturally with the reward structure used for LLMs and improves training stability; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TJKy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TJKy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 424w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 848w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 1272w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TJKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png" width="1456" height="726" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:357458,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TJKy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 424w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 848w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 1272w, https://substackcdn.com/image/fetch/$s_!TJKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F730f70d6-f5a1-4478-b95f-2b85f255bcbc_2260x1127.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GSPO loss function (from [6])</figcaption></figure></div><p>The importance ratio is computed using the probability of the entire sequence, and we apply clipping to this sequence-level importance ratio. By doing this, we apply a stable sequence-level weight to all tokens, rather than introducing token-level importance weights with high variance. Notably, the importance ratio in GSPO is still normalized by the number of tokens in a completion <code>T</code>, ensuring that the ratio does not fluctuate drastically based on the length of a sequence. GSPO also uses the same advantage formulation as GRPO, allowing it to keep the same computational efficiency (i.e., from not using a value model).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MwMq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MwMq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 424w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 848w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MwMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png" width="1456" height="934" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:934,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:303591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MwMq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 424w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 848w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!MwMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef63e52-0d8d-4e36-b815-aafd3cba33d2_1686x1082.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>When used in experiments, GSPO not only improves training stability, but also offers better sample efficiency and overall performance; see above. The stability of GSPO is found to be especially useful when training large MoE models, such as <a href="https://huggingface.co/Qwen/Qwen3-235B-A22B">Qwen3-235B-A22B</a>. In particular, we often experience expert-activation volatility when training MoEs with RL, meaning that a large portion of experts active for a given prompt change or fluctuate drastically after one or more policy updates. This volatility in expert selection can prevent convergence during RL training.</p><p>Initially, Qwen 3 models solved this issue via <a href="https://arxiv.org/abs/2510.11370">routing replay</a>, which caches the initial experts selected for a prompt and uses these same experts for computing several subsequent policy updates. Routing replay enables convergence of MoE models when trained with GRPO. However, GSPO naturally provides stable RL training for MoEs without the need for any complex workarounds; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8t5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8t5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 424w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 848w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 1272w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8t5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png" width="1456" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8t5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 424w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 848w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 1272w, https://substackcdn.com/image/fetch/$s_!8t5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d7afa76-4be0-4333-a1c4-d3728d3077e7_1684x666.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Geometric Mean Policy Optimization (GMPO) [7]</strong> addresses the same problem observed by GSPO but uses a different approach. During RL training with GRPO, token-level importance ratios can become large in magnitude, creating outlier importance weights that cause training instability. GMPO solves this issue by using a new aggregation strategy for the loss. In  GRPO, the loss is aggregated by taking the mean of token-level losses over the sequence. GSPO improves stability by calculating importance ratios at a sequence level (i.e., not the token level). In contrast, GMPO still uses token-level importance ratios, but we aggregate the token-level loss by taking a <a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a> over the sequence; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HXrW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HXrW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 424w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 848w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 1272w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HXrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:478808,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HXrW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 424w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 848w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 1272w, https://substackcdn.com/image/fetch/$s_!HXrW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9344095d-9e37-44d5-8f22-76368a5c898e_1948x818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Because geometric means involve taking roots, they are only defined for non-negative numbers. To get around this, the geometric mean in GMPO is computed over absolute values of token-level losses and multiplied by the sign of the advantage (i.e., either <code>-1</code> or <code>1</code>) to ensure correct directionality of the update. </p><blockquote><p><em>&#8220;GMPO is plug-and-play&#8212;simply replacing GRPO&#8217;s arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers.&#8221;</em> - from [7]</p></blockquote><p>Given that arithmetic means are sensitive to outliers, outlier importance ratios during RL training can cause instability in the standard GRPO loss. On the other hand, geometric means are less sensitive to outliers and can, therefore, help to reduce the variance of the policy gradient; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xzuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xzuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 424w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 848w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 1272w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xzuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png" width="1456" height="363" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:363,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237970,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Xzuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 424w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 848w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 1272w, https://substackcdn.com/image/fetch/$s_!Xzuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8da0aaa-bd88-4340-af1f-5d8e7d70ba3a_1952x486.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Although GMPO still uses token-level importance ratios and applies clipping at the token level, a wider clipping range is needed relative to GRPO; e.g., authors in [7] use a range of <code>[~0.7, ~1.5]</code> instead of the default <code>[0.8, 1.2]</code> range used by GRPO. To ensure numerical stability, we usually compute importance ratios (and the entire geometric mean) using log probabilities instead of raw probability values. See below for an example&#8212;<em>this is a common practical trick used by most PPO-style algorithms</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. The clipping range used for GMPO corresponds to clipping the log of the importance ratio within the range <code>[-0.4, 0.4]</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UrZq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UrZq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 424w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 848w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 1272w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UrZq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png" width="1456" height="584" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/407f1e11-4517-425c-b946-388e55faef33_1978x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:208341,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UrZq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 424w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 848w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 1272w, https://substackcdn.com/image/fetch/$s_!UrZq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407f1e11-4517-425c-b946-388e55faef33_1978x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example implementation of the GMPO loss (from [7])</figcaption></figure></div><p>We learn from ablations in [7] that token-level clipping outperforms computing and clipping importance ratios at the sequence level. Importance ratios during RL training lie in a more stable range relative to GRPO as well; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Egxp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Egxp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 424w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 848w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 1272w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Egxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png" width="1456" height="712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:635987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Egxp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 424w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 848w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 1272w, https://substackcdn.com/image/fetch/$s_!Egxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee3395c-fe52-4ec6-b938-eff5ad5aed29_1948x952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Compared to GRPO, GMPO also has more stable entropy during training, which is a positive sign of exploration. In the math domain, GMPO improves Pass@1 performance by as much as 4% absolute, and the largest performance benefits are observed when training multimodal and MoE models. </p><p><strong>Clipped Importance Sampling Weight Policy Optimization (CISPO) [8]</strong> is a modified variant of GRPO that is proposed in the MiniMax-M1 technical report and shown to benefit training stability in experiments with large-scale RL. In experiments with PPO and GRPO, authors in [8] observe that &#8220;fork&#8221; tokens in the model&#8217;s reasoning trace (e.g., &#8220;aha&#8221; or &#8220;wait&#8221;) are rare and tend to have low probabilities, leading them to be assigned large importance ratios. Unfortunately, these pivotal fork tokens, which play an important role in the LLM&#8217;s reasoning process and help to stabilize entropy during training, are usually clipped by the GRPO objective, which eliminates their contribution to the policy update.</p><div class="pullquote"><p>&#8220;We found that tokens associated with reflective behaviors&#8230; were typically rare and assigned low probabilities by our base model. During policy updates, these tokens were likely to exhibit high [importance ratio] values. As a result, these tokens were clipped out after the first on-policy update, preventing them from contributing to subsequent off-policy gradient updates&#8230; These low-probability tokens are often crucial for stabilizing entropy and facilitating scalable RL.&#8221; - from [8]</p></div><p>In DAPO [1], this issue is addressed via the clip higher approach, which lessens restrictions on policy updates for exploration tokens by increasing the upper bound of clipping in GRPO. However, such an approach is less effective for MiniMax-M1 because 16 policy updates are performed over each batch of data&#8212;<em>most standard RL setups perform fewer (~2-4) updates</em>. Usually, the importance ratio will exceed the clipping range after a few policy updates, and tokens with larger ratios will eventually be ignored by all subsequent policy updates. Ideally, we should allow pivotal exploration tokens to contribute to all policy updates.  </p><p>CISPO uses the same advantage estimation technique as GRPO, but the structure of the objective resembles that of <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a>; see below. Unlike REINFORCE, however, token-level losses in CISPO are scaled by a clipped version of the importance ratio. Due to the use of a stop gradient, the importance ratio is treated as a constant that scales each token&#8217;s contribution to the overall policy gradient, <em>but it is not backpropagated when computing the gradient</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eThP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eThP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 424w, https://substackcdn.com/image/fetch/$s_!eThP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 848w, https://substackcdn.com/image/fetch/$s_!eThP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 1272w, https://substackcdn.com/image/fetch/$s_!eThP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eThP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png" width="1456" height="387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:250861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eThP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 424w, https://substackcdn.com/image/fetch/$s_!eThP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 848w, https://substackcdn.com/image/fetch/$s_!eThP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 1272w, https://substackcdn.com/image/fetch/$s_!eThP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0400f580-966c-4b30-8a42-70dddc7a89cc_2485x661.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CISPO loss (from [8])</figcaption></figure></div><p>For PPO and GRPO, tokens that are clipped from the loss receive zero gradient&#8212;<em>they have no contribution to the policy update</em>. By treating the importance ratio as a capped constant, CISPO adopts a soft, token-level clipping strategy. Clipped tokens still contribute to the gradient, but their weight is capped at a maximum value, as determined by the clipping mechanism in CISPO. When compared to GRPO and DAPO [1] for training Qwen2.5-32B-Base on math reasoning tasks, CISPO is found to improve both stability and sample efficiency; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eN8U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eN8U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 424w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 848w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1272w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png" width="1278" height="658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1278,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141546,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/181791956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eN8U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 424w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 848w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1272w, https://substackcdn.com/image/fetch/$s_!eN8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2983668c-bf0b-49f5-b375-d99d6f40291e_1278x658.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>More GRPO variants.</strong> Given the popularity of reasoning and RL in current LLM research, there are many modified algorithms and practical tweaks that have been proposed in the wake of GRPO. Only a small (though notable!) part of this work has been covered in this overview. To learn more, there are <a href="https://ydnyshhh.github.io/posts/policy_optimization/">several</a> <a href="https://www.interconnects.ai/p/papers-im-reading-base-model-rl-grpo">great</a> <a href="https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training">posts</a> beyond this overview that have been written on the topic. Additionally, a list of other notable works in the area has been compiled below:</p><ul><li><p><em><a href="https://arxiv.org/abs/2510.23027">Router-Shift Policy Optimization (RSPO)</a></em> is an MoE-focused RL algorithm that rescales router logits to improve training stability.</p></li><li><p><em><a href="https://arxiv.org/abs/2511.20347">Soft Adaptive Policy Optimization (SAPO)</a> </em>replaces clipping for the policy ratio with a softer gating mechanism to encourage stable policy updates.</p></li><li><p><em><a href="https://arxiv.org/abs/2505.12929">Low-Probability Token Isolation (Lopti)</a></em> reduces the effect of low-probability tokens on the policy gradient and emphasizes parameter updates driven by high-probability tokens to improve the efficiency of RL. </p></li><li><p><em><a href="https://arxiv.org/abs/2504.05118">Value-based Augmented Proximal Policy Optimization (VAPO)</a></em> builds upon work in DAPO to improve RL efficiency via the introduction of <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms">value models</a>. </p></li><li><p><em><a href="https://arxiv.org/abs/2508.08221">Lite PPO</a></em> performs an extensive empirical analysis of RL for reasoning, arriving at a critic-free RL algorithm&#8212;<em>based upon the vanilla PPO loss</em>&#8212;that consistently outperforms GRPO and DAPO. The main idea is to perform token-level loss aggregation and compute the standard deviation from the GRPO advantage over the entire batch instead of the group.</p></li><li><p><em><a href="https://arxiv.org/abs/2509.02333">Dynamic Clipping Policy Optimization (DCPO)</a></em> proposes a dynamic clipping scheme for token-level importance ratios and standardizes rewards across consecutive training steps to avoid cases with zero policy gradients. </p></li><li><p><em><a href="https://arxiv.org/abs/2504.11343">Reinforce-Rej</a></em> proposes a simple scheme&#8212;<em>inspired by <a href="https://rlhfbook.com/c/10-rejection-sampling">rejection sampling</a></em>&#8212;that improves RL efficiency by removing entirely correct and incorrect samples during training (similarly to dynamic sampling). </p></li></ul><p>If you are aware of any other works that propose improvements to GRPO, please share them in the comments so that this list can be improved and expanded!</p><h2>Putting It All Together</h2><blockquote><p><em>&#8220;Our TIS fix addresses the distribution mismatch problem rooted in the system level&#8230; Such a problem widely exists in RL training frameworks&#8230; our fix can be applied irrespective of the specific RL algorithms used.&#8221;</em> - from [4]</p></blockquote><p>Throughout the course of this overview, we have seen a wide variety of tips and tricks that can be applied to improve the effectiveness of RL training with GRPO. Despite the breadth of this work, we must remember that these proposals are not mutually exclusive&#8212;<em>the most performant RL setups will combine many best practices together</em>. For example, Olmo 3 [5] provides a perfect example of an RL training pipeline that incorporates several techniques from recent research. Specifically, the following set of improvements are adopted for training the Olmo 3 Think reasoning models with GRPO:</p><ul><li><p><em>Zero Gradient Filtering</em>: prompts for which the entire group of completions or rollouts in GRPO receive the same reward are removed [1].</p></li><li><p><em>Active Sampling</em>: to maintain a constant batch size despite filtering zero-gradient examples, additional samples are always available to replace those that are filtered out [1].</p></li><li><p><em>Token-Level Loss</em>: the GRPO loss is normalized by the total number of tokens across the batch instead of per-sequence, which avoids instilling a length bias in the loss [1].</p></li><li><p><em>No KL Loss</em>: the KL divergence term is removed from the GRPO loss to allow for more flexibility in the policy updates, which is a common choice in recent reasoning research.</p></li><li><p><em>Clipping Upper Bound</em>: the upper bound in the <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms">PPO-style clipping</a> used by GRPO is set higher than the lower bound to enable larger policy updates [1].</p></li><li><p><em>Truncated Importance Sampling (TIS)</em>: an extra importance sampling term is added to the GRPO loss to adjust for differences in log probabilities between engines used for training and inference [4].</p></li><li><p><em>No Standard Deviation</em>: the standard deviation of rewards in a group is excluded from the denominator of the GRPO advantage calculation [3].</p></li></ul><p>The modified GRPO objective for Olmo 3 is shown below. Compared to vanilla GRPO, we maintain the high-level structure of the loss but <em>i)</em> normalize the objective differently, <em>ii)</em> slightly change the advantage, <em>iii)</em> tweak the upper bound for clipping, and <em>iv)</em> weight the objective using TIS. Plus, <em>there is no need to stop here</em>! RL is a rapidly evolving research domain. We must actively monitor work in this area over time, test new modifications to the GRPO objective, and continually incorporate the tricks that are found to be helpful empirically. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ih7u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 424w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 848w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1272w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png" width="1456" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a761060e-d04d-4338-8ad9-412917fe2309_2374x693.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 424w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 848w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1272w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Enhanced GRPO formulation for Olmo 3 (from [5])</figcaption></figure></div><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Yu, Qiying, et al. &#8220;Dapo: An open-source llm reinforcement learning system at scale.&#8221; <em>arXiv preprint arXiv:2503.14476</em> (2025).</p><p>[2] Guo, Daya, et al. &#8220;Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.&#8221; <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[3] Liu, Zichen, et al. &#8220;Understanding r1-zero-like training: A critical perspective.&#8221; <em>arXiv preprint arXiv:2503.20783</em> (2025).</p><p>[4] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL <a href="https://fengyao.notion.site/off-policy-rl">https://fengyao.notion.site/off-policy-rl</a>.</p><p>[5] Olmo, Team, et al. &#8220;Olmo 3.&#8221; <em>arXiv preprint arXiv:2512.13961</em> (2025).</p><p>[6] Zheng, Chujie, et al. &#8220;Group sequence policy optimization.&#8221; <em>arXiv preprint arXiv:2507.18071</em> (2025).</p><p>[7] Zhao, Yuzhong, et al. &#8220;Geometric-mean policy optimization.&#8221; <em>arXiv preprint arXiv:2507.20673</em> (2025).</p><p>[8] Chen, Aili, et al. &#8220;MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention.&#8221; <em>arXiv preprint arXiv:2506.13585</em> (2025).</p><p>[9] Team, Kimi, et al. &#8220;Kimi k1. 5: Scaling reinforcement learning with llms.&#8221; <em>arXiv preprint arXiv:2501.12599</em> (2025).</p><p>[10] Hu, Jingcheng, et al. &#8220;Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.&#8221; <em>arXiv preprint arXiv:2503.24290</em> (2025).</p><p>[11] Schulman, John, et al. &#8220;Proximal policy optimization algorithms.&#8221; <em>arXiv preprint arXiv:1707.06347</em> (2017).</p><p>[12] Schulman, John. &#8220;Approximating KL Divergence.&#8221; Online (2020). <a href="http://joschu.net/blog/kl-approx.html">http://joschu.net/blog/kl-approx.html</a>.</p><p>[13] Shao, Zhihong, et al. &#8220;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.&#8221; <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[14] Liu, Aixin, et al. &#8220;Deepseek-v3 technical report.&#8221; <em>arXiv preprint arXiv:2412.19437</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>However, preference tuning in general can still play a useful role in modern LLM research; e.g., <a href="https://cameronrwolfe.substack.com/i/179769076/thinking-models">Olmo 3 Think</a> includes <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">DPO</a>-based preference tuning as part of the post-training pipeline for reasoning capabilities.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Entropy can be computed in a language model as described <a href="https://thegradient.pub/understanding-evaluation-metrics-for-language-models/">here</a>. Put simply, entropy looks at the next-token distribution of our LLM and quantifies the amount of entropy that exists in this distribution. In plain English, low entropy means that almost all of the probability is assigned to a single token, while high entropy means that the probability mass is spread across a larger number of tokens.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>At the time of writing, curriculum learning for RL (at least with LLMs) is not widely used. Most focus is placed on data composition rather than curriculum. However, this could become an interesting future topic of study. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The <a href="https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k">DAPO-Math-17K</a> dataset in HuggingFace actually contains ~1.8M rows, but many of these rows are duplicates. These rows are deduplicated in the DAPO code to arrive at a final set of ~17K prompts. Directions for how exactly to deduplicate this dataset properly can be found in <a href="https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/discussions/3">these notes</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Only answer format is considered, not the actual correctness of the answer. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Normalizing or whitening advantages is a very common practice improve in RL that is often used to improve training stability. However, rewards are usually normalized over a batch of data, whereas the bias demonstrated in [3] exists on a question-level. Batch-level normalization is consistent across all examples in the batch, but the question-level normalization in GRPO can lead to biased policy updates based on the difficulty of each individual question.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This time can also be dominated by the long tail of completions that have many tokens. Most completions tend to be of short or average length&#8212;<em>these may complete quickly when sampling rollouts</em>. However, a much longer amount of time may be spent waiting for a few very long completions to finish. This long tail problem can significantly degrade the efficiency of RL training, especially in a synchronous setup. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>We usually express GRPO via the clipped surrogate objective, rather than as a direct policy gradient expression. However, the policy gradient in GRPO is just the gradient of this surrogate objective.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>For example, we can see in <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms">this implementation of the PPO loss</a> that we compute the importance ratio using log probabilities instead of raw probability values.  </p></div></div>]]></content:encoded></item><item><title><![CDATA[Olmo 3 and the Open LLM Renaissance]]></title><description><![CDATA[Fully-open artifacts with the potential to make LLM research a reality for anyone...]]></description><link>https://cameronrwolfe.substack.com/p/olmo-3</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/olmo-3</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 15 Dec 2025 10:33:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4535661c-4484-4944-b8ac-6ab546ee3b3d_2483x1398.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2noQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2noQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 424w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 848w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2noQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1767544,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2noQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 424w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 848w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!2noQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fde3124-4365-49be-a655-9551603c6c62_2480x1394.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 5, 11])</figcaption></figure></div><p>As the capabilities of large language models (LLMs) have continued to progress, AI research has generally become less accessible to those outside of frontier labs. Although a variety of open-source LLMs are publicly available, there are two key issues that have consistently impeded progress in open research:</p><ul><li><p>The performance gap between closed and open models.</p></li><li><p>The prevalence of open-weight models, <em>and the scarcity of fully-open models</em>.</p></li></ul><p>Put simply, most &#8220;open&#8221; LLMs only publicly release the model&#8217;s weights (and sometimes an accompanying technical report). However, these weights are only a shallow snapshot of the model&#8217;s training process. To reproduce any component of this training process, more artifacts (e.g., data, code, training recipes, and deeper technical details) are needed. The limitations of open-weights LLMs have caused fully-open LLMs to become more popular, with AI2&#8217;s <a href="https://allenai.org/olmo">Open Language Model (Olmo) series</a> being one of the most prominent proposals in the space. In this post, we will provide a comprehensive and understandable overview of Olmo 3 [1]&#8212;<em>the most recent release in the Olmo series and top-performing fully-open LLM</em>.</p><div class="pullquote"><p>&#8220;We introduce Olmo 3, a family of state-of-the-art, fully open language models at the 7B and 32B parameter scales. The release includes&#8230; every stage, checkpoint, datapoint, and dependency used to build [Olmo 3]. Our flagship model, Olmo 3 Think-32B, is the strongest fully open thinking model released to-date.&#8221; - from [1]</p></div><p>As we will see, Olmo 3 lags behind the performance of top frontier models, but the value of these models lies in their transparency. In addition to providing a detailed technical report [1], Olmo 3 releases model checkpoints across the entire training process, all of the training data, and full training and evaluation code&#8212;<em>the models can be completely retrained from scratch using these resources</em>. For these reasons, the value of Olmo 3 goes beyond simply providing better, fully-open LLMs. For anyone interested in contributing to open LLM research, <em>Olmo 3 and its artifacts are among the most comprehensive starting points to ever be released</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!toOu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!toOu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 424w, https://substackcdn.com/image/fetch/$s_!toOu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 848w, https://substackcdn.com/image/fetch/$s_!toOu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 1272w, https://substackcdn.com/image/fetch/$s_!toOu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!toOu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png" width="1456" height="407" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b010019a-c883-4aac-9632-c86601ec4e78_2112x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:407,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100280,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!toOu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 424w, https://substackcdn.com/image/fetch/$s_!toOu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 848w, https://substackcdn.com/image/fetch/$s_!toOu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 1272w, https://substackcdn.com/image/fetch/$s_!toOu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb010019a-c883-4aac-9632-c86601ec4e78_2112x590.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Olmo 3 model flow (from [1])</figcaption></figure></div><p><strong>Model flow.</strong> The high-level training pipelines, referred to as &#8220;model flows&#8221; in [1], used for training both sizes (i.e., 7B and 32B) of Olmo 3 models are shown above. Base models for Olmo 3, which are also <a href="https://huggingface.co/allenai/Olmo-3-1125-32B">openly released</a>, are created via a three-stage process of general pretraining, midtraining on targeted data, and a context extension phase. From here, base models undergo a sequential post-training process that includes supervised finetuning (SFT), direct preference optimization (DPO), and RL training to produce multiple Olmo 3 model variants:</p><ul><li><p><em>Olmo 3 Instruct</em>: non-reasoning models that quickly respond to user queries and are optimized for multi-turn chat, instruction following, and tool usage.</p></li><li><p><em>Olmo 3 Think</em>: reasoning models that undergo specialized training to hone their complex reasoning capabilities by outputting long chains of thought (or reasoning trajectories) prior to providing a final answer. </p></li><li><p><em>Olmo 3 RL-Zero</em>: reasoning models that are created by running reinforcement learning (RL) training directly on the pretrained base model&#8212;<em>this setup was popularized by the DeepSeek-R1 model [9]</em>. </p></li></ul><p>Notably, the training algorithms and pipeline used for the Instruct and Think models are quite similar, but the data are modified to target unique capabilities. After covering necessary details of the Olmo 3 model architecture, we will explain in detail each component of this training process&#8212;<em>beginning with pretraining and ending with reasoning-oriented RL training</em>&#8212;in an end-to-end fashion. </p><p><strong>Preliminaries.</strong> This overview outlines the entire training pipeline for Olmo 3. In a single overview, we cannot cover all necessary background information needed to understand how a near-frontier-level LLM is trained. Instead, most important concepts will be explained inline as they are introduced throughout the overview. Additionally, an index of important topics that will appear throughout the overview (with links for further learning) is provided below:</p><ul><li><p><a href="https://cameronrwolfe.substack.com/p/language-model-training-and-inference">LLM pretraining</a> and <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">scaling laws</a>.</p></li><li><p><a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">Supervised Finetuning (SFT)</a> and <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">Direct Preference Optimization (DPO)</a>. </p></li><li><p><a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">Reasoning models</a>.</p></li><li><p><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a>. </p></li><li><p><a href="https://cameronrwolfe.substack.com/p/grpo">Group Relative Policy Optimization (GRPO)</a>.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to understand AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Base Models</h2><blockquote><p><em>&#8220;The goal of Olmo 3 Base is to establish a strong foundation that supports a diversity of general capabilities while enabling downstream capabilities like thinking, tool-use, and instruction-following to be easily elicited during post-training.&#8221;</em> - from [1]</p></blockquote><p>A new base model is pretrained from scratch for Olmo 3 with a special focus on key capabilities like reasoning and agents (i.e., function calling or tool use). These capabilities are usually elicited during later post-training stages, but we lay the groundwork during pretraining by exposing the model to a diverse dataset and building a robust knowledge base. Specifically, Olmo 3 undergoes three separate phases of pretraining:</p><ol><li><p>A general <strong>pretraining</strong> stage over a large textual corpus.</p></li><li><p>A <strong>midtraining</strong> phase focusing on targeted, high-quality data.</p></li><li><p>A <strong>context extension</strong> phase teaching the model to handle longer inputs.</p></li></ol><p>To improve upon Olmo 2 [3], authors in [1] explore new data curation strategies and iterate on the pretraining process in a scientifically rigorous manner. An expanded suite of benchmarks and evaluations that meaningfully capture base model performance across diverse experimental settings is also created, allowing the highest-performing pretraining recipe to be discovered empirically.</p><p><strong>Training infrastructure. </strong>Pretraining code and recipes for Olmo 3 are available in the <a href="https://github.com/allenai/OLMo-core">Olmo-Core repository</a>, allowing all Olmo 3 model checkpoints to be exactly reproduced. The pretraining process relies upon <a href="https://arxiv.org/abs/2304.11277">Fully-Sharded Data Parallel (FSDP)</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> distributed training, which saves memory by sharding<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> parameters, gradients, and optimizer states across GPUs; see below. During the forward and backward passes, each GPU gathers the full parameters for the current layer from the shards distributed across all GPUs, computes the necessary operations, and then re-shards the parameters&#8212;<em>and gradients after the backward pass</em>&#8212;before moving on to the next layer. As a result, we only store a single full layer in GPU memory at any given time, while all other layers are sharded across GPUs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ssa2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ssa2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 424w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 848w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 1272w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ssa2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png" width="1456" height="658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ssa2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 424w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 848w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 1272w, https://substackcdn.com/image/fetch/$s_!Ssa2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc92c84de-0516-435e-97c3-6524e91e3483_4372x1975.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">FSDP configuration (<a href="https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/">source</a>)</figcaption></figure></div><p>FSDP is also performing data parallel training (i.e., the &#8220;DP&#8221; part of FSDP). In addition to sharding, each GPU processes a unique mini-batch of data, allowing the total batch size to reach 8&#215; (assuming eight GPUs) the maximum batch size of a single GPU. For example, we can see the full training settings for the Olmo 3 32B Base model below, which uses a total batch size of 1,024 during pretraining. Given that the pretraining process uses 1,024 GPUs in total, this means that each GPU in the cluster is processing a single sequence during a training step. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AOXK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AOXK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 424w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 848w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 1272w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AOXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png" width="1792" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1792,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184901,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff503fc97-0686-4706-acd4-a1b83907bc7d_1792x1334.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AOXK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 424w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 848w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 1272w, https://substackcdn.com/image/fetch/$s_!AOXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F827a1887-8913-49f9-ae36-efeb8b7fa01d_1792x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>When pretraining a modern LLM like Olmo 3, we use more than just a single eight-GPU node. For example, we just mentioned that Olmo 3 is pretrained with 1,024 H100 GPUs (or 128 eight-GPU nodes), while midtraining and long context training use 128 and 256 GPUs, respectively. However, sharding across thousands of GPUs is inefficient because inter-node communication is much slower than intra-node communication. To solve this, we usually apply FSDP inside of each eight-GPU node and create replicas of the model across nodes to avoid constantly communicating model parameters&#8212;<em>which is very expensive</em>&#8212;between nodes.</p><div class="pullquote"><p><em>&#8220;We ran on 128 nodes with 8&#215; NVIDIA H100 (80GB HBM3) per node, connected via TCPXO (200 Gbps/GPU). We used HSDP via PyTorch FSDP2 with 8-way sharding so each node hosted a single model replica. Communication-intensive collectives were therefore restricted to within-node, improving efficiency.&#8221;</em> - from [1]</p></div><p>Within each node, FSDP is used to shard the model, while across nodes, standard data parallelism is used. Each node has a full copy of the model, and gradients are averaged across nodes at each step. This way, we sync parameters and gradients across nodes during each model update, rather than performing syncs at every layer as in FSDP. This approach, called <a href="https://blog.ezyang.com/2025/08/the-parallelism-mesh-zoo/">Hybrid-Sharded Data Parallel (HSDP)</a>, is used during all phases of training for the Olmo 3 Base models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wnt8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wnt8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 424w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 848w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 1272w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wnt8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png" width="1456" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:228087,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wnt8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 424w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 848w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 1272w, https://substackcdn.com/image/fetch/$s_!Wnt8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40ea10a-7d76-441e-921b-2c6a3491bff5_1880x822.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Depiction of tensor and context parallelism, or TP and CP (<a href="https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html">source</a>)</figcaption></figure></div><p>The primary limitation of the HSDP setup described above is the fact that it does not shard everything. For example, <em>full activations are stored on each GPU</em>! When performing long context training, storing the full activations on each GPU can lead to memory issues. As a solution, authors in [1] add <a href="https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html">Context Parallelism (CP) </a>to their distributed training setup, which splits the model&#8217;s input across multiple GPUs in a node along the sequence dimension to reduce memory overhead; see above. To support a multi-node setup, we can apply CP in tandem with FSDP inside of a node, then create data parallel replicas across nodes as in HSDP. </p><p><strong>Base model evaluation.</strong> The performance of Olmo 3 Base models across a wide variety of benchmarks is presented in the table below. Among fully-open models&#8212;<em>meaning weights, data, and code are all available</em>&#8212;like <a href="https://marin.community/">Marin 32B</a> and <a href="https://huggingface.co/swiss-ai/Apertus-70B-2509">Apertus 70B</a>, Olmo 3 Base models achieve state-of-the-art performance and make notable gains in the math and coding domains. When including open-weights models like Qwen and Gemma, Olmo 3 performs comparably in some domains (e.g., question answering), while lagging behind in others (e.g., math and code). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6yaq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6yaq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 424w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 848w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 1272w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6yaq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png" width="1456" height="1369" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:612262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6yaq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 424w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 848w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 1272w, https://substackcdn.com/image/fetch/$s_!6yaq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d46840-9365-42b0-8447-2f885c6f4fba_1889x1776.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>When analyzing the performance of Olmo 3 models, we will notice that they are not usually state-of-the-art when compared to open-weights LLMs. However, Olmo 3 models do outperform fully open models and approach the performance of the best open-weight models in most domains. Because the Olmo 3 model series discloses its full training dataset, certain data sources must be removed from training to retain a commercial license. Open-weight models, due to not disclosing training data, do not operate under this restriction, which may (partially) explain the gap in performance. Despite lagging slightly behind the state-of-the-art, however, <em>Olmo 3 models are an invaluable contribution due to their transparency and the ecosystem of tools they provide for further research</em>. </p><h4>Model Architecture</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4zhO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4zhO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 424w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 848w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 1272w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4zhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4zhO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 424w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 848w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 1272w, https://substackcdn.com/image/fetch/$s_!4zhO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12c1f6f8-a7a7-402f-99a0-424c00b41303_1545x814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from <a href="https://magazine.sebastianraschka.com/">Ahead of AI</a> by <a href="https://x.com/rasbt/status/1991656199394050380">Sebastian Raschka</a>)</figcaption></figure></div><p>The model architecture used by Olmo 3 [1] (shown above) is a dense<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformer</a> architecture very similar to that of Olmo 2 [3]. There are two model sizes released&#8212;<em>7B and 32B parameters</em>&#8212;which have the same structure, differing only in the following aspects:</p><ul><li><p>Number of self-attention heads.</p></li><li><p>Number of key and value heads (in self-attention). </p></li><li><p>Dimension of hidden layers and token vectors.</p></li><li><p>Number of total layers. </p></li></ul><p>This architecture follows most design decisions found in other popular open LLMs, such as the Qwen-3 [21] series. Notably, Olmo 3 maintains the <a href="https://cameronrwolfe.substack.com/i/170257215/transformer-structure">post-normalization </a>structure (with <a href="https://arxiv.org/abs/1910.07467">RMSNorm</a>) that was shown by Olmo 2 to improve training stability. Additionally, <a href="https://arxiv.org/abs/2010.04245">QK-norm</a> is used, meaning an additional RMSNorm layer is applied to queries and keys before computing the attention operation. This additional normalization avoids attention logits from becoming too large, which can aid in training stability (especially for low precision training). This same approach is also used by models such as <a href="https://arxiv.org/abs/2503.19786">Gemma-3</a> and Olmo 2.</p><p>In the 7B model, attention layers in Olmo 3 are regular, <a href="https://cameronrwolfe.substack.com/i/155023686/masked-and-multi-headed-self-attention">multi-headed attention layers</a> instead of <a href="https://cameronrwolfe.substack.com/i/170257215/attention-implementation">Grouped Query Attention (GQA)</a> layers. In contrast, the 32B model uses GQA with 40 attention heads and only eight key and value heads. As shown below, GQA shares keys and values&#8212;<em>but not queries!</em>&#8212;between multiple attention heads, which benefits both parameter and compute efficiency. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QELC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QELC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 424w, https://substackcdn.com/image/fetch/$s_!QELC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 848w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1272w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png" width="1456" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QELC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 424w, https://substackcdn.com/image/fetch/$s_!QELC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 848w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1272w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arxiv.org/abs/2305.13245">source</a>)</figcaption></figure></div><p>However, the biggest benefit of grouped-query attention comes at inference time. Memory bandwidth usage during inference is reduced because fewer keys and values need to be retrieved from the model&#8217;s <a href="https://huggingface.co/blog/not-lain/kv-caching">KV cache</a>. Given that memory bandwidth is the key bottleneck for the <a href="https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html">decode step</a> during the transformer&#8217;s inference process, this change drastically speeds up the inference process.</p><p>To further improve attention efficiency, Olmo 3 uses Sliding Window Attention (SWA), which only considers tokens inside of a sliding window&#8212;<em>Olmo 3 adopts a window size of 4K tokens is particular</em>&#8212;during attention to save costs; see below. SWA is used in <code>3/4</code> layers&#8212;<em>every fourth model layer uses full attention</em>. SWA is a common architectural choice used by <a href="https://cameronrwolfe.substack.com/p/gpt-oss">GPT-OSS</a>, <a href="https://arxiv.org/abs/2310.06825">Mistral</a>, <a href="https://arxiv.org/abs/2503.19786">Gemma</a> and more. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j9Pw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j9Pw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 424w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 848w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 1272w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j9Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png" width="539" height="301.07" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1400,&quot;resizeWidth&quot;:539,&quot;bytes&quot;:73173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j9Pw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 424w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 848w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 1272w, https://substackcdn.com/image/fetch/$s_!j9Pw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f35da6e-afad-4fb4-843b-9a8f16dafb6c_1400x782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Regular (masked) attention versus SWA</figcaption></figure></div><p>Finally, Olmo 3 uses <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.SiLU.html">Sigmoid Linear Unit (SiLU)</a> activations and is pretrained with a context window of 8K tokens. In a later training stage, Olmo 3 undergoes context extension using YaRN [8], which will be discussed more later in the overview. For a from-scratch implementation and detailed explanation of the Olmo 3 architecture, see <a href="https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/13_olmo3/standalone-olmo3.ipynb">this recent notebook</a> from <a href="https://sebastianraschka.com/">Sebastian Raschka</a>, or his extensive architecture comparison that includes most open LLMs. </p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:168650848,&quot;url&quot;:&quot;https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison&quot;,&quot;publication_id&quot;:1174659,&quot;publication_name&quot;:&quot;Ahead of AI&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!96vs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;title&quot;:&quot;The Big LLM Architecture Comparison&quot;,&quot;truncated_body_text&quot;:&quot;Last updated: Dec 14, 2025&quot;,&quot;date&quot;:&quot;2025-07-19T11:11:10.901Z&quot;,&quot;like_count&quot;:1516,&quot;comment_count&quot;:74,&quot;bylines&quot;:[{&quot;id&quot;:27393275,&quot;name&quot;:&quot;Sebastian Raschka, PhD&quot;,&quot;handle&quot;:&quot;rasbt&quot;,&quot;previous_name&quot;:&quot;Sebastian Raschka&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61f4c017-506f-4e9b-a24f-76340dad0309_800x800.jpeg&quot;,&quot;bio&quot;:&quot;I'm an LLM research engineer 10+ years of experience in artificial intelligence. My expertise lies in AI &amp; LLM research focusing on code-driven implementations. I am also the author of \&quot;Build a Large Language Model From Scratch\&quot; (amzn.to/4fqvn0D).&quot;,&quot;profile_set_up_at&quot;:&quot;2022-10-09T16:19:59.744Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-11-07T19:56:32.129Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:1127862,&quot;user_id&quot;:27393275,&quot;publication_id&quot;:1174659,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:1174659,&quot;name&quot;:&quot;Ahead of AI&quot;,&quot;subdomain&quot;:&quot;sebastianraschka&quot;,&quot;custom_domain&quot;:&quot;magazine.sebastianraschka.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Ahead of AI specializes in Machine Learning &amp; AI research and is read by tens of thousands of researchers and practitioners who want to stay ahead in the ever-evolving field.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;author_id&quot;:27393275,&quot;primary_user_id&quot;:27393275,&quot;theme_var_background_pop&quot;:&quot;#2096FF&quot;,&quot;created_at&quot;:&quot;2022-11-04T18:30:05.218Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Raschka AI Research (RAIR) Lab LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding plan&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;rasbt&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100,&quot;status&quot;:{&quot;bestsellerTier&quot;:100,&quot;subscriberTier&quot;:1,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:100},&quot;paidPublicationIds&quot;:[9873],&quot;subscriber&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!96vs!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png" loading="lazy"><span class="embedded-post-publication-name">Ahead of AI</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Big LLM Architecture Comparison</div></div><div class="embedded-post-body">Last updated: Dec 14, 2025&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">10 months ago &#183; 1516 likes &#183; 74 comments &#183; Sebastian Raschka, PhD</div></a></div><h4>Evaluating the Base Model</h4><p>Developing a solid pretraining recipe is an empirical process&#8212;<em>we need to test a bunch of settings and see what works well</em>. Given that pretraining is expensive, the number of full-scale pretraining runs we can perform is limited. Instead, we test interventions to the pretraining process by:</p><ol><li><p>Formulating smaller-scale tests to validate our ideas.</p></li><li><p>Applying promising interventions to full-scale runs.</p></li></ol><p>However, such an approach can still be difficult&#8212;<em>results at a small scale may not translate well to larger-scale experiment</em>s. Some benchmarks may only be sensitive at specific scales. For example, small-scale pretraining tends to yield models with random performance on math and code benchmarks, but other benchmarks may already be saturated even at smaller scales. Additionally, the LLM evaluation process is generally noisy, so small differences in results may not be meaningful.</p><div class="pullquote"><p><em>&#8220;If something hurts performance at small scale, you can confidently rule it out for large scale. But if something works at small scale, you should still make sure you&#8217;ve trained on a reasonable number of tokens to conclude with high probability that these findings will extrapolate to larger scales. The longer you train and the closer the ablation models are to the final model, the better.&#8221; </em>- from [2]</p></div><p>OlmoBaseEval is a set of 43 total benchmarks that is created to guide pretraining experiments for Olmo 3. This suite is 4&#215; larger than the benchmarks used by Olmo 2. It covers a wide range of capabilities (including math and code), presents multiple newly proposed benchmarks, and maintains held-out test sets for several important capabilities targeted during pretraining. The benchmark is developed according to three major design principles:</p><ol><li><p><em>Task Clusters</em>: benchmarks are grouped into task clusters over which scores are aggregated, where each cluster targets a core capability.</p></li><li><p><em>Proxy Metrics</em>: a detailed scaling analysis is performed to determine which tasks provide a useful signal at which scales.</p></li><li><p><em>Signal-to-Noise Ratio (SNR)</em>: benchmarks with high SNR<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> are either removed from the evaluation suite or evaluated using a larger number of samples.</p></li></ol><p>To form the task clusters, a pool of 23K benchmark scores is collected using 70 different open-weight models, then a clustering approach<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> is used to group tasks with similar evaluation results together. In other words, <em>a cluster includes tasks that tend to rank models similarly during evaluation</em>. Some manual post-processing is performed to arrive at the final task clusters: multiple-choice (MC) STEM, MC non-stem, Math, Code, and Code Fill-in-the-Middle (FIM); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dPCT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dPCT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 424w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 848w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 1272w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dPCT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png" width="1456" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dPCT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 424w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 848w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 1272w, https://substackcdn.com/image/fetch/$s_!dPCT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf3cdea1-a586-4fa3-a4ae-c11cedfbbf98_2278x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>A suite of 25 Olmo 2 [3] models trained with varying amounts of compute&#8212;<em>and a few other open-weight base models</em>&#8212;are used to conduct a scaling analysis, allowing us to observe the scale at which particular metrics become useful; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cZpX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cZpX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 424w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 848w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 1272w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cZpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png" width="1456" height="534" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:534,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:255107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cZpX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 424w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 848w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 1272w, https://substackcdn.com/image/fetch/$s_!cZpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c49b566-1c75-4a1e-bceb-1f4a2a77b887_1860x682.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Based on this analysis, evaluation tasks are separated into two groups:</p><ul><li><p><em>Base Easy</em>: tasks that show signal at smaller scale.</p></li><li><p><em>Base Main</em>: tasks that were not yet saturated at larger scales. </p></li></ul><p>The Base Easy task suite includes all tasks from Base Main that have ground truth answers available. Performance on this suite is measured in bits-per-byte<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, which is computed by dividing the <a href="https://sebastianraschka.com/faq/docs/negative-log-likelihood-logistic-loss.html">negative log-likelihood</a> of the ground truth answer by the number of bytes in the answer string. Using bits-per-byte as a proxy metric for evaluating a pretrained LLM provides a less noisy measure of performance without requiring advanced instruction following capabilities. Other common strategies include <a href="https://huggingface.co/docs/transformers/perplexity">perplexity-based evaluation</a> or multiple choice questions. </p><blockquote><p><em>&#8220;Continuous proxy metrics have been shown to be a better decision making tool for model performance before we exit the noise floor.&#8221;</em> - from [1]</p></blockquote><p>The OlmoBaseEval suite is used across pretraining and midtraining. The Base Easy suite is used as a proxy for evaluating smaller-scale pretraining runs, while full-scale pretraining and midtraining runs are evaluated with Base Main. The entire OlmoBaseEval suite is openly available in <a href="https://github.com/allenai/olmes">the Olmes repo from AI2</a> and can be run on any model, as shown below (taken from the Olmes README). </p><pre><code># Run the base easy evaluation (for evaluating small-scale experiments)
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:base_easy:code_bpb \
        olmo3:base_easy:math_bpb \
        olmo3:base_easy:qa_rc \
        olmo3:base_easy:qa_bpb \
    --output-dir &lt;output_dir&gt;

# Run the base main evaluation
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:base:stem_qa_mc \
        olmo3:base:nonstem_qa_mc \
        olmo3:base:gen \
        olmo3:base:math \
        olmo3:base:code \
        olmo3:base:code_fim \
    --output-dir &lt;output_dir&gt;

# Run the base held-out evaluation
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:heldout \
    --output-dir &lt;output_dir&gt;</code></pre><p><strong>Evaluation during pretraining.</strong> When running pretraining, we usually want to monitor the intermediate performance of our model. However, the learning rate has a huge impact on evaluation results. To get meaningful metrics, we must anneal (or decrease according to a schedule) our learning rate to zero prior to this evaluation being performed&#8212;<em>this simple approach is followed for the Olmo 3 7B model but is expensive</em>. As an efficient alternative, authors in [1] adopt a model merging approach from [6] for their 32B model that merges four checkpoints that are 1,000 steps apart before performing evaluation. This approach has been found to accurately simulate learning rate annealing behavior during pretraining.</p><blockquote><p><em>&#8220;We demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs.&#8221;</em> - from [6]</p></blockquote><p><strong>Model merging</strong> combines multiple models with the same architecture by taking a linear combination of their weights. This approach might seem bizarre, but it works well because LLMs finetuned from the same pretrained model are <a href="https://cameronrwolfe.substack.com/i/147448898/linear-mode-connectivity">mode connected</a>&#8212;<em>taking a linear combination of two such models&#8217; weights produces another model that performs well.</em> We can use model merging to combine multiple models into a hybrid model that shares the models&#8217; capabilities. One of the simplest model merging approaches is a model soup [22], which simply averages the weights of multiple model checkpoints. We can find public implementations of various model merging techniques in <a href="https://github.com/arcee-ai/mergekit">MergeKit</a>, which is also used in [1]. A full overview of model merging techniques can be found at the link below. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d437fdf7-d18b-4403-bb79-57e99b0cce52&quot;,&quot;caption&quot;:&quot;To improve the performance of a machine learning model, we can train several models independently and average their predictions at inference time to form an ensemble. Ensembling has been used for decades in machine learning, but this approach comes with the downside of increased inference costs&#8212;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Model Merging: A Survey&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-09-16T09:33:51.978Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eea90630-3376-4b9a-8a7c-c410713b195d_2564x1426.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/model-merging&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147448898,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:78,&quot;comment_count&quot;:8,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Pretraining</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3XpJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3XpJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 424w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 848w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 1272w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3XpJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3XpJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 424w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 848w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 1272w, https://substackcdn.com/image/fetch/$s_!3XpJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db7c599-c067-473d-bb5c-0d374644bdfb_1834x762.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Creating Dolma 3 Mix</figcaption></figure></div><p>The pretraining process for Olmo 3&#8212;<em>including both experiments and final training runs</em>&#8212;consumes over 90% of total compute for the project and targets four key capabilities: science, medical, math and coding. <a href="https://huggingface.co/datasets/allenai/dolma3_mix-6T-1025">Dolma 3 Mix</a>, which contains 6T tokens derived from the full <a href="https://huggingface.co/datasets/allenai/dolma3_pool">Dolma 3 pool</a> of 9T tokens, is the primary data source used for pretraining and is created using the steps illustrated above. These steps mostly match other open pretraining recipes [2, 3, 4], aside from:</p><ul><li><p>Using token-constrained mixing and quality-aware upsampling (details to follow) to improve the overall quality of tokens included in the mixture.</p></li><li><p>Including a new set of academic PDF data&#8212;<em>238M unique PDFs in total with a knowledge cutoff of December 2024</em>. This data is curated using a custom PDF crawler that prioritizes academic sites and paper repositories then converted into linear plain text with <a href="https://olmocr.allenai.org/">OlmOCR</a>.</p></li></ul><p>For Olmo 3 pretraining, authors only consider data sources that have a sufficient number of tokens to meaningfully impact model capabilities during pretraining&#8212;<em>additional small but high-quality data sources are reserved for midtraining</em>. Structured data (e.g., question-answer pairs or <a href="https://cameronrwolfe.substack.com/i/170257215/tokenizer">chat templated</a> data) is also saved for midtraining. Including structured data in pretraining&#8212;<em>even if the token quantity is small</em>&#8212;significantly impacts evaluation results and can complicate data ablations. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zOrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zOrQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 424w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 848w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 1272w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zOrQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png" width="1456" height="445" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:445,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:218173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zOrQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 424w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 848w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 1272w, https://substackcdn.com/image/fetch/$s_!zOrQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01e6eafc-9d25-4c90-bf87-e12160b37e3f_2128x650.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Mixing approach.</strong> The <a href="https://huggingface.co/datasets/allenai/dolma3_pool">full Dolma 3 pool</a> contains 9T tokens, but we must mix and sample this pool&#8212;<em>under the constraint of the total number of tokens we want to use for training (i.e., 6T for Olmo 3)</em>&#8212;to create the best possible pretraining corpus. As shown in the table above, Dolma 3 is partitioned into groups by type, and we must determine the optimal mixing ratio for each of these groups. The strategy for determining the best data mixture in [1] has two components:</p><ol><li><p>A <strong>base procedure</strong> that constructs a high-quality data mix over a fixed (i.e., not being actively changed or developed) set of data sources.</p></li><li><p>A <strong>conditional mixing</strong> step that efficiently updates our existing mix as data sources change during the model development process. </p></li></ol><blockquote><p><em>&#8220;We apply a mixing strategy that draws on swarm-based methods to train and evaluate many smaller proxy models, using these results to inform an optimal mix. Further, we apply a novel conditional mixing procedure to account for the fact that our data sources were being constantly refined and updated.&#8221;</em> - from [1]</p></blockquote><p>The base procedure in [1] uses a <a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">swarm optimization</a> approach that is similar to the idea of RegMix [5]. The swarm optimization proceeds as follows:</p><ol><li><p>Randomly sample a large number of mixtures. In [1], the number of mixtures sampled is set to 5&#215; the number of data sources being mixed.</p></li><li><p>Perform small proxy experiments by training a 30M parameter Olmo 3 model over 3B tokens from each mixture.</p></li><li><p>Evaluate each proxy model on the Base Easy suite.</p></li><li><p>For each task in the Base Easy suite, train a <a href="https://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model</a> that predicts task performance given the mixing parameters as input.</p></li><li><p>Use the generalized linear models to simulate performance of different data mixtures and search for the optimal data mixture under constraints. </p></li></ol><p>In [1], authors have a maximum token budget and aim to not repeat any domain in the data more than four to seven times. These constraints are added to the final optimization step when searching for the optimal data mixture. From here, we can take the optimal data mixture and test it at a larger scale; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bJIp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bJIp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 424w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 848w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bJIp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png" width="1456" height="1099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1099,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461898,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bJIp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 424w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 848w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!bJIp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91356fc5-0884-4e84-96ba-9448881a5c4c_1706x1288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>During model development, we will usually search for the optimal data mixture more than once. Data sources are constantly changing and being improved, which influences the optimal mixture<em>.</em> Additionally, some sources of data may become available later in the development process. Re-running the base procedure from scratch is inefficient&#8212;<em>not all sources of data are changing</em>. Instead, a conditional mixing approach is proposed [1], which avoids re-computing the full swarm by:</p><ul><li><p>Beginning with a base mixture that has already been optimized.</p></li><li><p>Treating this mixture as a &#8220;virtual&#8221; data source with frozen mixing ratios.</p></li><li><p>Considering all new or modified data sources. </p></li><li><p>Re-running the base procedure with both new and virtual domains.</p></li></ul><p>Multiple rounds of data mixing are performed for Olmo 3, including an initial round to optimize the mixture of web data and several conditional mixing rounds that added code and PDF data to the mixture. Properties of the final data mixture are shown below<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, where we can see that training on the optimal data mixture&#8212;<em>as opposed to the natural data distribution</em>&#8212;improves performance on most tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QX47!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QX47!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 424w, https://substackcdn.com/image/fetch/$s_!QX47!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 848w, https://substackcdn.com/image/fetch/$s_!QX47!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 1272w, https://substackcdn.com/image/fetch/$s_!QX47!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QX47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png" width="1456" height="1303" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1303,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:338999,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QX47!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 424w, https://substackcdn.com/image/fetch/$s_!QX47!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 848w, https://substackcdn.com/image/fetch/$s_!QX47!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 1272w, https://substackcdn.com/image/fetch/$s_!QX47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1556431-82c7-4e63-9732-fc8b63352fa1_1632x1460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The mixing strategy described above is also flexible and can be used to optimize more than just domain mixtures. For example, when optimizing the code mixture in [1], authors fix the overall ratio of code data at 25% and instead optimize the mixture of programming languages within this hard-coded token budget.</p><blockquote><p><em>&#8220;We found that quality-aware upsampling improves performance in data-constrained settings&#8230; We achieved better results by upsampling the highest-quality data: including multiple copies of the top 5% and single copies of the remaining data to reach the target token count.&#8221;</em> - from [1]</p></blockquote><p><strong>Quality-aware upsampling.</strong> We can further improve performance by upsampling&#8212;<em>or including multiple copies of</em>&#8212;the highest quality data in the training mixture. This effect can be achieved by first running all data through a quality classifier and forming an upsampling curve as shown below, where the x-axis represents data quality and the y-axis is the upsampling factor. If we were to filter data with a fixed quality threshold, this upsampling curve would be a step function, but authors in [1] model upsampling as a monotonically increasing curve. For example, we see below that the highest quality percentile of data receives an upsampling factor of ~7&#215;, <em>meaning the data is repeated seven times in training</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lnZb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lnZb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 424w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 848w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 1272w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lnZb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:220667,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lnZb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 424w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 848w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 1272w, https://substackcdn.com/image/fetch/$s_!lnZb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9cf73b-4ece-4216-bb72-81cb7a023403_1616x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>A separate upsampling curve is formed for every topic in the pretraining data. To find this curve, we start with our known constraints for pretraining:</p><ul><li><p>The optimal mixture of topics (i.e., determined by data mixing).</p></li><li><p>The total number of desired tokens for training.</p></li><li><p>The maximum upsampling factor.</p></li></ul><p>From here, we can perform a search over the space of parametric curves to find one that meets these constraints. Once the curve is found, the data for a topic is separated into a discrete set of quality buckets or percentile ranges. We can compute the upsampling factor for a given bucket by <a href="https://www.youtube.com/watch?v=rfG8ce4nNh0">integrating</a> the upsampling curve over this bucket and dividing this integral by the width of the bucket. </p><h4>Midtraining &amp; Long Context</h4><p>Following the primary pretraining phase for Olmo 3, the model undergoes continued midtraining and long context training. The training objective during these phases is identical to that of pretraining, but we <em>i)</em> adopt more targeted datasets and <em>ii)</em> train for fewer tokens. For example, midtraining and long context training for Olmo 3 each train the model over an additional 100B tokens. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LizC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LizC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 424w, https://substackcdn.com/image/fetch/$s_!LizC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 848w, https://substackcdn.com/image/fetch/$s_!LizC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 1272w, https://substackcdn.com/image/fetch/$s_!LizC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LizC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png" width="1456" height="503" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LizC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 424w, https://substackcdn.com/image/fetch/$s_!LizC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 848w, https://substackcdn.com/image/fetch/$s_!LizC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 1272w, https://substackcdn.com/image/fetch/$s_!LizC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce4d4e1-da42-40a5-9941-aa74843c99ed_1622x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Midtraining</strong> for Olmo 3 uses the <a href="http://allenai/dolma3_dolmino_mix-100B-1125">Dolma 3 Dolmino Mix</a>, which contains 100B tokens curated to enhance key model capabilities. This data mix is derived via a two-part iterative process (illustrated above):</p><ol><li><p><em>Parallel (or distributed) feedback</em>: many data sources are considered in parallel via efficient microannealing experiments [2] that use lightweight training runs to ablate each data source<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. </p></li><li><p><em>Integration tests</em>: any data sources yielding promising microannealing results are combined into a centralized annealing run over a 100B token dataset that includes all promising sources of data at that time. </p></li></ol><p>This approach creates a distributed feedback loop that allows many sources of data to be efficiently explored and brings promising data sources together for centralized integration tests. Put simply, <em>we can repeatedly vet data sources in parallel and validate them at larger scales until we arrive at the final mixtraining mix</em>. Five rounds of integration tests were performed when developing Olmo 3.</p><blockquote><p><em>&#8220;This methodology allowed us to make rapid, targeted assessments of the quality of datasets being considered for the midtraining mix, and to iterate on many data domains in parallel.&#8221; </em>- from [1]</p></blockquote><p>To evaluate models during midtraining, authors rely primarily upon the Base Main dataset, which consists of benchmarks that are not yet saturated during pretraining. Additionally, lightweight SFT experiments are performed with midtrained models to test the &#8220;post-trainability&#8221; of various data mixtures. The performance of Olmo 3 models on these benchmarks after iterative rounds of microannealing experiments and integration testing is outlined below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yfiN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yfiN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 424w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 848w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 1272w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yfiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png" width="1456" height="368" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115044,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yfiN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 424w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 848w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 1272w, https://substackcdn.com/image/fetch/$s_!yfiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c81ffe8-30ea-4f61-8fb6-fde5c0963e2d_1624x410.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The final midtraining mix includes some pretraining data to avoid model drift. Additionally, instruction and thinking (or reasoning) data is included, which is found to benefit performance almost universally across benchmarks and helps to lay early groundwork for post-training. All instruction and reasoning data avoids using templates or special tokens during midtraining due to the complexity that this additional formatting introduces into the evaluation process. Instead, text formatting is adopted, which maintains the pretrained model&#8217;s output format.</p><blockquote><p><em>&#8220;Although individual sources and domains present performance tradeoffs, the inclusion of these cross-domain post-training data types in aggregate is consistently beneficial, and this benefit begins even before post-training.&#8221;</em> - from [1]</p></blockquote><p>We observe very clear domain tradeoffs during midtraining. For example, math and code performance can be improved by increasing the ratio of this data in the midtraining mixture, but such improved performance comes at the cost of degraded performance in other domains. The Dolma 3 Dolmino mix strikes a balance between important domains. Interestingly, the final midtraining model is also a merge of two independently-trained models with different seeds, which authors find to improve performance compared to using an individual model.</p><p><strong>Long context</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> training is an important component of modern LLMs that plays a huge role in real-world tasks (e.g., tool usage or multi-turn chat) and helps with enabling test-time scaling for reasoning models. However, pretraining an LLM from scratch with natively long context would be incredibly expensive&#8212;<em>long sequences consume a lot of memory and compute during training</em>. To get around this, most LLMs are pretrained using much shorter sequences (e.g., 8K tokens in the case of Olmo 3) and undergo a context extension phase after pretraining.</p><blockquote><p><em>&#8220;Because training with long sequence lengths is computationally costly, most language models are pretrained with shorter sequences and extended only in a later stage of model development. During the extension phase, models are trained on longer documents, and positional embedding hyperparameters are typically adjusted to ease positional generalization.&#8221;</em> - from [1]</p></blockquote><p>The details of this context extension phase vary drastically between models. For example, the number of tokens used for long context training can be anywhere between 100B (or less) to 1T tokens, and the order of training phases changes between models&#8212;<em>the long context phase could be placed before midtraining or even included as part of post-training</em>. Olmo 3 adopts a straightforward pipeline that performs long context extension after midtraining and before post-training. This long context phase uses a 100B token mix<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> of the <a href="https://huggingface.co/datasets/allenai/dolma3_longmino_pool">full 600B token Dolma 3 Longmino pool</a> to extend the context of Olmo 3 from 8K to ~65K tokens. </p><p><strong>Long context data.</strong> The dataset for long context training includes a combination synthetic data and long documents sourced from the academic PDF pretraining corpus. This data undergoes heuristic GZIP filtering that removes any document in the top and bottom 20% of GZIP compressibility. In other words, <em>we remove long context documents that are the least or most redundant</em>. Interestingly, this GZIP heuristic outperforms <a href="https://arxiv.org/abs/2410.23771">more sophisticated, model-based techniques</a> that use perplexity metrics to identify documents with long-range token dependencies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9YBn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9YBn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 424w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 848w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9YBn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png" width="1456" height="723" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:723,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:686083,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9YBn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 424w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 848w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!9YBn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbffad7c1-9985-4515-9bf1-632ff76c6b65_2212x1098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Beyond PDF data, authors in [1] collect synthetic long context data that is focused on information extraction tasks over long documents. Specifically, the technique used to generate long context data is inspired by CLIPPER [7]; see above. This approach avoids making the assumption that the LLM being used to generate synthetic data already has long context abilities. Instead, we do the following:</p><ol><li><p>Partition a long document into several sections.</p></li><li><p>Identify the most common noun phrases in each section<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p></li><li><p>Extract (<code>k=8</code>) text snippets from the section for each noun phrase.</p></li><li><p>Provide this information in a prompt to an LLM&#8212;<em><a href="https://huggingface.co/allenai/OLMo-2-0325-32B">Olmo 2 32B</a> is used in [1]</em>&#8212;to synthesize an aggregation task; e.g., writing a summary, providing a list of true or false claims, creating a conversational explainer, and many more. </p></li></ol><p>We can then train our model to replicate these synthetic outputs using only the long document as input, which teaches the model to reliably extract information. During long context training, this data&#8212;<em>including both the PDF and synthetic data</em>&#8212;is mixed with short context data from midtraining at a <code>1:2</code> ratio (i.e., 34% long context and 66% short context data) to form the <a href="http://allenai/dolma3_longmino_mix-100B-1125">Dolma 3 Longmino Mix</a>. </p><p>Data during long context training varies drastically in terms of sequence length. <em>Naively batching sequences together would yield excessive padding</em>. When we batch sequences together, we create a fixed-size tensor of size<code> B (batch size) &#215; S (sequence length) &#215; d (embedding dimension)</code>. Here, <code>S</code> is either the maximum context length during training or the size of the longest sequence in our batch. Usually, each sequence is shorter than <code>S</code>, and we occupy the rest of this tensor with padding tokens to maintain the fixed shape needed by the GPU.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IDZK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IDZK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 424w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 848w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 1272w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IDZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png" width="607" height="295.5789835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:607,&quot;bytes&quot;:123822,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IDZK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 424w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 848w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 1272w, https://substackcdn.com/image/fetch/$s_!IDZK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaaaa7df-e69f-4bfb-9286-06be1f37614e_1915x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard batching compared to document packing</figcaption></figure></div><p>In the case of long context training, most examples will have length <code>&#8810;S</code>&#8212;<em>most of this tensor will be occupied by empty padding tokens that waste computation</em>; see above. To solve this issue, we can use <a href="https://huggingface.co/blog/sirluk/llm-sequence-packing">document packing</a>, which batches sequences together in the same row to avoid excessive padding; see above. Additionally, we add an inter-document mask to the attention process to avoid attention across examples that are packed together. This approach is used by Olmo 3 to improve the efficiency of the long context training process; see <a href="https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#which-hyperparameters-actually-matter">here</a> for more details. </p><div class="pullquote"><p><em>&#8220;We experiment with several methods for extending RoPE&#8230; including adjusted base frequency scaling, position interpolation, and YaRN. Each approach is applied either to all RoPE instances or is restricted to RoPE used in full attention layers. We find that applying YaRN only to full attention layers yields the best overall performance&#8221; - from [1]</em></p></div><p><strong>Context Extension.</strong> Several different context extension techniques are tested in [1], and YarN [7] is found to yield the best performance on key evaluations like <a href="https://arxiv.org/abs/2407.01437">advanced Needle-in-a-Haystack (NIH) tests</a>, <a href="https://arxiv.org/abs/2404.06654">RULER</a>, and <a href="https://arxiv.org/abs/2410.02694">HELMET</a>. Full details on YarN and other context extension techniques can be found in <a href="https://cameronrwolfe.substack.com/i/170257215/long-context">this overview</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!csdN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!csdN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 424w, https://substackcdn.com/image/fetch/$s_!csdN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 848w, https://substackcdn.com/image/fetch/$s_!csdN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!csdN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!csdN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png" width="1456" height="961" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:246526,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!csdN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 424w, https://substackcdn.com/image/fetch/$s_!csdN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 848w, https://substackcdn.com/image/fetch/$s_!csdN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!csdN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e31645c-280e-42ad-9d95-de301b869476_1612x1064.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>YaRN is only applied to full attention layers, while positional embeddings are left unchanged in layers that use SWA. As shown in the figures above, this extension approach, when combined with an increasing amount of curated long context data, significantly benefits the long context performance of Olmo 3 models. </p><p>Model merging continues to play a role in long context training, but we cannot run multiple long context training runs with different seeds due to the high cost of long context training. Instead, authors in [1] take three (adjacent) checkpoints from the end of a single long context training run and merge them, which further benefits performance. The long context capabilities of Olmo 3 are comparable to or slightly worse than that of the Qwen-2.5 models, as shown in the table below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fIbZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIbZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 424w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 848w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIbZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png" width="1456" height="990" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:990,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:301360,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIbZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 424w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 848w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!fIbZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f020c1a-ceaa-4db8-bbcd-847f56d446e9_1606x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h2>Thinking Models</h2><blockquote><p><em>&#8220;Olmo 3 Think is trained for reasoning by generating extended thoughts before producing a final answer. To achieve this, we curate high-quality reasoning data (Dolci Think), apply a three-stage training recipe (SFT, DPO, and RLVR), and introduce OlmoRL infrastructure, which brings algorithmic and engineering advances in reinforcement learning with verifiable rewards.&#8221; </em>- from [1]</p></blockquote><p>Expanding upon the Olmo 3 Base models, authors in [1] explore post-training strategies to create a suite of reasoning models, referred to as Olmo 3 Think. These models are trained to reason by outputting long reasoning traces or trajectories prior to their final output via large-scale RLVR. For an in-depth overview of LLM-based reasoning models, please see the link below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f71a71d3-eb65-4bbf-8971-a92711ac857b&quot;,&quot;caption&quot;:&quot;For the last several years, we have used a relatively fixed pipeline for training large language models (LLMs); see below. First, we pretrain these language models over raw textual data from the internet. Afterwards, we align them&#8212;or train them to produce outputs that are preferable to humans&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Demystifying Reasoning Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-18T10:33:55.513Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d9c87e-b238-4fdd-996e-4ed4465b9931_2334x1282.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/demystifying-reasoning-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153722335,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:261,&quot;comment_count&quot;:5,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>The reasoning training process for Olmo 3 Think models differs from other work in two keys respects:</p><ol><li><p>Models are trained with both SFT and DPO prior to RLVR.</p></li><li><p>A multi-objective RLVR approach is used that mixes data from both verifiable and non-verifiable domains. </p></li></ol><p>Despite differing slightly from related work, this post-training pipeline is shown in [1] to yield consistent gains across all stages (i.e., SFT, DPO, and RLVR). </p><p><strong>Evaluation results.</strong> Relative to Olmo 2 [2], Olmo 3 Think models are evaluated over a much wider set of benchmarks that capture capabilities like math, general reasoning, knowledge, coding, instruction following, question answering, chat, and more. At the 32B scale, Olmo 3 Think models achieve state-of-the-art metrics among other fully-open thinking models, as well as match the performance of some popular open-weight models like Qwen-2.5 and Gemma-3; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NG1I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NG1I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 424w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 848w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 1272w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NG1I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png" width="1456" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:354743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NG1I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 424w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 848w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 1272w, https://substackcdn.com/image/fetch/$s_!NG1I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0422809-9bc1-44f0-b538-dc89bc97501d_2110x1262.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Compared to top open-weight reasoning models like Qwen-3, Olmo 3 Think narrows the gap in performance but is still lags behind. This gap is especially pronounced for 7B-scale models, where we see that Olmo 3 Think is significantly outperformed by Qwen 3 on knowledge-based tasks (e.g., <a href="https://huggingface.co/datasets/cais/mmlu">MMLU</a>). Such results align with general trends in performance for Olmo 3&#8212;<em>these models are close to state-of-the-art and provide many benefits in terms of transparency and openness</em>. </p><h4>SFT &amp; DPO</h4><p>Prior to RL training, we finetune the base model using both SFT and DPO in order to create a more useful starting point for RL. The purpose of these training stages is to both improve capabilities and, more specifically, teach the model to produce thinking traces prior to its final answer. <em>We are seeding the model with the correct output format before performing RL</em>. Notably, recent work on LLM post-training typically does not use all of these stages. For example, DeepSeek-R1 [9] either performs a lightweight SFT stage before RLVR or applies RLVR directly to the base model (i.e., an RL-Zero setup). We see in [1] that consistent gains can be realized by performing SFT and DPO prior to RL given proper data curation. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z4I1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z4I1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 424w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 848w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 1272w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z4I1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png" width="1456" height="354" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z4I1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 424w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 848w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 1272w, https://substackcdn.com/image/fetch/$s_!z4I1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a2fbab0-4355-426b-8d49-10695f9db168_1784x434.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The key training settings for the SFT and DPO training processes performed with Olmo 3 are provided in the tables shown below for reference. The training code is present in <a href="https://github.com/allenai/OLMo-core">Olmo-Core</a> (for SFT) and <a href="https://github.com/allenai/open-instruct">OpenInstruct</a> (for DPO). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4EBQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4EBQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 424w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 848w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4EBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png" width="1456" height="925" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:925,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:524487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4EBQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 424w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 848w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!4EBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84426c17-2a76-438f-ab76-9fc1a6e4e5b6_2034x1292.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>SFT.</strong> <a href="https://huggingface.co/datasets/allenai/Dolci-Think-SFT-7B">Dolci Think SFT</a> is a set of ~2.3M supervised training examples that is used for the SFT stage of Olmo 3 and spans several important capabilities like math, science, coding, instruction following, chat and safety. This data is curated as follows (see above for a step-by-step illustration):</p><ul><li><p><em>Prompt sourcing</em>: prompts are sourced for each capability from a wide variety of public datasets<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>.</p></li><li><p><em>Re-generating examples</em>: for prompts with incomplete completions, we generate new completion(s)&#8212;<em>including both a reasoning trace and final answer for each completion</em>&#8212;using either <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek-R1</a> or <a href="https://huggingface.co/Qwen/QwQ-32B">QwQ-32B</a>. </p></li><li><p><em>Correctness filtering</em>: completions are verified using various domain-specific strategies (e.g., synthetically-generated test cases for code or verifiers for specific precise instruction following constraints). </p></li><li><p><em>Heuristic filtering</em>: prompts are removed based on having unclear usage licenses, incomplete reasoning traces, excessive repetition, mention of other model providers, and other heuristics.</p></li><li><p><em>Topic filtering</em>: prompts are classified by topic according to the <a href="https://openai.com/index/how-people-are-using-chatgpt/">OpenAI query taxonomy</a>, and any topics that are irrelevant to Olmo 3 (e.g., requests for image generation) are either filtered or downsampled. </p></li></ul><p>This post-training data curation process is generic and goes beyond SFT&#8212;<em>a similar pipeline is used to curate data for DPO and RLVR</em>. After prompts are sourced and filtered, the data mixture is derived using an approach very similar to that of midtraining: <em>many data sources are gathered in parallel and tested via lightweight SFT experiments that train an LLM over 100B tokens from the domain of interest combined with an 100B token SFT base mixture</em>. After evaluating data sources in parallel, we can perform centralized integration tests with data sources that are found to meaningfully benefit performance. Interestingly, all data sources in [1] were found to benefit performance on at least one evaluation benchmark; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YVpF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YVpF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 424w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 848w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 1272w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YVpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png" width="1456" height="463" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:463,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204401,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YVpF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 424w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 848w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 1272w, https://substackcdn.com/image/fetch/$s_!YVpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bb87a28-52ba-47fd-b5f7-3e1eeac27061_2208x702.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Beyond the post-training benchmarks used for Olmo 3, authors in [1] emphasize the role of &#8220;vibe checks&#8221;&#8212;<em>or the manual inspection of a diverse (but usually small) set of model outputs by researchers</em>&#8212;in evaluating models. Evaluation metrics and benchmark scores are useful, <em>but they rarely tell the full story</em>. By manually inspecting model outputs, we can discover trends in performance across experiments and training stages that might be difficult to uncover otherwise. </p><blockquote><p><em>&#8220;Using [Olmo-Core], we can train a 7B model at 7700 tokens per second per GPU and a 32B at 1900 tokens per second per GPU&#8230; by relying on PyTorch&#8217;s built-in torch.compile(), custom kernels for operations such as attention and language modeling head, asynchronous and batched gathering of metrics, and asynchronous writing of checkpoints.&#8221; </em>- from [1]</p></blockquote><p>Similarly to pretraining and midtraining, the SFT training process uses the <a href="https://github.com/allenai/OLMo-core">Olmo-Core</a> codebase, which provides optimized code for supervised training. Compared to prior SFT training code for Olmo (i.e., found <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/finetune.py">here</a> in OpenInstruct), Olmo-Core is ~8&#215; faster. Two epochs of training are conducted over Dolci Think SFT, and we again derive the final model via model merging. Specifically, we linearly merge the weights of two model checkpoints trained with different learning rates over the same data, forming the Olmo 3 <a href="https://huggingface.co/allenai/Olmo-3-7B-Think-SFT">7B</a> and <a href="https://huggingface.co/allenai/Olmo-3-32B-Think-SFT">32B</a> Think SFT models. </p><p><strong>DPO.</strong> Preference tuning is typically used for improving the alignment of an LLM to human preferences. In recent research on reasoning models, preference tuning is rarely used, but we see in [1] that DPO-based preference tuning yields an improvement in capabilities when used in tandem with SFT prior to the RL training phase. More specifically, Olmo 3 undergoes <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">DPO</a>-based preference tuning using a strategy that is inspired by Delta Learning [11]; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P41w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P41w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 424w, https://substackcdn.com/image/fetch/$s_!P41w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 848w, https://substackcdn.com/image/fetch/$s_!P41w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 1272w, https://substackcdn.com/image/fetch/$s_!P41w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P41w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png" width="1456" height="664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:327293,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P41w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 424w, https://substackcdn.com/image/fetch/$s_!P41w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 848w, https://substackcdn.com/image/fetch/$s_!P41w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 1272w, https://substackcdn.com/image/fetch/$s_!P41w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe75c75c-b5d6-448d-819c-a2074a3fdefe_1714x782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p>To create a preference dataset for DPO, prior models like Olmo 2 [3] leverage a synthetic data pipeline similar to UltraFeedback [20] that generates completions from a diverse pool of models. For each prompt, we do the following:</p><ul><li><p>Generate completions with each model.</p></li><li><p>Rate each completion with an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a>.</p></li><li><p>Form preference pairs based on these ratings (i.e., higher-scoring responses are preferred in a preference pair).</p></li></ul><p>This approach hinges upon the diversity of the underlying model pool to yield high-quality preference pairs. Applying a similar model pooling approach in the reasoning domain would be difficult, as the number of LLMs with open reasoning traces is limited&#8212;<em>most (proprietary) reasoning models surface only final outputs and hide their reasoning process</em>. Delta Learning uses an alternative approach of forming high-quality preference pairs by minimizing the quality of rejected completions. </p><p>This approach focuses less on the absolute quality of completions in a preference pair and more on the relative quality difference between the chosen and rejected completions. For example, authors in [1] show that further training the Olmo 3 Think SFT model on synthetic completions from <a href="https://huggingface.co/Qwen/Qwen3-32B">Qwen-3-32B</a> actually degrades performance. However, we can improve Olmo 3 Think SFT performance via DPO with preference pairs that contain <em>i)</em> a chosen completion from Qwen-3-32B and <em>ii)</em> a rejected completion from the weaker <a href="https://huggingface.co/Qwen/Qwen3-0.6B">Qwen-3-0.6B</a> model.</p><blockquote><p><em>&#8220;The intuition behind delta learning is that the quality of preference data depends primarily on the quality of the delta between chosen and rejected responses; the quality of either response individually is less important.&#8221;</em> - from [1]</p></blockquote><p>Olmo 3 Think DPO models are trained on Dolci Think DPO, a preference dataset comprised of completions with clear capability deltas that are generated using Delta Learning. As described above, model size is adopted as a simple heuristic for completion quality&#8212;<em>chosen completions are sampled from the 32B Qwen model, while rejection completions are sampled from the 0.6B model</em>. While all Olmo 3 Think SFT models are trained on a similarly-sized dataset, 7B and 32B Olmo 3 Think DPO models use preference datasets with <a href="https://huggingface.co/datasets/allenai/Dolci-Think-DPO-7B">150K</a> and <a href="https://huggingface.co/datasets/allenai/Dolci-Think-DPO-32B">200K</a> pairs, respectively.</p><p>Prompts for Dolci Think DPO are mostly reused from SFT, but additional sources of preference data from Olmo 2 (e.g., <a href="https://huggingface.co/datasets/openbmb/UltraFeedback">UltraFeedback</a> and <a href="https://huggingface.co/datasets/nvidia/Daring-Anteater">DaringAnteater</a>) are also added. The same filtering operations from SFT are used by DPO, but filtering is only applied to chosen completions&#8212;<em>rejected completions are left unfiltered</em>. Due to the computational expense of experiments with reasoning traces, a hierarchical approach is used for finding the best data mixture. First, a wide variety of mixing experiments are performed using standard LLMs that directly provide output with no reasoning. The top three data mixtures from this phrase are then used in full reasoning experiments to find the best-performing preference mix. </p><h4>RLVR with GRPO</h4><p>As a final touch, Olmo 3 Think models undergo RL training using a combination of verifiable and non-verifiable rewards to improve the models&#8217; reasoning skills while maintaining their general utility. The RL training process focuses upon the domains of math, code, instruction following, and general chat. </p><blockquote><p><em>&#8220;We introduce OlmoRL, which includes our algorithm and closely intertwined engineering infrastructure to address challenges for RL with long reasoning traces, extending RLVR to include a wider variety of verifiable tasks.&#8221;</em> - from [1]</p></blockquote><p>Detailed training configurations for each of the RL training processes performed using Olmo 3 are provided in the table shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Arq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Arq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 424w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 848w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 1272w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_Arq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png" width="1456" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:268856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Arq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 424w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 848w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 1272w, https://substackcdn.com/image/fetch/$s_!_Arq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3eba590-926b-4d5e-b1d9-bffebe7f6d6f_2092x1266.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Reward signals.</strong> Most recent work on RL for reasoning models considers a pure RLVR setup with only verifiable rewards. For example, many works apply RL in math or coding domains [15, 17], where we can easily check the correctness of the model&#8217;s output via rules or test cases. In [1], the standard RLVR setup is extended to include rewards from both deterministic verifiers and LLM judges; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AIw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AIw3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 424w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 848w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 1272w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AIw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:397385,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AIw3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 424w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 848w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 1272w, https://substackcdn.com/image/fetch/$s_!AIw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14fae1d0-17ba-401a-a949-ea497481ada4_2166x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The math domain uses a standard verifier that performs basic normalization of answers and equivalence checks via <a href="https://www.sympy.org/en/index.html">sympy</a> to yield a binary correctness score. For coding and instruction following, correctness is checked via either test cases or constraint-specific verification functions. The reward in these domains can be binary (i.e., all tests must pass to receive a reward) or the ratio of tests that pass.</p><p>The general chat domain is not verifiable&#8212;<em>we must rely upon an LLM judge to derive a reward</em>. Authors in [1] use <a href="https://huggingface.co/Qwen/Qwen3-32B">Qwen-3-32B</a> as their judge model with thinking mode turned off and the prompt shown below. Depending on if ground truth outputs are available, the judge can either be reference-based or reference-free.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h0nA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h0nA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 424w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 848w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h0nA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png" width="1456" height="1186" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248929,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h0nA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 424w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 848w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!h0nA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3728aed0-e9fd-4abb-99cc-ad6e08e4dbc9_1574x1282.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Enhancements to GRPO.</strong> Olmo 3 Think uses <a href="https://cameronrwolfe.substack.com/p/grpo">Group Relative Policy Optimization (GRPO)</a> as the underlying optimizer for RL training. Inspired by a swath of recent papers that propose useful modifications to GRPO, authors in [1] adopt a wide set of improvements to the vanilla GRPO algorithm. The following enhancements are used in particular:</p><ul><li><p><em>Zero Gradient Filtering</em>: prompts for which the entire group of completions or rollouts in GRPO receive the same reward are removed [16].  </p></li><li><p><em>Active Sampling</em>: despite filtering zero gradient examples, a constant batch size is maintained by ensuring additional samples are always available to replace those that get filtered [16].</p></li><li><p><em>Token-Level Loss</em>: the GRPO loss is normalized by the total number of tokens across the batch instead of per-sequence, which avoids instilling a length bias in the loss [16].</p></li><li><p><em>No KL Loss</em>: the KL divergence term is removed from the GRPO loss to allow for more flexibility in the policy updates, which is a common choice in recent reasoning research [16, 17, 18].</p></li><li><p><em>Clipping Upper Bound</em>: the upper-bound term in the <a href="https://cameronrwolfe.substack.com/i/175107358/proximal-policy-optimization-algorithms">PPO-style clipping</a> used by GRPO is set to a higher value than the lower bound to enable larger policy updates [16].</p></li><li><p><em>Truncated Importance Sampling (TIS)</em>: an extra importance sampling term is added to the GRPO loss to adjust for differences in log probabilities between engines used for training and inference [18]. </p></li><li><p><em>No Standard Deviation</em>: the standard deviation of rewards in a group is excluded from the denominator of the GRPO advantage calculation [19]. </p></li></ul><p>When considering all of these enhancements, the GRPO objective function is formulated as shown below. This objective maintains the structure of the GRPO objective, which is nearly identical to the objective from PPO but uses a modified advantage formulation. Compared to vanilla GRPO, however, we normalize the objective differently, slightly change the advantage, tweak the upper bound for clipping, and weight the objective using a capped importance sampling ratio.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ih7u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 424w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 848w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1272w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png" width="1456" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a761060e-d04d-4338-8ad9-412917fe2309_2374x693.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:276202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ih7u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 424w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 848w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1272w, https://substackcdn.com/image/fetch/$s_!Ih7u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa761060e-d04d-4338-8ad9-412917fe2309_2374x693.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Enhanced GRPO formulation for Olmo 3 (from [1])</figcaption></figure></div><p><strong>More details on TIS.</strong> During RL training, we are constantly alternating between two key operations:</p><ol><li><p><em>Rollouts</em>: given a set of prompts, sample multiple completions to each prompt using the current LLM (or policy). </p></li><li><p><em>Policy Updates</em>: computing an weight update for our LLM using the sampled rollouts and the objective function outlined above. </p></li></ol><p>To improve efficiency, these operations are usually handled by separate engines. We sample rollouts using an optimized inference engine like <a href="https://docs.vllm.ai/en/latest/">vLLM</a> or <a href="https://docs.sglang.io/">SGLang</a> and compute policy updates with training frameworks like transformers&#8212;<em>or usually a distributed version of this framework that uses an algorithm like <a href="https://arxiv.org/abs/2304.11277">FSDP</a> or <a href="https://arxiv.org/abs/1910.02054v3">ZeRO</a></em>. The use of different backends for rollouts and policy updates can lead to a mismatch between the two environments in which the log probabilities for a rollout differ significantly from those used in the policy update; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uA3X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uA3X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 424w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 848w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 1272w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uA3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png" width="1434" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:311076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uA3X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 424w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 848w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 1272w, https://substackcdn.com/image/fetch/$s_!uA3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b2343b6-246c-4930-b030-5583c50d4cb9_1434x698.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>This mismatch persists even when steps are taken to reduce differences between inference and training backends. As a solution, authors in [18] use a truncated importance sampling scheme that re-weights the GRPO objective by the ratio of log probabilities from the two engines. We cap (or truncate) this importance sampling ratio at a maximum value of <code>&#961;</code>. Without this correction, the RL training process becomes slightly <a href="https://cameronrwolfe.substack.com/p/online-rl">off-policy</a>, which can degrade performance. Using TIS re-weights examples with significant mismatches to solve this issue; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DZgv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DZgv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 424w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 848w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 1272w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DZgv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png" width="1436" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1436,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:426855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DZgv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 424w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 848w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 1272w, https://substackcdn.com/image/fetch/$s_!DZgv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e0fe21-e609-4a7d-918b-46084ec91bf1_1436x660.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><blockquote><p><em>&#8220;This importance sampling term seems to be essential to getting modern RL infrastructure right, as without it, scaling to more complex systems is hard to get numerical stability with&#8230;. the advantage or reward is getting re-weighted by an importance sampling log-ratio corresponding to the difference in probabilities from the two sets of model implementations (e.g. VLLM vs Transformers).&#8221;</em> - <a href="https://www.interconnects.ai/p/the-new-rl-scaling-laws">source</a></p></blockquote><p>The importance sampling expression used by TIS is derived from the <a href="https://en.wikipedia.org/wiki/Importance_sampling">statistical definition of importance sampling</a>. Formally, importance sampling is a statistical method used to estimate properties<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> of a target probability distribution <code>f(x)</code> by sampling from a different proposal distribution <code>g(x)</code>. Usually, taking samples from <code>g(x)</code> is much cheaper than <code>f(x)</code>, which is the motivation for importance sampling. Because sampling from <code>f(x)</code> is difficult, we instead draw samples from <code>g(x)</code> and correct for the discrepancy between <code>f(x)</code> and <code>g(x)</code> by weighting each sample by the importance ratio <code>f(x) / g(x)</code>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEKF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEKF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 424w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 848w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1272w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png" width="589" height="196.19848901098902" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b69437bf-88b3-4485-b263-f2828f40db17_2288x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:589,&quot;bytes&quot;:406204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iEKF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 424w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 848w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1272w, https://substackcdn.com/image/fetch/$s_!iEKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb69437bf-88b3-4485-b263-f2828f40db17_2288x762.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://ionides.github.io/pubs/ionides08-jcgs.pdf">source</a>)</figcaption></figure></div><p>In the case of RL, we are interested in log probabilities sampled from our training engine&#8212;<em>this is our target distribution </em><code>f(x)</code>. However, we can take samples more efficiently from our optimized inference engine&#8212;<em>this is our proposal distribution </em><code>g(x)</code>. From here, we can use importance sampling to correct for any mismatch between these two distributions. Specifically, the importance sampling ratio (highlighted in the explanation above) is <code>f(x) / g(x)</code>, or the log probability from the training engine divided by the log probability from the inference engine. As we might recall, <em>this is exactly the importance ratio used within TIS</em>!</p><p><strong>Dolci Think RL.</strong> Similarly to other training phases, prompts for RL training are sampled from a wide variety of public sources. The full dataset, called <a href="https://huggingface.co/datasets/allenai/Dolci-Think-RL-32B">Dolci Think RL</a>, contains ~100K prompts spanning math, code, instruction following and chat domains. When curating code data, we need pairs of problems with associated test cases, which are not always available. As a solution, authors in [1] develop the following synthetic data pipeline:</p><ul><li><p>Rewrite the problem and solution.</p></li><li><p>Generate test cases for the problem.</p></li><li><p>Execute the test cases to see if they pass.</p></li><li><p>Keep all problems that pass &gt;80% of test cases.</p></li><li><p>Remove any remaining test cases that fail. </p></li></ul><p>A similar rewriting and filtering approach is used for chat data. First, GPT-4.1 is used to rewrite samples for better clarity and reference answers are extracted. We then generate eight samples for each prompt using an LLM, compute the F1 score<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> between the reference answer and each response, then remove samples with an F1 score outside of the range <code>[0.1, 0.9]</code>. Intuitively, this filtering operation aims to remove noisy or difficult examples from RL training. </p><p>Prior to RL, the dataset also undergoes <strong>offline difficulty filtering</strong>. Concretely, this means that we:</p><ol><li><p>Generate eight rollouts for each prompt using the DPO model (i.e., the starting policy for RL training).</p></li><li><p>Remove and prompts that are already easily solved by the model before any RL (i.e., a majority pass rate of &gt;62.5%).</p></li></ol><p>The goal of difficulty filtering is to improve the sample efficiency from RL by not training on trivial data. This offline filtering is performed for the Olmo 3 Think 7B model, then the results of this filtering are re-used for the 32B model due to cost constraints. Intuitively, the 32B model should be able to solve any problem that is easily solved by the 7B model, and any remaining easy samples would still be filtered via active sampling in GRPO. In specific cases, authors also filter out data that is found to be too difficult for the model to solve during RL training.</p><blockquote><p><em>&#8220;We found RL experiments were both long and compute-expensive&#8230; we established a pipeline in which: we performed dataset-specific runs on an intermediate SFT checkpoint and observed downstream evaluation trends over the first 500-1000 RL steps; focused on math domain training when testing new algorithmic changes; periodically ran overall mixture experiments to ensure mixing was stable.&#8221;</em> - from [1]</p></blockquote><p>The prohibitive cost of RL training makes discovering optimal data mixtures more difficult relative to prior training phases, forcing authors to design cheaper proxy experiments for tuning their RL setup. Candidate data mixtures are vetted with short RL training runs (~1K training steps) and combined into a larger mixture that is intermittently tested in centralized experiments. Similarly, algorithmic changes, such as modifications to GRPO, are tested in a simplified single-objective (i.e., math only) RL environment. Most tuning is also performed with the 7B model, while the 32B model just uses the same settings. Put simply, any ablations must use a simplified setup&#8212;<em>running full RL training is too costly.</em></p><p><strong>Key findings.</strong> We learn in [1] that DPO tends to be a better starting point for RL training&#8212;<em>further preference tuning improves the performance of the SFT model (further SFT does not) and yields higher performance after downstream RL training</em>. Starting from a DPO checkpoint, training rewards increase steadily throughout the RL training process; see below. Additionally, training on a mixture of different reward signals is found to be beneficial, as it prevents over-optimization to a particular domain. The training reward is actually lower when reward signals are mixed together, but the model is found to generalize better in downstream evaluations. <em>This finding indicates that performing RL training over a diverse dataset with varying reward signals can aid performance and prevent reward hacking.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sAP0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sAP0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 424w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 848w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 1272w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sAP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png" width="1456" height="447" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sAP0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 424w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 848w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 1272w, https://substackcdn.com/image/fetch/$s_!sAP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e3ff3e6-6ce6-48e2-9fa5-c68ac077270d_2106x646.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><div class="pullquote"><p>&#8220;In RL, the key technical challenge for finetuning models that generate long sequences is managing inference &#8211; also called the rollouts. For our final models, we performed RL rollouts that were up to 32k tokens in length, and on average over 10k tokens (for the reasoner models). Inference dominated our costs, using 8 H100 nodes for training and 20 nodes for inference for the 32B OlmoRL reasoner model. Given the cost of autoregressive inference, our learner spends 75% of the time waiting for data, so in terms of GPU utilization, we use approximately 5x as much for inference vs training.&#8221; - from [1]</p></div><p><strong>Infrastructure for RL.</strong> One key focus of Olmo 3 is improving the efficiency of the RL training process. The cost of RL training is dominated by rollouts; e.g., Olmo 3 models use 5-14&#215; more compute for inference compared to policy updates. During RL training, most of the time is spent waiting for inference to finish, and this inference process can have a long tail if certain completions are longer than others. <em>All of these issues degrade throughput and lead to poor hardware utilization</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o2An!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o2An!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 424w, https://substackcdn.com/image/fetch/$s_!o2An!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 848w, https://substackcdn.com/image/fetch/$s_!o2An!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 1272w, https://substackcdn.com/image/fetch/$s_!o2An!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o2An!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png" width="1456" height="693" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:693,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264123,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o2An!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 424w, https://substackcdn.com/image/fetch/$s_!o2An!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 848w, https://substackcdn.com/image/fetch/$s_!o2An!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 1272w, https://substackcdn.com/image/fetch/$s_!o2An!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febee479c-8132-4047-83e0-f06ebbe05ac1_2096x998.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>To make the RL training process more efficient, authors in [1] propose OlmoRL, an optimized setup for RL training that focuses proposes the following:</p><ul><li><p>A <strong>fully-asynchronous</strong>, off-policy RL setup is used that reduces idle time by allowing inference and model updates to continue running without waiting for all components to finish. </p></li><li><p><strong>Continuous batching</strong> (see <a href="https://huggingface.co/blog/continuous_batching">here</a> for details) is used to constantly enqueue new inference requests in real-time as generations finish.</p></li><li><p>To compensate for examples removed by <strong>active sampling</strong>, OlmoRL&#8212;<em>due to its asynchronous setup</em>&#8212;can just continue sampling and filtering examples until the desired batch size is reached. </p></li><li><p><strong>Inflight updates</strong> to the model weights being used for inference are performed without pausing generation or clearing the KV cache, which is found in [1] to improve throughput by ~4&#215; with no deterioration in accuracy.</p></li></ul><p>Several low-level threading updates are also made to each of the inference and policy update actors; see <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/grpo_fast.py">here</a> for the full code. When applied in tandem, the set of optimizations proposed for OlmoRL allows the wall-clock RL training time of Olmo 3 RL Think to be decreased from over 15 days to ~6 days!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gzwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gzwp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 424w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 848w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 1272w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gzwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png" width="1456" height="801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gzwp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 424w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 848w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 1272w, https://substackcdn.com/image/fetch/$s_!gzwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0585a022-b994-4b86-a440-39eb57b6cb7c_3414x1877.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://substack.com/@natolambert/note/c-187080576?r=hpcuh&amp;utm_source=notes-share-action&amp;utm_medium=web">source</a>)</figcaption></figure></div><p><strong>Olmo 3.1 Think.</strong> After the initial release of Olmo 3, authors kept the RL training process running for an extra three weeks, producing the <a href="https://huggingface.co/allenai/Olmo-3.1-32B-Think">Olmo 3.1 Think</a> model. This model perfectly demonstrates the value of scaling RL training and the necessity of creating stable RL training frameworks (like OlmoRL) that can run for long periods of time without instability. After the initial release, authors were unsure whether continuing the RL training process would yield further benefits, but the model continued to improve during this time. Interestingly, the model&#8217;s performance was also <a href="https://substack.com/@natolambert/note/c-187080576?r=hpcuh&amp;utm_source=notes-share-action&amp;utm_medium=web">still not fully saturated</a> after further training. </p><h4>RL-Zero</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8rFM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p><a href="https://cameronrwolfe.substack.com/i/153722335/deepseek-r-zero">DeepSeek-R1-Zero</a> [9] demonstrated that LLMs can learn complex reasoning behavior by applying RL training directly to a base model (i.e., with no SFT); see above. This work was the first to demonstrate that reasoning capabilities could be developed without supervised data, making the RL-Zero setup&#8212;<em>or just running RLVR on top of a base model</em>&#8212;a popular benchmark for RL research. Although this setup is widely used in LLM research, most RL-Zero experiments are performed using models with no data transparency, preventing proper decontamination.</p><blockquote><p><em>&#8220;[A lack of data transparency] can lead to a myriad of issues with benchmark evaluations being contaminated; e.g. midtraining data containing the evaluation which makes spurious rewards as effective as true reward or improvements from fixing prompt templates outweighing the improvements from RL.&#8221;</em> - from [1]</p></blockquote><p>Going further, a variety of unexpected findings have been recently published by work that leverages an RL-Zero setup. For example, researchers have shown:</p><ul><li><p>RLVR with random rewards still improve model performance [12].</p></li><li><p>RLVR with a single training example can improve performance [13]. </p></li><li><p>Base models can match the reasoning capabilities of models trained with RLVR if a sufficient number of samples are taken per prompt [14]. </p></li></ul><p>Understanding the cause of these findings is necessary to develop a deeper collective knowledge of RL training. Although many hypothesis exist, one possible explanation for this behavior is data contamination&#8212;<em>these observations may simply be an artifact of evaluation data leaking into the base model&#8217;s dataset</em>. Unfortunately, existing RL-Zero setups provide no way of validating the impact of data contamination, which makes drawing definitive conclusions from this work difficult (potentially even impossible). Olmo 3 solves this problem!</p><p>Authors in [1] release a fully open RL-Zero setup based upon Olmo 3 Base, which has fully transparent pretraining and midtraining datasets, and a new dataset for RLVR called <a href="https://huggingface.co/datasets/allenai/Dolci-RL-Zero-Math-7B">Dolci RL-Zero</a>. While most RL-Zero setups are single-objective (e.g., running RLVR from a base model on <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">Math-500</a> is a <a href="https://sebastianraschka.com/blog/2025/hello-world-ai.html">common benchmark</a>), Dolci RL-Zero is comprised of four domains: math, code, precise instruction following, and a mixture of all three objectives. Additionally, decontamination&#8212;<em>or ensuring pretraining and midtraining data have no overlap with evaluation data</em>&#8212;is prioritized, allowing more confident conclusions to be drawn from experiments with RLVR. </p><p><strong>Notable findings.</strong> The RL-Zero setup proposed by Olmo 3 is mostly positioned as a cleaner and more reliable starting point for future research. However, authors in [1] also perform some interesting analysis using this setup. First, we see in [1] that using simpler prompt templates&#8212;<em>mostly text-based with no special tokens as shown below</em>&#8212;is more conducive to performant RLVR. This behavior stems from base models being primarily trained on raw text without special tokens or templates. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6dEm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6dEm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 424w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 848w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 1272w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6dEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png" width="1456" height="357" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:357,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105173,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6dEm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 424w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 848w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 1272w, https://substackcdn.com/image/fetch/$s_!6dEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad27640e-8ffc-4e68-b3e7-ee33737d4c2f_2178x534.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>To achieve the best results, Olmo 3 RL-Zero performs lightweight prompt tuning with the base model to derive a simple, custom prompt with no special formatting for each RL domain. Performance of RL-Zero models is shown in [1] to improve steadily throughout RL training in terms of both training reward and held-out evaluation metrics; see below. As expected, we see an improvement in Pass@1 metrics, aligning with prior findings on RLVR<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> [15]. Interestingly, we also see a slight improvement in Pass@32<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> metrics, indicating that the base model learns to solve some problems that go beyond its initial reasoning capabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HkYl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HkYl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 424w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 848w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HkYl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png" width="1456" height="854" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:676203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HkYl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 424w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 848w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!HkYl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d258d2-0f63-4c95-8368-a705fd5df1d0_2166x1270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The multi-objective nature of Olmo 3 RL-Zero also presents new challenges in RLVR research. We see above that models trained over the mix of rewards from each domain improve in their performance, but they still lag behind models that are explicitly trained on a single domain. Solving this under-optimization and developing effective techniques for balancing multi-objective RLVR is a tough research problem, but Olmo 3 provides a clean and efficient&#8212;<em>RL-Zero is cheaper than the full post-training pipeline for Olmo 3 Think!</em>&#8212;test bed for further analysis. For example, the Dolci RL-Zero setup is used in [1] to test several changes to the underlying RL algorithm, as well as to study the impact of different data mixtures during midtraining on the downstream RL training process. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wr6K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wr6K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 424w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 848w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 1272w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wr6K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png" width="1456" height="497" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:351747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wr6K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 424w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 848w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 1272w, https://substackcdn.com/image/fetch/$s_!Wr6K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c71ec2b-0d34-4896-9ca1-64b950bff45f_2098x716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p></p><p><strong>Fixing RLVR with random rewards.</strong> RLVR with random rewards no longer benefits model performance when using the decontaminated Olmo 3 RL-Zero setup; see above. Although this finding clearly demonstrates the value of fully-open models for research, the results shown for RLVR with random rewards in [12] may still not be completely a product of data contamination. As shown below, these results were only found to hold true for the Qwen-2.5 model series on the Math 500 dataset&#8212;<em>other models and tasks did not clearly benefit from random rewards</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KHM7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KHM7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 424w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 848w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 1272w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KHM7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png" width="1456" height="682" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:682,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KHM7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 424w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 848w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 1272w, https://substackcdn.com/image/fetch/$s_!KHM7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4acb4edd-aee0-40b4-bbaa-91ebddda978f_2048x959.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>Therefore, there may be some unique aspects of Qwen-2.5&#8212;<em>including potential data contamination</em>&#8212;that lead to these observations, which are very hard to debug without full openness. For example, beyond data contamination, an alternative rationale exists for the performance benefit of RLVR with random rewards:</p><ul><li><p>Qwen models are very good at generating code to assist in solving math reasoning problems, and code reasoning&#8212;<em>even when no code execution is allowed</em>&#8212;is positively correlated with math performance.</p></li><li><p>In the DAPO [16] paper, authors observe that <a href="https://thegradient.pub/understanding-evaluation-metrics-for-language-models/">entropy</a> decreases quickly during both PPO and GRPO training. Token distributions become concentrated, so outputs are similar when you sample multiple times and existing model behaviors are reinforced (i.e., made more likely).</p></li><li><p>This entropy collapse occurs because the clipping operation in PPO (and GRPO) restricts policy updates for low probability tokens more strictly than for high probability tokens, due to the structure of the <a href="https://cameronrwolfe.substack.com/i/175107358/trust-region-policy-optimization-trpo">policy ratio</a>.</p></li><li><p>To solve this issue, DAPO recommends a &#8220;clip higher&#8221; approach, which increases the upper bound of the clipping range in PPO so that clipping is not too restrictive of policy updates.</p></li></ul><p>In the case of RLVR with random rewards, clipping can reinforce the existing behavior of performing code reasoning for solving math problems in Qwen-2.5 and, in turn, improve its performance. Although this behavior is not observed in Olmo 3, the GRPO variant used in [1] also adopts the clip higher approach from DAPO. As a result, it is unclear in this case whether RLVR from random rewards is fixed due to algorithmic changes or the lack of data contamination. However, <em>analyzing such a property would be impossible without fully open models like Olmo 3</em>.</p><h2>Instruct Models</h2><p>Although reasoning models are very powerful, much of the <a href="https://openai.com/index/how-people-are-using-chatgpt/">real-world usage for LLMs</a> is still based on general tasks that do not require extensive reasoning (e.g., information or advice-seeking queries). With this in mind, authors in [1] create Instruct versions of the Olmo 3 models that quickly respond to user queries without the need to output a reasoning trajectory. The training pipeline for Olmo 3 Instruct is similar to that of the Think models&#8212;<em>it includes SFT, DPO and RLVR</em>. Rather than focusing upon reasoning, however, the data used for Instruct post-training emphasizes multi-turn chat, conciseness of responses, and tool use.</p><blockquote><p><em>&#8220;Everyday chat settings often do not require the inference-time scaling of Olmo 3 Think, allowing us to be more efficient at inference time on common tasks by not generating extended internal thoughts.&#8221;</em> - from [1]</p></blockquote><p><strong>Instruct evaluation.</strong> The benchmarks used for evaluating Olmo 3 Instruct models include benchmarks from Olmo 3 Think along with a few additional benchmarks (i.e., <a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkley function calling leaderboard</a>, <a href="https://github.com/Future-House/LAB-Bench">LitQA2</a>, and <a href="https://openai.com/index/introducing-simpleqa/">SimpleQA</a>) for evaluating function calling capabilities. As shown below, Olmo 3 Instruct models are found to benefit significantly from tool use, indicating that post-training has instilled correct tool usage behavior. Across other benchmarks, Olmo 3 Instruct models perform comparably to popular non-thinking models. Interestingly, Olmo 3 outperforms Qwen-3 with thinking mode turned off at the 7B scale on several benchmarks, though this gap in performance is not present at the 32B scale.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xQUz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xQUz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 424w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 848w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 1272w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xQUz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png" width="1456" height="452" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:452,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xQUz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 424w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 848w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 1272w, https://substackcdn.com/image/fetch/$s_!xQUz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3719de-be3f-419f-9dc5-1f3c4b116df1_2088x648.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>SFT.</strong> A new <a href="https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT">Dolci Instruct SFT dataset</a> is created for Olmo 3 Instruct models that emphasizes multi-turn chat and agentic capabilities (i.e., function calling). This dataset builds upon that of Olmo 2 [3] but makes a few key changes:</p><ul><li><p>Any reasoning traces that exist in the data are removed. </p></li><li><p>Synthetic completions are updated to use newer model generations (e.g., GPT-4.1 instead of GPT-3.5 or GPT-4). </p></li><li><p>An extensive set of supervised function calling examples is included.</p></li></ul><p>When curating function calling data, authors focus heavily upon collecting data in realistic environments, primarily <a href="https://www.anthropic.com/news/model-context-protocol">MCP servers</a>. More specifically, there are two key strategies used:</p><ol><li><p><em>Real trajectories</em>: ScienceQA and WebSearchQA datasets are created by using GPT-4.1 or GPT-5&#8212;<em>equipped with tools for <a href="https://serper.dev/">querying the internet</a> or <a href="https://allenai.org/asta/resources/mcp">a corpus of scientific papers</a> via separate MCP servers</em>&#8212;to generate problem solving trajectories for real-world queries.</p></li><li><p><em>Simulated interactions</em>: starting with a pool of tools and API specifications taken from public datasets, a large synthetic function calling dataset is created by prompting a pool of LLMs (GPT-4o, GPT-4.1, and GPT-5) to generate user queries, tool responses, and assistant messages.</p></li></ol><p>Executable function calling environments provide valuable training data by exposing the model to complex interactions with real tool outputs&#8212;<em>including errors</em>. Because collecting real tool-use data is hard to scale, however, simulated environments are used to create data for a wider set of function calling scenarios; see below for details. While real trajectories are more complex, simulated data has higher tool diversity and can be used to create examples with both multiple chat turns and multiple agent-environment interaction steps.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KmR9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KmR9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 424w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 848w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 1272w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KmR9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png" width="1456" height="340" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224460,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KmR9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 424w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 848w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 1272w, https://substackcdn.com/image/fetch/$s_!KmR9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b580dbf-1932-4fc0-8ca1-c67526eed370_3052x712.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Interestingly, we see in [1] that using a unified format for function calling data is necessary for the model to perform well. Specifically, authors provide a tool spec in the system prompt, wrap tool calls in XML tags in assistant messages, and use a special environment role&#8212;<em>represented with dedicated special tokens</em>&#8212;for all tool outputs. An example of the unified tool format for Olmo 3 is shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gu-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gu-6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 424w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 848w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gu-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png" width="1456" height="1196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1196,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:450006,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gu-6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 424w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 848w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faac52be3-5572-4d26-8f64-5d5d319a1a9f_1660x1364.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tool calling example from Dolci Instruct SFT</figcaption></figure></div><p>To obtain the final data mixture for <a href="https://huggingface.co/allenai/Olmo-3-7B-Instruct-SFT">Olmo 3 Instruct SFT</a>, authors adopt the same strategy used for tuning the Olmo 3 Think models. Namely, we start with a base data mixture of 100K supervised examples and ablate the performance impact of each data domain that is added on top of the original dataset from Olmo 2 [3].</p><blockquote><p><em>&#8220;We find that training our instruct model on top of the thinking SFT model both increases model performance on benchmarks&#8230; and also does not increase average model response length.&#8221;</em> - from [1] </p></blockquote><p>As we might recall from the model flow at the beginning of this overview, the Olmo 3 Instruct models are trained starting from Olmo 3 Think SFT, which the authors find to benefit performance of Instruct models.</p><p><strong>DPO.</strong> Olmo 3 Instruct models are trained using a similar (but expanded) Delta Learning approach that is adapted from Olmo 3 Think to better prioritize general chat capabilities. Specifically, three types of preference pairs are used:</p><ul><li><p><strong>Delta Learning </strong>is used to construct contrastive preference pairs in an identical fashion to Olmo 3 Think, but both chosen and rejected completions are generated via Qwen-3 with thinking mode turned off.</p></li><li><p><strong>Delta-maximized GPT-judged pairs</strong> are created by generating synthetic completions from a pool of diverse models (including at least one model that is known to be much worse than the others), scoring them with a GPT-4.1 judge, then choosing the best and worse completion as a preference pair. </p></li><li><p><strong>Multi-turn preferences</strong> are synthetically generated by first prompting an LLM to self-talk or synthetically generate context from an existing prompt to create multi-turn chat data, then sampling a final assistant response for this multi-turn chat via Delta Learning.</p></li></ul><p>Multi-turn preferences only differ in the final assistant response, where chosen and rejected completions use models with a large quality gap (e.g., GPT-3.5 versus GPT-4.1 or Qwen-3-0.6B versus Qwen-3-32B) to generate this final turn. The GPT-judged preference data pipeline is inspired by UltraFeedback [20] but has been updated to use a more modern model pool and LLM judge; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cfxF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cfxF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 424w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 848w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cfxF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png" width="1456" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:548417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cfxF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 424w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 848w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 1272w, https://substackcdn.com/image/fetch/$s_!cfxF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72acca4f-219e-4c8d-86c9-0e7acbc2882f_2654x1104.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [20])</figcaption></figure></div><p>Interestingly, authors in [1] mention that naively applying the UltraFeedback approach with a modern model pool and judge performs poorly&#8212;<em>all models in the modern pool perform well and tend to have minimal quality deltas in their output. </em>As a solution, Olmo 3 proposes a &#8220;Delta Maximization&#8221; approach that <em>i)</em> ensures that at least one model in the pool is of much lower quality than others and <em>ii)</em> always constructs preference pairs from the best and worst completion in the pool.  </p><blockquote><p><em>&#8220;Our initial attempts to modernize the Ultrafeedback pipeline from OLMo 2 and T&#252;lu 3 by improving the quality of the LLM judge (GPT-4o &#8594; GPT-4.1) and updating our data-generator model pool failed to yield gains and even hurt relative to the OLMo 2 preference dataset baseline.&#8221;</em> - from [1]</p></blockquote><p>Ensuring a large delta between preference pairs is found to be essential for model performance. Additionally, we see clear benefits in [1] by combining GPT-judge preference pairs with those from Delta Learning, revealing the benefit of using different preference signals. We also see in [1] that <a href="https://cameronrwolfe.substack.com/i/141159804/biases-and-how-we-can-avoid-them">verbosity bias</a>&#8212;<em>or the tendency of LLM judges to prefer longer completions</em>&#8212;noticeably impacts synthetic preference pipelines. To promote concise responses, chat-based preference pairs are filtered such that chosen and rejected completions do not differ in length by more than 100 tokens<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>; see below. Length control deteriorates certain benchmark scores but also improves usability, leads to better vibe tests, and is&#8212;<em>somewhat counterintuitively</em>&#8212;determined to be a superior starting point for RL training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9hpF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9hpF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 424w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 848w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9hpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png" width="1456" height="719" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:719,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:322877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/179769076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9hpF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 424w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 848w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!9hpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52b2060a-abfd-43ee-ae87-fe9dbaefb9fb_2094x1034.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>As shown in the figure above, DPO performance does not improve monotonically with more data and the optimal amount of data is task-dependent. In other words, <em>the total size of the training dataset is a hyperparameter that must be tuned</em>. In [1], the optimal data size and mixture is determined via a combination of:</p><ol><li><p>Ablation experiments that combine different data sources with an 100K base mixture to determine data viability.</p></li><li><p>Mixing experiments that combine 50K examples from the base mixture with 50K examples from various data sources to test the impact of up-sampling a particular source of preference pairs. </p></li><li><p>One-off tests of hand-crafted data mixtures determined by expert intuition.</p></li></ol><p>The behavior of DPO training is less predictable, and the final training strategy was determined empirically. Authors manually selected nine different mixtures to compare to a uniform sampling baseline and performed hyperparameter sweeps to determine the optimal amount of training data and learning rate. The final checkpoint is selected via a combination of vibe tests and benchmark scores.</p><p><strong>RL.</strong> The RL training process for Olmo 3 Instruct is identical to that of Olmo 3 Think aside from a few minor modifications:</p><ul><li><p>Using less challenging datasets (i.e., by removing the most difficult tasks) in the math and coding domains.</p></li><li><p>Removing the offline difficulty filtering step. This step is unnecessary for Instruct models due to focusing less on complex reasoning. </p></li></ul><p>Olmo 3 Instruct models are trained on a mixture of general chat, math, and code data using the same RL training stack as Olmo 3 Think. However, the maximum response length is capped at 8K tokens to avoid excessively long outputs. The full RL pipeline is applied to multiple DPO models, and the final model is chosen via a combination of <em>&#8220;final average performance, length analysis, and vibe-tests.&#8221;</em></p><h2>The Open LLM Renaissance</h2><p>AI research has traditionally been very transparent, but the level of openness has decreased during the LLM boom as top labs have focused efforts on proprietary models (e.g., GPT, Gemini, or Claude) with little transparency. Open models have always been a topic of discussion, but the level of interest in open LLM research skyrocketed with the release of DeepSeek-R1 [9]. After this release, a variety of (primarily Chinese) AI labs followed suit by releasing great models like Qwen-3, <a href="https://arxiv.org/abs/2507.20534">Kimi-K2</a>, <a href="https://www.minimax.io/news/minimax-m2">MiniMax M2</a>, <a href="https://arxiv.org/abs/2508.06471">GLM-4.5</a>, and more; see below for details.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:181259397,&quot;url&quot;:&quot;https://www.interconnects.ai/p/2025-open-models-year-in-review&quot;,&quot;publication_id&quot;:48206,&quot;publication_name&quot;:&quot;Interconnects&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!djof!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png&quot;,&quot;title&quot;:&quot;2025 Open Models Year in Review&quot;,&quot;truncated_body_text&quot;:&quot;Welcome to the first Artifacts Recap, where we highlight the most notable and impactful open model releases of this year. And what a year it has been! Starting into the year, the open model landscape was seen as lagging behind severely, with open models being mostly a choice for those who needed privacy or wanted to fine-tune models for their use cases.&quot;,&quot;date&quot;:&quot;2025-12-14T20:01:01.918Z&quot;,&quot;like_count&quot;:14,&quot;comment_count&quot;:7,&quot;bylines&quot;:[{&quot;id&quot;:41984689,&quot;name&quot;:&quot;Florian Brand&quot;,&quot;handle&quot;:&quot;xeophon&quot;,&quot;previous_name&quot;:&quot;Xeophon&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!jqwS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff675812d-871a-46fc-913e-68f9b57cc790_666x666.jpeg&quot;,&quot;bio&quot;:&quot;PhD Student, Trier University | https://florianbrand.de/ | Opinions my own.&quot;,&quot;profile_set_up_at&quot;:&quot;2024-05-17T20:22:24.904Z&quot;,&quot;reader_installed_at&quot;:&quot;2025-05-06T12:05:56.163Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:3754268,&quot;user_id&quot;:41984689,&quot;publication_id&quot;:48206,&quot;role&quot;:&quot;contributor&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:48206,&quot;name&quot;:&quot;Interconnects&quot;,&quot;subdomain&quot;:&quot;robotic&quot;,&quot;custom_domain&quot;:&quot;www.interconnects.ai&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;The cutting edge of AI, from inside the frontier AI labs, minus the hype. The border between high-level and technical thinking. Read by leading engineers, researchers, and investors.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png&quot;,&quot;author_id&quot;:10472909,&quot;primary_user_id&quot;:10472909,&quot;theme_var_background_pop&quot;:&quot;#ff6b00&quot;,&quot;created_at&quot;:&quot;2020-05-21T02:59:47.895Z&quot;,&quot;email_from_name&quot;:&quot;Interconnects by Nathan Lambert&quot;,&quot;copyright&quot;:&quot;Interconnects AI, LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;magaziney&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null,&quot;paidPublicationIds&quot;:[],&quot;subscriber&quot;:null}},{&quot;id&quot;:10472909,&quot;name&quot;:&quot;Nathan Lambert&quot;,&quot;handle&quot;:&quot;natolambert&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!RihO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fedcdfb-e137-4f6a-9089-a46add6c6242_500x500.jpeg&quot;,&quot;bio&quot;:&quot;ML researcher making sense of AI research, products, and the uncertain technological future. PhD from Berkeley AI. Experience at Meta, DeepMind, HuggingFace.&quot;,&quot;profile_set_up_at&quot;:&quot;2021-04-24T01:19:33.371Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-03-09T17:52:30.690Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:100753,&quot;user_id&quot;:10472909,&quot;publication_id&quot;:48206,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:48206,&quot;name&quot;:&quot;Interconnects&quot;,&quot;subdomain&quot;:&quot;robotic&quot;,&quot;custom_domain&quot;:&quot;www.interconnects.ai&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;The cutting edge of AI, from inside the frontier AI labs, minus the hype. The border between high-level and technical thinking. Read by leading engineers, researchers, and investors.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png&quot;,&quot;author_id&quot;:10472909,&quot;primary_user_id&quot;:10472909,&quot;theme_var_background_pop&quot;:&quot;#ff6b00&quot;,&quot;created_at&quot;:&quot;2020-05-21T02:59:47.895Z&quot;,&quot;email_from_name&quot;:&quot;Interconnects by Nathan Lambert&quot;,&quot;copyright&quot;:&quot;Interconnects AI, LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding Member&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;magaziney&quot;,&quot;is_personal_mode&quot;:false}},{&quot;id&quot;:4610799,&quot;user_id&quot;:10472909,&quot;publication_id&quot;:4519930,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:4519930,&quot;name&quot;:&quot;natolambert overflow&quot;,&quot;subdomain&quot;:&quot;natolambert&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;a place for any extra thoughts beyond Interconnects.ai&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb88d599-32c8-49a9-ba33-ab6327aff727_256x256.png&quot;,&quot;author_id&quot;:10472909,&quot;primary_user_id&quot;:null,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-03-27T15:04:05.448Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Nathan Lambert&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}},{&quot;id&quot;:4926744,&quot;user_id&quot;:10472909,&quot;publication_id&quot;:4830082,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:4830082,&quot;name&quot;:&quot;Retort AI&quot;,&quot;subdomain&quot;:&quot;retortai&quot;,&quot;custom_domain&quot;:&quot;www.retortai.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Distilling the major events and challenges in the world of artificial intelligence and machine learning, from Thomas Krendl Gilbert and Nathan Lambert.\n\n&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbad298c-6074-441b-ad43-d5df6dbf101d_800x800.png&quot;,&quot;author_id&quot;:10472909,&quot;primary_user_id&quot;:null,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-04-25T22:10:28.216Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Nathan Lambert&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;natolambert&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100,&quot;status&quot;:{&quot;bestsellerTier&quot;:100,&quot;subscriberTier&quot;:5,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:100},&quot;paidPublicationIds&quot;:[883883,1915042,1084918,6349492,6027,69345,1084089],&quot;subscriber&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://www.interconnects.ai/p/2025-open-models-year-in-review?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!djof!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png" loading="lazy"><span class="embedded-post-publication-name">Interconnects</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">2025 Open Models Year in Review</div></div><div class="embedded-post-body">Welcome to the first Artifacts Recap, where we highlight the most notable and impactful open model releases of this year. And what a year it has been! Starting into the year, the open model landscape was seen as lagging behind severely, with open models being mostly a choice for those who needed privacy or wanted to fine-tune models for their use cases&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">5 months ago &#183; 14 likes &#183; 7 comments &#183; Florian Brand and Nathan Lambert</div></a></div><p>Despite the boom in open LLM research, open LLM releases were minimal in Western countries aside from models like <a href="https://cameronrwolfe.substack.com/p/gpt-oss">GPT-OSS</a> and <a href="https://mistral.ai/news/mistral-3">Mistral</a>. Additionally, the models that were released are almost exclusively open-weights, rather than being fully open&#8212;<em>i.e., no code or data transparency is provided</em>. These issues inspired the creation of initiatives like the <a href="https://www.atomproject.ai/">ATOM project</a> and have driven investment into the Olmo model series. As we have seen, Olmo 3 models still lag behind their open-weight counterparts, but we should remember the following points:</p><ol><li><p>Progress between the <a href="https://arxiv.org/abs/2402.00838">original Olmo model</a> and Olmo 3 is significant. </p></li><li><p>No other fully-open model series has neared state-of-the-art performance.</p></li><li><p>The impact of Olmo 3 goes beyond just the models themselves.</p></li></ol><p>The artifacts released by Olmo 3 are more than a model&#8212;<em>they are a starting point for any aspect of open LLM research</em>. Anyone with access to GPUs has the ability to clone and iterate upon the model flows proposed in [1]. Performing this kind of research before Olmo 3 may have required first crafting a functional training recipe, which would (conservatively) require millions of dollars in experiments.</p><div class="comment" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/home&quot;,&quot;commentId&quot;:187116341,&quot;comment&quot;:{&quot;id&quot;:187116341,&quot;date&quot;:&quot;2025-12-12T18:43:23.495Z&quot;,&quot;edited_at&quot;:&quot;2025-12-12T18:44:31.707Z&quot;,&quot;body&quot;:&quot;My favorite bit of the Olmo 3 paper: Transparent auditing of the cost for the v3 models (not 3.1), based of wall clock time for pre/post train, evals, cluster issues, etc, as a counter to the famous $5.576M for DeepSeek V3.\n\nat $2/H100 hour, Olmo 3 start to end would cost $2.75M. \n\nhttps://allenai.org/papers/olmo3&quot;,&quot;body_json&quot;:{&quot;type&quot;:&quot;doc&quot;,&quot;attrs&quot;:{&quot;schemaVersion&quot;:&quot;v1&quot;},&quot;content&quot;:[{&quot;type&quot;:&quot;paragraph&quot;,&quot;content&quot;:[{&quot;type&quot;:&quot;text&quot;,&quot;text&quot;:&quot;My favorite bit of the Olmo 3 paper: Transparent auditing of the cost for the v3 models (not 3.1), based of wall clock time for pre/post train, evals, cluster issues, etc, as a counter to the famous $5.576M for DeepSeek V3.&quot;}]},{&quot;type&quot;:&quot;paragraph&quot;,&quot;content&quot;:[{&quot;type&quot;:&quot;text&quot;,&quot;text&quot;:&quot;at $2/H100 hour, Olmo 3 start to end would cost $2.75M. &quot;}]},{&quot;type&quot;:&quot;paragraph&quot;,&quot;content&quot;:[{&quot;type&quot;:&quot;text&quot;,&quot;marks&quot;:[{&quot;type&quot;:&quot;link&quot;,&quot;attrs&quot;:{&quot;href&quot;:&quot;https://allenai.org/papers/olmo3&quot;,&quot;target&quot;:&quot;_blank&quot;,&quot;rel&quot;:&quot;nofollow ugc noopener&quot;,&quot;class&quot;:&quot;note-link&quot;}}],&quot;text&quot;:&quot;https://allenai.org/papers/olmo3&quot;}]}]},&quot;restacks&quot;:0,&quot;reaction_count&quot;:14,&quot;attachments&quot;:[{&quot;id&quot;:&quot;08d12d35-5d14-456a-a1bd-a50eb900411f&quot;,&quot;type&quot;:&quot;image&quot;,&quot;imageUrl&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c52e75e7-a0f6-4593-a9d9-703e5f2d63f8_1598x1334.png&quot;,&quot;imageWidth&quot;:1598,&quot;imageHeight&quot;:1334,&quot;explicit&quot;:false}],&quot;name&quot;:&quot;Nathan Lambert&quot;,&quot;user_id&quot;:10472909,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!RihO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fedcdfb-e137-4f6a-9089-a46add6c6242_500x500.jpeg&quot;,&quot;user_bestseller_tier&quot;:100,&quot;userStatus&quot;:{&quot;bestsellerTier&quot;:100,&quot;subscriberTier&quot;:5,&quot;leaderboard&quot;:{&quot;ranking&quot;:&quot;paid&quot;,&quot;rank&quot;:38,&quot;publicationName&quot;:&quot;Interconnects&quot;,&quot;label&quot;:&quot;Technology&quot;,&quot;categoryId&quot;:&quot;4&quot;,&quot;publicationId&quot;:48206},&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:100},&quot;paidPublicationIds&quot;:[883883,1915042,1084918,6349492,6027,69345,1084089],&quot;subscriber&quot;:null}}}" data-component-name="CommentPlaceholder"></div><p>With this in mind, <em>resources from Olmo 3 will fuel open research for the foreseeable future</em>. We are already seeing positive signs in this direction with models like <a href="https://www.primeintellect.ai/blog/intellect-3">Intellect-3</a>, <a href="https://www.arcee.ai/blog/the-trinity-manifesto">Trinity</a>, and <a href="https://mistral.ai/news/mistral-3">Mistral 3</a> being released immediately after Olmo 3. </p><h4><strong>New to the newsletter?</strong></h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] OLMo, Team, et al. &#8220;Olmo 3&#8221; <em><a href="https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf">https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf</a> </em>(2025).</p><p>[2] Hugging Face Team. &#8220;Smol-LLM Training Playbook.&#8221; https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook (2025).</p><p>[3] OLMo, Team, et al. &#8220;2 OLMo 2 Furious.&#8221; <em>arXiv preprint arXiv:2501.00656</em> (2024).</p><p>[4] Groeneveld, Dirk, et al. &#8220;OLMo: Accelerating the science of language models.&#8221; <em>Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers)</em>. 2024.</p><p>[5] Liu, Qian, et al. &#8220;Regmix: Data mixture as regression for language model pre-training.&#8221; <em>arXiv preprint arXiv:2407.01492</em> (2024).</p><p>[6] Li, Yunshui, et al. &#8220;Model Merging in Pre-training of Large Language Models.&#8221; <em>arXiv preprint arXiv:2505.12082</em> (2025).</p><p>[7] Pham, Chau Minh, Yapei Chang, and Mohit Iyyer. &#8220;CLIPPER: Compression enables long-context synthetic data generation.&#8221; <em>arXiv preprint arXiv:2502.14854</em> (2025).</p><p>[8] Peng, Bowen, et al. &#8220;Yarn: Efficient context window extension of large language models.&#8221; <em>arXiv preprint arXiv:2309.00071</em> (2023).</p><p>[9] Guo, Daya, et al. &#8220;Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.&#8221; <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[10] Lambert, Nathan, et al. &#8220;Tulu 3: Pushing frontiers in open language model post-training.&#8221; <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[11] Geng, Scott, et al. &#8220;The delta learning hypothesis: Preference tuning on weak data can yield strong gains.&#8221; <em>arXiv preprint arXiv:2507.06187</em> (2025).</p><p>[12] Shao, Rulin, et al. &#8220;Spurious rewards: Rethinking training signals in rlvr.&#8221; <em>arXiv preprint arXiv:2506.10947</em> (2025).</p><p>[13] Wang, Yiping, et al. &#8220;Reinforcement learning for reasoning in large language models with one training example.&#8221; <em>arXiv preprint arXiv:2504.20571</em> (2025).</p><p>[14] Yue, Yang, et al. &#8220;Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?.&#8221; <em>arXiv preprint arXiv:2504.13837</em> (2025).</p><p>[15] Shao, Zhihong, et al. &#8220;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.&#8221; <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[16] Yu, Qiying, et al. &#8220;Dapo: An open-source llm reinforcement learning system at scale.&#8221; <em>arXiv preprint arXiv:2503.14476</em> (2025).</p><p>[17] Zeng, Aohan, et al. &#8220;Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.&#8221; <em>arXiv preprint arXiv:2508.06471</em> (2025).</p><p>[18] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL <a href="https://fengyao.notion.site/off-policy-rl">https://fengyao.notion.site/off-policy-rl</a>.</p><p>[19] Liu, Zichen, et al. &#8220;Understanding r1-zero-like training: A critical perspective.&#8221; <em>arXiv preprint arXiv:2503.20783</em> (2025).</p><p>[20] Cui, Ganqu, et al. &#8220;Ultrafeedback: Boosting language models with scaled ai feedback.&#8221; <em>arXiv preprint arXiv:2310.01377</em> (2023).</p><p>[21] Yang, An, et al. &#8220;Qwen3 technical report.&#8221; <em>arXiv preprint arXiv:2505.09388</em> (2025).</p><p>[22] Wortsman, Mitchell, et al. &#8220;Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.&#8221; <em>International conference on machine learning</em>. PMLR, 2022.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Another common choice for the distributed training of LLMs is the <a href="https://arxiv.org/abs/1910.02054">zero redundancy optimizer (ZeRO)</a>, which is usually accessed via the <a href="https://www.deepspeed.ai/getting-started/">deepspeed</a> package. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Here, &#8220;sharding&#8221; means that we split the data evenly among the GPUs that we have available. For example, if we have an eight-GPU node and want to store 16 parameters in a sharded manner, we would store two parameters on each GPU. Sharding reduces per-GPU memory consumption to <code>1 / N</code>, where <code>N</code> is the number of GPUs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here, we call the architecture dense to clarify that it does not use a sparse architecture variant like a <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts (MoE)</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The technique used to compute SNR of each benchmark is explained <a href="https://arxiv.org/abs/2508.13144">here</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Authors in [1] use <a href="https://en.wikipedia.org/wiki/Ward%27s_method">Ward&#8217;s variance-minimization</a>, which iteratively refines task clusters to minimize the variance of evaluation scores between benchmarks in a cluster. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://skeptric.com/perplexity/">Bits-per-byte</a> and <a href="https://huggingface.co/docs/transformers/perplexity">perplexity</a> are common information-theoretic metrics used to measure the performance of pretrained language models. Both of these metrics capture the predictive quality of the model&#8217;s next token distribution. These metrics are related in that they both measure the <a href="https://en.wikipedia.org/wiki/Cross-entropy">cross-entropy</a> of the model&#8217;s next token distribution, but they are normalized differently.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Interestingly, this procedure naturally learns to up-weight STEM data, as well as favor python data within the StackEdu code mixture. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Given a target dataset, we begin with 5B tokens of web data and combine this with 5B tokens of the target dataset. We then anneal (i.e., train a model over the data as the learning rate is decayed to zero) over the combined 10B tokens and evaluate. As a baseline, we simply anneal over 10B tokens of web-only data. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>The context window refers to the total number of tokens that an LLM can process at a time. For example, a context window of 4K tokens means that the total length of the model&#8217;s input and output cannot exceed 4K, otherwise the model may perform poorly. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The long context extension phase of Olmo 3 trains on 100B tokens for the 32B model and 50B tokens for the 7B model. The exact same proportions of data are used for both the 50B and 100B mixtures.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>In particular, these noun phrases are identified in [1] using <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Math uses <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">Open Thoughts 3</a> and <a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2">Synthetic 2</a>. Coding uses <a href="https://huggingface.co/collections/TIGER-Lab/acecoder">AceCoder</a>, the code portion of the <a href="https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset">Llama Nemotron post-training dataset</a>, and <a href="https://huggingface.co/datasets/nvidia/OpenCodeReasoning">Open Code Reasoning</a>. Chat uses <a href="https://huggingface.co/datasets/allenai/WildChat">WildChat</a> (with a focus on the Tulu-3 [10] subset) and <a href="https://huggingface.co/OpenAssistant">Open Assistant</a>. Precise instruction following uses the same prompts from Tulu-3 with some additional verifiable constraints. There are also a few other datasets included in the SFT mix like <a href="https://huggingface.co/datasets/LipengCS/Table-GPT">TableGPT</a> for transforming data and <a href="https://huggingface.co/collections/CohereLabs/aya-datasets">Aya</a> for multilinguality.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>By &#8220;properties&#8221; of the target distribution, we usually mean some function of the target distribution (e.g., an expectation).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>An F1 score can be computed between two sequences of text by tokenizing each sequence and computing precision and recall based upon whether certain tokens appear in each sequence. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Specifically, the discussion section of <a href="https://arxiv.org/abs/2402.03300">DeepSeekMath</a> mentions that RL training primarily improves Maj@N capabilities, rather than Pass@N. In other words, the LLM may not learn to solve net new problems, but it becomes much more reliable at solving problems that were already within its scope. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Pass@N is an evaluation technique in which we generate <code>N</code> completions from an LLM and count the model as correct if at least one of these <code>N</code> completions is correct. Larger values of <code>N</code> give the LLM more &#8220;shots&#8221; at correctly solving an answer. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>This exact threshold (also called a length budget) is determined empirically via vibe tests in which researchers tested different values, examined performance metrics, and manually inspected the model&#8217;s resulting verbosity. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Group Relative Policy Optimization (GRPO)]]></title><description><![CDATA[How the algorithm that teaches LLMs to reason actually works...]]></description><link>https://cameronrwolfe.substack.com/p/grpo</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/grpo</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 24 Nov 2025 10:33:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f98b75b5-c615-4139-a045-ad9572f3cf9f_2008x1130.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2fQ6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2fQ6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 424w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 848w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2fQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png" width="1456" height="759" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:591511,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2fQ6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 424w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 848w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!2fQ6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77dd24fb-dbba-4d16-bd35-e26dfa2d0d5d_1999x1042.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 19])</figcaption></figure></div><p>Reinforcement learning (RL) has always played a pivotal role in research on large language models (LLMs), beginning with its use for aligning LLMs to human preferences. More recently, researchers have heavily focused on using RL training to improve LLM reasoning performance. This line of research has led to a rapid expansion of LLM capabilities over the last few years. The objective of RL training (e.g., alignment or reasoning) has changed over time, along with the RL optimizers that are used to achieve these goals. Most early work on RL for LLMs used Proximal Policy Optimization (PPO) as the default RL optimizer, but recent reasoning research relies upon Group Relative Policy Optimization (GRPO).</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to deeply understand AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This overview will provide a deep dive into GRPO, where it comes from, how it works, and the role it has played in creating better large reasoning models (LRMs). As we will learn, RL training&#8212;<em>even with GRPO</em>&#8212;is a complex process that presents a seemingly endless frontier of open research questions. However, GRPO is a refreshingly simple&#8212;<em>and effective</em>&#8212;algorithm that is more efficient and approachable than its predecessors. These characteristics allow GRPO to democratize RL research and, in turn, accelerate progress on both:</p><ol><li><p>Building a better collective understanding of RL for LLMs.</p></li><li><p>Training more powerful reasoning models.</p></li></ol><p><strong>Basics of RL.</strong> We will not discuss the basics of RL (e.g., terminology, problem setup, or policy gradients) in this overview. To gain a more comprehensive grasp of the foundational ideas in RL that are useful for understanding GRPO, please see the following excerpts from prior articles:</p><ul><li><p>RL Problem Setup &amp; Terminology [<a href="https://cameronrwolfe.substack.com/i/173306894/problem-setup-and-terminology-for-rl">link</a>]</p></li><li><p>Different RL Formulations for LLMs [<a href="https://cameronrwolfe.substack.com/i/173306894/markov-decision-process-mdp-versus-bandit-formulation">link</a>]</p></li><li><p>Policy Gradient Basics [<a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">link</a>]</p></li></ul><h2>Reinforcement Learning (RL) for LLMs</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJn6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [19])</figcaption></figure></div><p>To begin our discussion, we will cover some preliminary details on reasoning models and reinforcement learning (RL). Specifically, we will first discuss the two most common RL frameworks used for training LLMs (depicted above):</p><ol><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> trained on human preferences.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.</p></li></ol><p>After this discussion, we will provide further details on large reasoning models (LRMs), which are LLMs that have been extensively trained (via RLVR) to hone their complex reasoning capabilities. This discussion is relevant to GRPO, as it is currently the most common RL optimizer&#8212;<em>at least for open LLMs</em>&#8212;to use for training LRMs with RLVR. In fact, GRPO gained popularity primarily through its use in training  open reasoning models like DeepSeek-R1 [8]!</p><p><strong>General RL setup.</strong> The main difference between RLHF and RLVR lies in how we assign rewards&#8212;<em>RLHF uses a learned reward model, while RLVR uses verifiable (or rules-based) rewards</em>. Despite this difference, these are both online RL algorithms that follow a similar training framework; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">General framework for online RL</figcaption></figure></div><p>We first sample a batch of prompts and generate a completion&#8212;<em>or multiple completions</em>&#8212;for each prompt in the batch using our current policy. A reward is computed for each completion, which can then be used to derive a policy update using our RL optimizer of choice&#8212;<em>this is where GRPO comes in</em>! GRPO is a generic RL optimizer that is used to compute the policy update (i.e., the update to our LLM&#8217;s weights) during RL training. GRPO is usually used for RLVR, while PPO is usually used for RLHF. However, RL optimizers are generic, and technically any RL optimizer can be used to derive the policy update in these frameworks.</p><h4>Reinforcement Learning from Human Feedback (RLHF)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dtl3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" width="1456" height="887" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:887,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [16])</figcaption></figure></div><p>The first form of RL training to be popularized in the LLM domain was Reinforcement Learning from Human Feedback (RLHF). Early post-ChatGPT LLMs were almost always post-trained using the following three-step alignment procedure (depicted above), as proposed by <a href="https://cameronrwolfe.substack.com/i/175107358/training-language-models-to-follow-instructions-with-human-feedback">InstructGPT</a> [16]:</p><ol><li><p><a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">Supervised finetuning (SFT)</a>&#8212;<em>a.k.a. instruction finetuning (IFT)</em>&#8212;trains the model using <a href="https://cameronrwolfe.substack.com/p/language-model-training-and-inference">next-token prediction</a> over examples of good completions.</p></li><li><p>A reward model is trained over a <a href="https://rlhfbook.com/c/05-preferences.html">human preference dataset</a>.</p></li><li><p>Reinforcement learning (RL)&#8212;<em>usually with PPO</em>&#8212;is used to finetune the LLM with the reward model as the reward signal.</p></li></ol><p>The second and third steps of this procedure are collectively referred to as RLHF. This framework actually involves two training procedures: <em>a supervised learning phase for the reward model and an RL training phase for the LLM</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rKGp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" width="268" height="491.18688524590164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1118,&quot;width&quot;:610,&quot;resizeWidth&quot;:268,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [17])</figcaption></figure></div><p><strong>Preference data</strong> is the foundation of RLHF. Each element of a preference dataset consists of a prompt, two completions to that prompt, and a preference label&#8212;<em>assigned either by human or an <a href="https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from">AI or LLM judge</a></em>&#8212;indicating which completion is preferred to the other. Specifying an explicit reward for an LLM is very difficult&#8212;<em>how do we reliably determine whether a completion is &#8220;good&#8221; or not when the model has so many diverse capabilities?</em> Instead of answering this question directly, we can instead collect preference data, which captures preferred model behavior via examples of ranked model responses for a particular prompt. A typical interface for collecting preference annotations can be seen in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JBCh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JBCh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 424w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 848w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png" width="1456" height="1096" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JBCh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 424w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 848w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>Choosing the better model response is relatively intuitive, though it does require detailed guidelines on alignment criteria to ensure data quality. Preference data is used extensively in LLM post-training because:</p><ol><li><p>We can use it to train our model to produce human-preferable responses.</p></li><li><p>We just have to select a preferred response (rather than define an explicit reward signal or manually write responses from scratch). </p></li></ol><p>After collecting sufficient preference data, we have many examples of preferred model behavior that can be used to align our LLM to human (or AI-generated) preferences. We can directly train an LLM on this preference data using a direct alignment algorithm like <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">Direct Preference Optimization (DPO)</a>, but we usually incorporate this data into RL by first using it to train a reward model. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M_zU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" width="1456" height="755" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Reward model architecture</figcaption></figure></div><p><strong>Reward models.</strong> A reward model is a specialized LLM&#8212;<em>usually a copy of the LLM we are training with an added regression head (depicted above)&#8212;</em>that is finetuned to predict a human preference score given a prompt and candidate completion as input. Specifically, the reward model is finetuned on our preference data using a ranking loss function that is derived from the <a href="https://cameronrwolfe.substack.com/i/166169560/the-bradley-terry-model-of-preference">Bradley-Terry model</a>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iPQn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" width="617" height="200.88372093023256" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1204,&quot;resizeWidth&quot;:617,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Reward model loss function</figcaption></figure></div><p>Put simply, this loss function teaches the reward model to assign a higher score to the preferred response in a preference pair relative to the rejected response. The reward model is trained over paired preference data, but we see above that the model outputs an individual preference score for each completion in the pair. More details on reward models can be found in the overview below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;887863b0-96eb-4d89-bc0c-a510f2df549a&quot;,&quot;caption&quot;:&quot;Reward models (RMs) are a cornerstone of large language model (LLM) research, enabling significant advancements by incorporating human preferences into the training process. Despite their critical role, RMs are often overlooked. Practical guidance on how to train and use them effectively remains scarce&#8212;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Reward Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-06-30T09:33:16.285Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f2dc466-5918-4e2d-9698-c2626e71089f_1988x1116.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/reward-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:166169560,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:118,&quot;comment_count&quot;:13,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UkPk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UkPk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 424w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 848w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1272w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png" width="1456" height="469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UkPk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 424w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 848w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1272w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Input and output structure of a reward model</figcaption></figure></div><p><strong>PPO &amp; RLHF.</strong> Once the reward model has been trained over the preference data using this loss, the model learns how to assign a preference score to each model completion; see above. We can directly use this reward model as a reward signal for RL training. For RLHF, we usually use <a href="https://cameronrwolfe.substack.com/p/ppo-llm">Proximal Policy Optimization (PPO)</a> [12], which we will cover later in more detail, as the underlying RL optimizer.</p><blockquote><p><em>&#8220;Reward models broadly have been used extensively in reinforcement learning research as a proxy for environment rewards.&#8221;</em> - <a href="https://rlhfbook.com/c/07-reward-models.html">RLHF book</a></p></blockquote><p>Our LLM is indirectly trained on human feedback via the reward model. We begin with a preference dataset, which captures human preference via concrete examples of ranked model outputs. This data is used to train a reward model that can assign accurate preference scores to arbitrary outputs from the LLM. During training with RL, we generate new outputs&#8212;<em>or <a href="https://cameronrwolfe.substack.com/p/online-rl">on-policy samples</a></em>&#8212;from our LLM and score them with the reward model. These scores serve as the reward signal, and our RL optimizer updates the model&#8217;s weights to maximize rewards. Since the reward here is the output of our reward model, <em>we are maximizing preference scores</em>. In this way, the RL training process guides the LLM to produce outputs that align with human preferences, as estimated by the reward model.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!brUZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!brUZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 424w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 848w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 1272w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!brUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png" width="537" height="234.19986263736263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/491ce94f-790a-4c17-81af-6def25473758_1708x745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:1456,&quot;resizeWidth&quot;:537,&quot;bytes&quot;:210573,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!brUZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 424w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 848w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 1272w, https://substackcdn.com/image/fetch/$s_!brUZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491ce94f-790a-4c17-81af-6def25473758_1708x745.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Schematic depiction of RLHF (from [19])</figcaption></figure></div><p><strong>Impact of RLHF.</strong> The ability to align an LLM to human preferences is a hugely impactful technology that catalyzed the popular use of LLMs. If we think about the differences between well-known LLMs like ChatGPT and their less widely-recognized predecessors, one of the key enhancements made to ChatGPT was the use of more sophisticated post-training. Specifically, ChatGPT was extensively aligned via SFT and RLHF, which significantly improved the model&#8217;s helpfulness. In this way, RL research&#8212;<em>and RLHF in particular</em>&#8212;played a pivotal role in creating the impressive and capable LLMs that we have today.</p><h4>Reinforcement Learning from Verifiable Rewards (RLVR)</h4><p>The reward in RLHF is derived from a reward model. This reward model requires its own training pipeline and validation, which adds costs and complexity to the RL training process. Our policy could also suffer from reward hacking, even when using a high-quality reward model. The policy explores the space of possible completions during RL to maximize rewards. If we continue running RL for long enough, however, the model may learn to maximize rewards via an exploit or hack in our reward model, rather than by generating better completions.</p><blockquote><p><em>&#8220;Reinforcement Learning with Verifiable Rewards (RLVR) can be seen as a simplified form of&#8230; RL with execution feedback, in which we simply use answer matching or constraint verification as a binary signal to train the model.&#8221; </em>- from [13]</p></blockquote><p>Put simply, reward models&#8212;<em>despite their incredible impact through RLHF</em>&#8212;have downsides. Reinforcement Learning from Verifiable Rewards (RLVR) chooses to avoid reward models, instead deriving rewards from manually verifiable and deterministic sources (e.g., rules or heuristics). Using verifiable rewards instead of neural reward models reduces the risk of reward hacking and makes extensive, large-scale RL training more feasible by making rewards harder to game.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GSkD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GSkD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 424w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 848w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 1272w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GSkD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png" width="483" height="257.03054662379424" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1244,&quot;resizeWidth&quot;:483,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-05-15 at 1.04.56&#8239;PM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-05-15 at 1.04.56&#8239;PM.png" srcset="https://substackcdn.com/image/fetch/$s_!GSkD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 424w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 848w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 1272w, https://substackcdn.com/image/fetch/$s_!GSkD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a0a064f-0d82-4beb-9267-b37059b658eb_1244x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Schematic depiction of RLVR (from [19])</figcaption></figure></div><p><strong>Verifiable domains and rewards.</strong> To train an LLM with RLVR, we must select a domain that is verifiable in nature; e.g., math or coding. In other words, we need to create a dataset that has either <em>i)</em> a known ground truth answer or <em>ii)</em> some rule-based technique that can be used to verify the correctness of an answer for each prompt in our dataset. For coding, we can create a sandbox for running LLM-generated code and use test cases to assess correctness. Similarly, we can evaluate math problems by performing basic string matching between the answer predicted by the LLM and a ground-truth answer for a problem; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="1456" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Verifying a problem with exact string matching</figcaption></figure></div><p>Usually, we must instruct the LLM to format its output such that the final answer can be easily parsed. Even then, however, string matching is not always sufficient for evaluating correctness. In many cases, we can benefit from crafting validation logic that is more robust (e.g., asking an LLM to tell us if two answers are the same [20]) and that captures variations in format for similar or identical outputs. </p><blockquote><p><em>&#8220;Math verification is determined by an LLM judge given the ground truth solution and DeepSeek-R1 solution attempt. We found that using an LLM judge instead of a stricter parsing engine (Math-Verify) for verification during data generation results in a higher yield and leads to higher performing downstream models.&#8221;</em> - from [20]</p></blockquote><p><strong>Applications of RLVR.</strong> Beyond substituting a reward model with verifiable rewards, the RL component of RLVR is unchanged. However, RLHF and RLVR differ in their purpose and application:</p><ol><li><p>RLHF is usually implemented with PPO as the underlying RL optimizer, while GRPO is the most common RL optimizer for RLVR.</p></li><li><p>RLHF focuses on LLM alignment with preference feedback, while RLVR is used to improve the complex reasoning capabilities of an LLM.</p></li></ol><p>Most recent research on LLMs and RL is heavily focused on creating LLMs with better reasoning capabilities, known as large reasoning models (LRMs). The training process for LRMs is centered around performing RLVR on domains like math and coding. In these training setups, GRPO is the most commonly used RL optimizer&#8212;<em>at least for open LLMs.</em> As we will see in this overview, several <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">notable results</a> have already been achieved from using RLVR (with GRPO) to train LRMs. However, this area of research is still incredibly active and dynamic. Examples of popular topics being explored in this area include:</p><ul><li><p><a href="https://www.interconnects.ai/p/papers-im-reading-base-model-rl-grpo">Tweaking or improving GRPO</a></p></li><li><p><a href="https://arxiv.org/abs/2510.13786">Scaling the RLVR training process</a></p></li><li><p><a href="https://arxiv.org/abs/2508.12790">Expanding to non-verifiable domains via rubrics</a></p></li><li><p><a href="https://arxiv.org/abs/2501.12599">Using curriculum learning to improve RLVR</a></p></li><li><p><a href="https://cameronrwolfe.substack.com/p/online-rl?open=false#%C2%A7bridging-offline-and-online-reinforcement-learning-for-llms">Combining verifiable and non-verifiable rewards</a></p></li></ul><h4>Large Reasoning Models (LRMs)</h4><p>As mentioned before, RLVR and GRPO can be used to improve the reasoning capabilities of LLMs on verifiable tasks, and research on this topic has led to the creation of large reasoning models (LRMs). The key distinction between an LRM and a standard LLM is the ability to dynamically &#8220;think&#8221; about a prompt prior to providing a final output. By increasing the length of the thinking process, these LRMs can use <a href="https://cameronrwolfe.substack.com/i/152758713/reasoning-models-and-new-scaling-paradigms">inference-time scaling</a>&#8212;<em>or simply spend more compute on generating a completion&#8212;</em>to improve their performance. </p><blockquote><p><em>&#8220;We&#8217;ve developed a new series of AI models designed to spend more time thinking before they respond.&#8221;</em> - from [4]</p></blockquote><p>One of the first such models to be released was OpenAI&#8217;s <a href="https://openai.com/index/introducing-openai-o1-preview/">o1-preview</a>, which was predated by a <a href="https://www.reuters.com/technology/artificial-intelligence/openai-working-new-reasoning-technology-under-code-name-strawberry-2024-07-12/">long series of rumors</a> about OpenAI developing a new series of LLMs with complex reasoning capabilities. This model has since been followed by a massive number of new closed (e.g., <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o3 / o4</a> or <a href="https://deepmind.google/models/gemini/pro/">Gemini 3</a>) and open (<a href="https://arxiv.org/abs/2505.09388">Qwen-3</a>, <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a>, and <a href="https://allenai.org/blog/olmo3">Olmo-3</a>) LRMs as the research community continues to iterate on these ideas. Interestingly, the popularization of LRMs has also led to a proliferation of open models&#8212;<em>mostly proposed after DeepSeek-R1 [8], which we will discuss later on</em>. Recent open LRM releases like Kimi-K2 [14] have even started to match or exceed the performance of closed models; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2rPM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2rPM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 424w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 848w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2rPM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png" width="1456" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:535552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2rPM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 424w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 848w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 1272w, https://substackcdn.com/image/fetch/$s_!2rPM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17d61a3a-a074-473b-b40d-a04fc6578623_2526x1400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [14])</figcaption></figure></div><p><strong>How do LRMs work?</strong> LRMs and LLMs are identical architecturally<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. They are both based upon <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformers</a>, potentially with a <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts (MoE) architecture</a>. Their main difference lies in how they generate output. At a high level, LRMs operate by allowing the model to &#8220;think&#8221; prior to producing a final output. This thinking process occurs in the form of a long, free-text chain-of-thought (CoT)&#8212;<em>also called a rationale or reasoning trajectory</em>&#8212;that is generated by the LLM. Most closed LRMs hide this reasoning trajectory from the end-user for safety purposes<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The user sees only the model&#8217;s final output and (optionally) a truncated summary of the reasoning process. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJH6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" width="482" height="287.34615384615387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>For open LRMs, we can observe the model&#8217;s reasoning process and final output. Concretely, LRMs use special tokens to separate their reasoning process from their actual output. The reasoning trajectory is generated first and is wrapped between <code>&lt;think&gt;</code> tokens. The model ends its reasoning process with a <code>&lt;/think&gt;</code> token, then proceeds to generate a final response; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Way8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" width="1456" height="1034" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326292,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Concrete example of LRM output in Qwen-3 prompt format</figcaption></figure></div><p><strong>Reasoning trajectories.</strong> If we look at <a href="https://openai.com/index/learning-to-reason-with-llms/">some examples</a> of reasoning trajectories from open or closed LRMs, we will notice that these models exhibit sophisticated reasoning behaviors in their long CoT:</p><ul><li><p>Thinking through each part of a complex problem.</p></li><li><p>Decomposing complex problems into smaller, solvable parts.</p></li><li><p>Critiquing solutions and finding errors.</p></li><li><p>Exploring many alternative solutions.</p></li></ul><p>In many ways, the model is performing a complex, text-based search process to find a viable solution to a prompt. Such behavior goes beyond any previously-observed behavior with standard LLMs and <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain of thought prompting</a>. With this in mind, we might begin to wonder: <em>How does the model learn how to do this?</em></p><p><strong>LRM training.</strong> LRMs also differ from standard LLMs in their training methodology. Though exact post-training details may vary significantly between models, both LLMs and LRMs undergo similar pretraining and alignment phases that consist of <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> and RLHF.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1jdx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1jdx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 424w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 848w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 1272w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1jdx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png" width="1456" height="266" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:266,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1jdx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 424w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 848w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 1272w, https://substackcdn.com/image/fetch/$s_!1jdx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ca6cd8f-8e21-43be-b565-4b8465fdb4d0_2382x435.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>However, LRMs extend this standard training process by performing large-scale RLVR on verifiable domains like math and code. Because verifiable reward signals are less prone to <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a>, we can perform larger-scale RL training (i.e., by running the training process longer) with less risk of training collapse. Several works [8, 9] have shown that LRMs obey a predictable scaling law with respect to the amount of compute used during RL training, <em>meaning that we can achieve better performance by increasing the number of RL training steps</em>.</p><blockquote><p><em>&#8220;We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process.&#8221;</em> - from [8]</p></blockquote><p>The complex reasoning behaviors of an LRM are not directly encoded into the model in any way. Rather, this behavior naturally emerges from large-scale RL training. The LRM undergoes an RL-powered self-evolution as it attempts to solve problems and is rewarded for finding correct solutions. From this process, the model learns to properly leverage its reasoning trajectory. We will continue discussing the details of RL training for LRMs throughout the remainder of this post, but the key idea here is to:</p><ul><li><p>Create the correct incentives for RL training&#8212;<em>usually a deterministic or rule-based reward signal that is at low risk for reward hacking</em>. </p></li><li><p>Run large-scale RL training with these reliable reward signals.</p></li><li><p>Allow sophisticated model behavior to naturally emerge.</p></li></ul><p>Powerful LRMs are a product of large-scale RL with the correct incentives, but there are many practical details involved in properly incentivizing and scaling the RL training process&#8212;<em>this is still a very active area of research [15]</em>. </p><p><strong>Are LRMs a silver bullet?</strong> Given the impressive performance of LRMs in complex reasoning domains, we might naively believe that LRMs will outperform standard LLMs at all tasks. However, the story is not this simple&#8212;<em>LRMs are not always the best tool to use</em>. Because the training process for LRMs is focused on verifiable domains like math and code, their performance may be biased towards these domains&#8212;<em>and away from non-verifiable domains like creative writing</em>. </p><div class="pullquote"><p>&#8220;Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to overthinking.&#8221; - <a href="https://magazine.sebastianraschka.com/p/understanding-reasoning-llms">Sebastian Raschka</a> </p></div><p>LRMs may also have deficiencies in alignment (e.g., instruction following or reading-friendly formatting) relative to standard LLMs. However, <em><strong>most of these issues are being solved as we continue to study the interplay between RLHF and RLVR.</strong></em> We should use LRMs for the domains in which they excel but be sure to test their performance in non-verifiable domains. Using a standard LLM may be sufficient&#8212;<em>or better</em>&#8212;and is usually more efficient in terms of inference-time compute. </p><h2>GRPO from Idea to Implementation</h2><p>Now that we understand how RL is used to train LLMs (and LRMs), we will take a deeper look at common RL optimizers used to derive policy updates for RLHF and RLVR. To begin, we will learn about Proximal Policy Optimization (PPO) [12] before moving on to the main topic of this overview&#8212;<em>Group Relative Policy Optimization (GRPO) [1]</em>. GRPO is inspired by PPO and shares some of its core ideas. However, GRPO also goes beyond PPO by making several changes to simplify the algorithm while maintaining effectiveness for LLM training.</p><h4><a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization (PPO)</a> [12]</h4><p>GRPO is heavily based upon the Proximal Policy Optimization (PPO) algorithm [12]. PPO was used in <a href="https://cameronrwolfe.substack.com/i/175107358/learning-to-summarize-from-human-feedback">seminal work on RLHF</a> and, as a result, became the default RL optimizer in the LLM domain for some time. Only recently with the advent of LRMs have alternative algorithms like GRPO started to become popular.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1nc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" width="652" height="226.1401098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:652,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>The structure of PPO is outlined above. As we can see, each training iteration of PPO performs the following sequence of steps:</p><ol><li><p>Sample a diverse batch of prompts.</p></li><li><p>Generate a completion from the policy for each prompt.</p></li><li><p>Compute advantage estimates for each completion.</p></li><li><p>Perform several policy updates over this sampled data.</p></li></ol><p><strong>Surrogate objective.</strong> During PPO, we formulate a surrogate objective<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> that is optimized with respect to the parameters of our policy. The PPO surrogate objective is based upon the policy ratio between the current policy and an old model (i.e., the policy as it existed before the first update in a training step). The policy ratio&#8212;<em>also called the importance ratio</em>&#8212;stabilizes the training process by comparing the new policy&#8217;s token probabilities to the old policy and applying a weight (or importance) to training that helps to avoid drastic changes; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXsZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" width="554" height="219.92582417582418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:554,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Policy or importance ratio</figcaption></figure></div><p>To derive the surrogate objective for PPO, we begin with an unclipped objective that resembles the surrogate objective used in <a href="https://cameronrwolfe.substack.com/i/175107358/trust-region-policy-optimization-trpo">Trust Region Policy Optimization (TRPO)</a>; see below. Additionally, we introduce a clipped version of this objective by applying a clipping mechanism to the policy ratio <code>r_t(&#952;)</code>. Clipping forces the policy ratio to fall in the range <code>[1 - &#949;, 1 + &#949;]</code>. In other words, we avoid the policy ratio becoming too large or too small, ensuring that the token probabilities produced by the current and old policies remain relatively similar.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oHJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" width="1456" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121736,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The PPO surrogate objective</figcaption></figure></div><p>In PPO, the surrogate objective is simply the minimum of clipped and unclipped objectives, which makes it a pessimistic (lower bound) estimate for the unclipped objective. The behavior of the surrogate loss&#8217; clipping mechanism changes depending on the sign of the advantage. The possible cases are shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ovlv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>As we can see, taking the minimum of clipped and unclipped terms in the surrogate objective causes clipping to be applied in only one direction. The surrogate objective can be arbitrarily <em>decreased</em> by moving the policy ratio away from one, but clipping prevents the objective from being <em>increased</em> beyond a certain point by limiting the policy ratio. In this way, the clipping mechanism of PPO disincentivizes large policy ratios and, in turn, maintains a trust region by preventing large policy updates that could potentially damage our policy. </p><blockquote><p><em>&#8220;We only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.&#8221;</em> - from [12]</p></blockquote><p><strong>KL divergence.</strong> When training LLMs with PPO, we usually incorporate the KL divergence between the current policy and a reference policy&#8212;<em>like the SFT model</em>&#8212;into training. The KL divergence serves as a penalty that encourages similarity between the current and reference policies. We compute the KL divergence by comparing token distributions from the two LLMs for each token in a sequence. The easiest&#8212;<em>and most common</em>&#8212;way to approximate KL divergence [7] is via the difference in log probabilities between the policy and reference; see <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">here</a>.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\texttt{kl_div} = \\texttt{policy_logprobs} - \\texttt{ref_logprobs}&quot;,&quot;id&quot;:&quot;EARHMKGVSJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>After the KL divergence has been computed, there are two primary ways that it can be incorporated into the RL training process:</p><ol><li><p>By directly subtracting the KL divergence from the reward.</p></li><li><p>By adding the KL divergence to the loss function as a penalty term.</p></li></ol><p>PPO adopts the former option by subtracting the KL divergence directly from the reward signal used in RL training as shown in the equation below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MMrI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" width="587" height="122.9635989010989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:305,&quot;width&quot;:1456,&quot;resizeWidth&quot;:587,&quot;bytes&quot;:188292,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Adding KL to the reward in PPO</figcaption></figure></div><p><strong>Advantage estimation.</strong> The <a href="https://cameronrwolfe.substack.com/p/ppo-llm?open=false#%C2%A7problem-setup-and-terminology">advantage function</a>, a key part of PPO&#8217;s surrogate objective, is the difference between the action-value and value function: <code>A(s, a) = Q(s, a) - V(s)</code>. The value function in PPO is estimated with a learned model called the value model or critic. This critic is usually a separate copy of our policy, or&#8212;<em>for better parameter efficiency</em>&#8212;an added value head that shares weights with the policy. The critic takes a completion as input and predicts expected cumulative reward on a per-token basis by using an architecture that is similar to that of a reward model (i.e., a transformer with a regression head)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fXOv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" width="1456" height="479" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The value function is also on-policy, meaning it depends on the current parameters of our policy. Unlike <a href="https://cameronrwolfe.substack.com/p/reward-models">reward models</a>, which are fixed at the beginning of RL training, the critic is trained alongside the LLM in each policy update to ensure its predictions remain on-policy. <em>This is known as an actor-critic setup</em>. To handle this, we can add an extra <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean-squared error (MSE) loss term</a>&#8212;<em>between the rewards predicted by the critic and actual rewards</em>&#8212;to the surrogate loss for PPO.</p><p>The critic can be used to compute the advantage via Generalized Advantage Estimation (GAE) [13]. The details of GAE are beyond the scope of this post. We will only cover GAE at a high level, but a full explanation can be found <a href="https://cameronrwolfe.substack.com/i/175107358/generalized-advantage-estimation-gae">here</a>. GAE builds upon the concept of a temporal difference (TD) residual; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A4K-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A4K-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 424w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 848w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1272w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png" width="440" height="228.76373626373626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A4K-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 424w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 848w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1272w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The TD residual</figcaption></figure></div><p>The TD residual uses per-token value predictions from the critic to form a one-step estimate of the advantage. Put simply, the TD residual is analyzing how much the reward changes after predicting a single token relative to the expected reward. However, the TD residual only uses a small amount of actual reward information (i.e., the reward at step <code>t</code>) to estimate the advantage, which causes the estimate to become biased<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. To solve this issue, we can generalize the single-step TD residual to form a series of <code>N</code>-step advantage estimators; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_U8s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_U8s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 424w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 848w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1272w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png" width="1456" height="454" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_U8s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 424w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 848w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1272w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><code>N</code>-step advantage estimators</figcaption></figure></div><p>Similarly to the single-step TD residual, advantage estimators with lower values of <code>N</code> have low variance but high bias. As we increase the value of <code>N</code>, however, we are incorporating more exact reward information into the advantage estimate, thus lowering the bias (and, in turn, increasing variance). GAE tries to find a balance between these two ends of the spectrum by <em>i)</em> using all values of <code>N</code> and ii) taking an average of these advantage estimates. This is accomplished with the mixing parameter <code>&#955;</code> for GAE, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v3wn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v3wn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 424w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 848w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1272w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png" width="1456" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v3wn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 424w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 848w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1272w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GAE formulation</figcaption></figure></div><p>The value of <code>&#955; &#8712; [0, 1]</code> controls the bias variance tradeoff. We can toggle the value of <code>&#955;</code> in GAE as needed to stabilize the training process<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. For example, if training is unstable, we can decrease <code>&#955;</code> to yield lower variance policy updates. </p><p><strong>Complexity of PPO.</strong> As we might infer from the above discussion, PPO is not a simple algorithm&#8212;<em>there are many more details to be learned.</em> For a more complete overview of PPO, please see the article linked below. However, we need to briefly discuss the key limitations of PPO to serve as motivation for GRPO. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7b12c322-ffdb-455c-94ec-99b740271d97&quot;,&quot;caption&quot;:&quot;PPO is poorly understood outside of top research labs for good reason. Not only is PPO complicated, but its high compute and memory overhead make experimentation difficult. Successfully using PPO requires both algorithmic knowledge and practical experience. This overview builds upon basic concepts in RL to develop a detailed understanding of PPO.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;PPO for LLMs: A Guide for Normal People&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-10-27T09:33:23.171Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61f107c1-95cb-4438-84b9-8d87c9cdc04f_2502x1408.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/ppo-llm&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:175107358,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:100,&quot;comment_count&quot;:4,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>There are a total of four models included in PPO&#8217;s training process: two that are being trained (i.e., the policy and the critic) and two that are used for inference (i.e., the reference and reward model). The fact that the critic must be trained in tandem with the policy complicates the training process, increases compute costs, and consumes a lot of memory. Plus, there are many additional nuances and settings that must be carefully tuned to arrive at a working PPO implementation (e.g., GAE, value model setup, reward model setup, clipping, and more).</p><blockquote><p><em>&#8220;During RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. In the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.&#8221;</em> - from [1]</p></blockquote><p><strong>Can we simplify PPO?</strong> Much of the complexity of PPO&#8212;<em>though not all!</em>&#8212;stems from estimating the per-token value function with the critic. Recent work has questioned the need for this critic, arguing that critic-free RL algorithms like <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a> can be used instead of PPO to train LLMs with no performance degradation. This argument stems from a few key observations:</p><ul><li><p>Avoiding high-variance policy updates&#8212;<em>which is the key benefit of PPO and a limitation of simpler RL optimizers like REINFORCE</em>&#8212;is less of a concern for LLMs because we are finetuning models that are extensively pretrained.</p></li><li><p>LLMs are mostly trained using outcome rewards, which makes estimating advantage on a per-token basis unnecessary. <em>How can we learn an accurate per-token value estimate from outcome rewards only?</em> Modeling the advantage and reward on a completion level should be sufficient for LLMs in this case. </p></li></ul><p>GRPO provides further empirical support for these claims in the LLM domain. Specifically, GRPO forgoes the critic and estimates advantage by averaging rewards for multiple completions to the same prompt. Each token in GRPO receives the same advantage estimate, rather than attempting to assign credit on a per-token basis from a sequence-level (outcome) reward signal. </p><h4>Group Relative Policy Optimization (GRPO)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dzfC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png" width="1456" height="701" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:420310,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dzfC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 424w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 848w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!dzfC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d12056e-a139-4bd9-bb4b-00fee858ad9c_2718x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Group Relative Policy Optimization (GRPO) [1] builds upon PPO by proposing a simpler technique for estimating the advantage. In particular, GRPO estimates the advantage by sampling multiple completions&#8212;<em>or a &#8220;group&#8221; of completions</em>&#8212;for each prompt and using the rewards of these completions to form a <a href="https://cameronrwolfe.substack.com/i/175107358/policy-gradient-basics">baseline</a>. This group-derived baseline replaces the value function, which allows GRPO to forgo training a critic. Avoiding the critic drastically reduces GRPO&#8217;s memory consumption and training complexity compared to PPO.</p><blockquote><p><em>&#8220;We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.&#8221;</em> - from [1]</p></blockquote><p><strong>Advantage estimation in GRPO.</strong> Instead of using a learned value model, GRPO estimates the advantage by<em> </em>sampling multiple completions for each prompt in the batch and using the formulation shown below to compute the advantage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nguf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211136,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nguf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 424w, https://substackcdn.com/image/fetch/$s_!nguf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 848w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1272w, https://substackcdn.com/image/fetch/$s_!nguf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97afd2cb-5a22-4990-a470-4f5bdebb8a53_2124x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Advantage computation in GRPO</figcaption></figure></div><p>In GRPO, completions to the same prompt form a group, and we calculate the advantage relative to other rewards observed in the group&#8212;<em>hence, the name &#8220;group relative&#8221; policy optimization</em>! More specifically, the advantage for completion <code>i</code> is calculated by first subtracting the mean reward over the group from <code>r_i</code>, then dividing this difference by the standard deviation of rewards over the group. We are still assuming an <a href="https://cameronrwolfe.substack.com/i/173306894/markov-decision-process-mdp-versus-bandit-formulation">MDP formulation</a> in this discussion, but the formulation above assigns the same advantage to every token <code>t</code> in the sequence <code>i</code>.</p><blockquote><p><em>&#8220;GRPO is often run with a far higher number of samples per prompt because the advantage is entirely about the relative value of a completion to its peers from that prompt.&#8221;</em> - <a href="http://RLHF book">RLHF book</a></p></blockquote><p>Because we compute the advantage in a relative manner (i.e., based on rewards in the group), the number of completions we sample per prompt must be high to obtain a stable policy gradient estimate. Unlike GRPO, <a href="https://cameronrwolfe.substack.com/p/ppo-llm">PPO</a> and <a href="https://cameronrwolfe.substack.com/i/173306894/reward-increment-nonnegative-factor-x-offset-reinforcement-x-characteristic-eligibility-reinforce">REINFORCE</a> typically sample a single completion per prompt. However, sampling multiple completions per prompt has been explored by prior RL optimizers like <a href="https://cameronrwolfe.substack.com/i/173306894/reinforce-leave-one-out-rloo">RLOO</a>.</p><p><strong>Surrogate loss.</strong> Despite estimating the advantage differently, GRPO uses a surrogate loss that is nearly identical to that of PPO. Both of these optimizers make use of the same clipping mechanism for the policy ratio; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6kXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6kXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 424w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 848w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1272w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png" width="1456" height="449" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6kXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 424w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 848w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1272w, https://substackcdn.com/image/fetch/$s_!6kXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ce0fb53-a64d-4225-9f84-acf312c16c06_2475x763.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GRPO surrogate loss</figcaption></figure></div><p>This expression assumes an <a href="http://MDP formulation">MDP formulation</a> and has been modified to explicitly aggregate the loss over multiple completions within a group. In contrast, we previously formulated the loss for PPO as an expectation over completions.</p><p>One key difference between PPO and GRPO is the <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a> term being subtracted as a penalty term from the surrogate loss rather than incorporated into the per-token reward. Additionally, GRPO does not always perform multiple policy updates per batch of data. If we only perform a single policy update per batch, we have <code>&#960;_&#952;</code> <code>=</code> <code>&#960;_old</code>, which simplifies the clipped objective to the expression shown below<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. See <a href="https://github.com/huggingface/trl/issues/2608">here</a> for more discussion on this topic. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DdaK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DdaK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 424w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 848w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 1272w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DdaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png" width="628" height="220.83516483516485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:1456,&quot;resizeWidth&quot;:628,&quot;bytes&quot;:209879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DdaK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 424w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 848w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 1272w, https://substackcdn.com/image/fetch/$s_!DdaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F113e9565-8a15-4077-9edb-eaec05f196f9_2131x750.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Simplification of the clipping term with a single update</figcaption></figure></div><p><strong>Extension to process rewards.</strong> Most implementations of GRPO use outcome rewards, as this is the most common setting for an LLM. However, we can extend GRPO to handle <a href="https://cameronrwolfe.substack.com/i/166169560/different-types-of-rms">process rewards</a> (e.g., after each reasoning step) by:</p><ol><li><p>Normalizing rewards based on the mean and standard deviation of all process rewards observed in the group.</p></li><li><p>Computing the advantage of each token as the sum of normalized rewards for following steps in the reasoning trajectory.</p></li></ol><p>When using outcome rewards, each token is assigned the same advantage by GRPO, but this approach changes when using process rewards. The advantage is estimated for each token based on rewards observed in following steps of the trajectory, which changes depending on the position of a token. Additionally, we must now consider all rewards&#8212;<em>including multiple rewards in each trajectory</em>&#8212;when computing the mean and standard deviation metrics for GRPO. </p><p><strong>Memory consumption.</strong> In PPO, we are training two models&#8212;<em>the policy and the critic</em>&#8212;in tandem. Additionally, we are running real-time inference for both the reward model and the reference policy, yielding a total of four models that must be managed. The need to train two models drastically increases the memory footprint of PPO. Assuming we use half precision (<code>bf16</code> or <code>fp16</code>), we can host an LLM using ~2GB of memory for every 1B model parameters; e.g., inference with <a href="https://huggingface.co/Qwen/Qwen3-32B">Qwen-3-32B</a> should require ~60-70GB of memory. Notably, this calculation only accounts for loading the model&#8217;s weights into GPU memory, and memory usage can vary quite a bit depending on the maximum context length being used<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Evsv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Evsv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 424w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 848w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Evsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141297,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Evsv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 424w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 848w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!Evsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d1c6b3-e657-4d5a-839d-61c02456a157_1910x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Illustration of memory consumption during training and inference</figcaption></figure></div><p>In contrast, training a model in half precision usually requires <a href="https://modal.com/blog/how-much-vram-need-fine-tuning">~16GB of memory per 1B model parameters</a>, which varies depending on the details of the training setup<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Similarly to inference, we load the model weights into GPU memory for training, but we must also store other training-related data (e.g., optimizer states and gradients). We also need enough GPU memory to store model activations during training, so memory consumption still increases with context length.</p><blockquote><p><em>&#8220;As LMs are scaled up, computing gradients for backpropagation requires a prohibitive amount of memory&#8212;in our test, up to 12&#215; the memory required for inference&#8212;because it needs to cache activations during the forward pass, gradients during the backward pass, and, in the case of Adam, store gradient history.&#8221;</em> - <a href="https://arxiv.org/abs/2305.17333">source</a></p></blockquote><p>With this in mind, the fact that GRPO does not use a critic not only saves on compute costs relative to PPO, but it drastically reduces memory consumption&#8212;<em>we are now training a single model instead of two models</em>. Eliminating a trainable model has a much larger impact on memory consumption compared to removing a model that is only used for inference (e.g., the reward model). </p><p><strong>GRPO &amp; reward models. </strong>GRPO became popular primarily in the context of LRM training with RLVR. For this reason, GRPO is mostly used in verifiable reward settings without a neural reward model. A common misconception about GRPO is that it eliminates the need for a reward model, <em>but GRPO can be used with or without a reward model</em>. In fact, the original GRPO paper used a reward model instead of verifiable rewards [1]! Removing the reward model is a benefit of verifiable rewards, not an intrinsic benefit of GRPO itself&#8212;<em>the primary advantage of GRPO is the elimination of the critic.</em></p><h4>Implementing GRPO</h4><p>To make this discussion more concrete, let&#8217;s implement the GRPO loss function in PyTorch pseudocode. This implementation is adapted from the <a href="https://rlhfbook.com/">RLHF book</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>, which has a <a href="https://rlhfbook.com/c/11-policy-gradients.html">fantastic explanation</a> of GRPO and other policy gradient algorithms.</p><p>In the code below, <code>B</code> is our batch size, <code>G</code> is the group size, and <code>L</code> is the context length or number of tokens in each completion. We present two options for approximating KL divergence, including a simple KL estimate (<code>kl_div</code>) that is commonly used for LLMs and a slightly more complex variant (<code>kl_div_alt</code>) that matches the approximation used in the original GRPO paper [1]. More details on why this particular KL divergence estimate is used will be provided later on. </p><pre><code><code>import torch
import torch.nn.functional as F

# constants
kl_beta = 0.1
eps = 0.2

# sample G completions for B prompts
# compute outcome reward for each completion
with torch.no_grad():
    completions = LLM.generate(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G)

# create a padding mask from lengths of completions in batch
completion_mask = &lt;... mask out padding tokens ...&gt;

# get policy logprobs for each action
llm_out = LLM(completions)
per_token_logps = F.log_softmax(llm_out, dim=-1)  # (B*G, L)

# get reference logprobs for each action
ref_out = REF(completions)
ref_per_token_logps = F.log_softmax(ref_out, dim=-1)  # (B*G, L)

# compute KL divergence between policy and reference policy
kl_div = per_token_logps - ref_per_token_logps

# alternative KL divergence used by DeepSeekMath [1]
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)
    - 1
)

# compute mean and std of grouped rewards
reward_mean = rewards.view(-1, G).mean(dim=1)  # (B,)
reward_std = rewards.view(-1, G).std(dim=1)  # (B,)

# compute advantage for GRPO
advantage = (rewards.view(-1, G) - reward_mean)
advantage /= (reward_std + 1e-8)  # (B, G)
advantage = advantage.view(-1, 1)  # (B*G, 1)

# compute the policy ratio
policy_ratio = torch.exp(
    per_token_logps - old_per_token_logps,
)  # (B*G, L)
clip_policy_ratio = torch.clamp(
    policy_ratio,
    min=1.0 - eps,
    max=1.0 + eps,
)

# compute clipped loss
loss = torch.min(
    advantage * policy_ratio,
    advantage * clip_policy_ratio,
)  # (B*G, L)

# kl divergence added as penalty term to loss
loss = -loss + kl_beta * kl_div

# aggregate the loss across tokens (many options exist here)
loss = ((loss * completion_mask).sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# perform policy gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()</code></code></pre><p>The implementation above relies upon <code>old_per_token_logps</code> to compute the policy ratio. The old policy refers to the initial policy parameters prior to any policy updates being performed for a batch of data. Before the first update for a batch, we must store these log probabilities so that they can be used for several subsequent policy updates over the same batch. The code above only outlines a single policy update, but if this were our first update over a batch of data we could simply set <code>old_per_token_logps = per_token_logps.detach()</code>. Then, we could re-run this code&#8212;<em>excluding the part that samples new completions and computes their rewards</em>&#8212;to perform several policy updates over the batch.</p><h2>Key Publications with GRPO</h2><p>We now understand the key ideas underlying GRPO, which are relatively simple compared to optimizers like PPO. Next, we will build upon this understanding by outlining a few key papers that demonstrate the practical application of GRPO. Specifically, we will review DeepSeekMath [1] and DeepSeek-R1 [8]. The former paper proposed the GRPO algorithm in the context of training specialized LLMs for solving math problems. This work was later extended by DeepSeek-R1, which used GRPO to train a state-of-the-art open LRM using RLVR. As we will see, this was the first open model to nearly match the performance of closed LRMs like OpenAI&#8217;s o1 [9], which led to a subsequent explosion of open LRM releases.</p><h4><a href="https://arxiv.org/abs/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a> [1]</h4><p>GRPO was proposed with the release of DeepSeekMath [1], a small and open language model for mathematical reasoning. DeepSeekMath uses a combination of <em>i)</em> continued pretraining on a high-quality, math-focused corpus and <em>ii)</em> further training with RL to surpass the performance of similar open-source LLMs&#8212;<em>and nearly match the performance of top proprietary models like GPT-4</em>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MWCt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MWCt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 424w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 848w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 1272w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MWCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png" width="1232" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128731,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MWCt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 424w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 848w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 1272w, https://substackcdn.com/image/fetch/$s_!MWCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F643fa130-301d-45a3-836f-8c4f5e927dd1_1232x606.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Despite its far-reaching impact, GRPO was first proposed in [1] specifically for training domain-specific LLMs. Authors cite simplicity and memory efficiency as key benefits of GRPO relative to PPO. Additionally, we see in [1] that further RL finetuning via GRPO boosts the mathematical reasoning capabilities of even strong models that have already undergone extensive instruction tuning.</p><blockquote><p><em>&#8220;Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes&#8230;. We successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content.&#8221;</em> - from [1]</p></blockquote><p><strong>The DeepSeekMath Corpus</strong> is a high-quality corpus of 120B math-focused tokens&#8212;<em>mined from <a href="https://commoncrawl.org/">CommonCrawl</a></em>&#8212;used for continued pretraining of DeepSeekMath models. The impressive performance of DeepSeekMath is partially attributed to the <em>&#8220;meticulously engineered data selection pipeline&#8221;</em> that produces this data. The high-level structure of this data selection pipeline is depicted below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j32u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j32u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 424w, https://substackcdn.com/image/fetch/$s_!j32u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 848w, https://substackcdn.com/image/fetch/$s_!j32u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 1272w, https://substackcdn.com/image/fetch/$s_!j32u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j32u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png" width="1456" height="633" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:628148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j32u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 424w, https://substackcdn.com/image/fetch/$s_!j32u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 848w, https://substackcdn.com/image/fetch/$s_!j32u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 1272w, https://substackcdn.com/image/fetch/$s_!j32u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ffde44-1410-48f2-bf07-361aa7a7c0c2_2286x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The DeepSeekMath corpus is created iteratively. During the first iteration of data selection, we train a <a href="https://fasttext.cc/">fastText</a> model to identify high-quality math content by using OpenWebMath [2] as a seed corpus. In other words, the OpenWebMath data is used as positive examples of high-quality math content, and we sample 500,000 data points from CommonCrawl to serve as negative examples (i.e., data that are not math-focused). The fastText model is then trained over this data to classify high-quality math content. After deduplicating the web pages in CommonCrawl, we have ~40B web pages that are then ranked by the output of the fastText model&#8212;<em>the 40B top-scoring tokens are retained for further refinement</em>.</p><p>We further refine this fastText classifier by grouping CommonCrawl into domains with the same base URL. A domain is considered to be &#8220;math-related&#8221; if more than 10% of the pages in this domain have been identified as math-related by the fastText model. Human annotators manually annotate the URLs in these math-related domains, allowing more math-focused examples to be identified for retraining the fastText model. This process is repeated three times, yielding a total of 120B math-focused tokens. Data collection ends after the fourth iteration because authors found that 98% of the identified data was already collected. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!69KI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!69KI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 424w, https://substackcdn.com/image/fetch/$s_!69KI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 848w, https://substackcdn.com/image/fetch/$s_!69KI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!69KI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!69KI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png" width="1261" height="1103" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1103,&quot;width&quot;:1261,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!69KI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 424w, https://substackcdn.com/image/fetch/$s_!69KI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 848w, https://substackcdn.com/image/fetch/$s_!69KI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 1272w, https://substackcdn.com/image/fetch/$s_!69KI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bc925ff-e860-46a5-bb8f-d79c81e28e06_1261x1103.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Is the data good?</strong> To validate the DeepSeekMath corpus&#8217; quality, pretraining experiments are performed over several different datasets. Models trained on the DeepSeekMath corpus clearly lead on all downstream benchmarks. As shown above, the performance of these models has a steeper learning curve, indicating that the average quality of the DeepSeekMath corpus is higher relative to other math-focused corpora. Additionally, this new corpus is multilingual&#8212;<em>primarily</em> <em>English and Chinese</em>&#8212;and nearly an order of magnitude larger than alternatives.</p><p><strong>DeepSeekMath-Base </strong>is the initial base model trained in [1] for mathematical reasoning. It is initialized with the weights of a code model&#8212;<em><a href="http://deepseek-ai/deepseek-coder-7b-base-v1.5">DeepSeek-Coder-7B-Base-v1.5</a> in particular</em>&#8212;and undergoes continued pretraining on 500B tokens from the DeepSeekMath corpus (and other sources like arXiv papers, Github code, and general language data). <a href="https://huggingface.co/deepseek-ai/deepseek-math-7b-base">DeepSeekMath-7B-Base</a> outperforms other open-source base models on mathematical reasoning&#8212;<em>both with and without tool use</em>&#8212;and formal theorem proving tasks. Going further, we see in [1] that DeepSeekMath-7B-Base also retains key capabilities in other domains. For example, its performance on coding and general language / reasoning tasks is still strong.</p><blockquote><p><em>&#8220;DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH&#8230; illustrating the positive impact of math training on language understanding and reasoning&#8230; by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks.&#8221;</em> - from [1]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9pqA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9pqA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 424w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 848w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9pqA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png" width="1456" height="912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:912,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1068654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9pqA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 424w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 848w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 1272w, https://substackcdn.com/image/fetch/$s_!9pqA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3066e0-ac50-4f81-8756-152a7d12f56e_2044x1280.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3, 4, 5])</figcaption></figure></div><p><strong>Instruction tuning.</strong> After continued pretraining, DeepSeekMath-Base undergoes an instruction tuning phase in which the model is trained with <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> over a curated dataset for mathematical reasoning. Authors collect a set of math problems in both English and Chinese that span diverse fields and levels of complexity. Solutions to these problems are created using three different formats (depicted above):</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">Chain of Thought</a></em> [3]: prompts the model to output intermediate reasoning steps prior to its final answer. </p></li><li><p><em><a href="https://cameronrwolfe.substack.com/p/program-aided-language-models">Program of Thoughts</a></em> [4]: separates reasoning from computation by prompting the model to output its reasoning steps as a structured program that is then solved by an external code interpreter. </p></li><li><p><em><a href="https://arxiv.org/abs/2309.17452">Tool-Integrated Reasoning</a></em> [5]: teaches the model to perform complex mathematical reasoning via a trajectory of interleaved natural language reasoning and tool usage (e.g., computation libraries or symbolic solvers). </p></li></ul><p>The final instruction tuning dataset contains a total of 776K examples and is used to train <a href="https://huggingface.co/deepseek-ai/deepseek-math-7b-instruct">DeepSeekMath-7B-Instruct</a>, starting from DeepSeekMath-7B-Base. As shown below, the instruction tuned model outperforms all other open-source models&#8212;<em>even those that are much larger</em>&#8212;on chain of thought and tool-integrated reasoning tasks. The model can perform relatively well with or without tools. DeepSeekMath-7B-Instruct also rivals the performance of proprietary models (e.g., Gemini Pro) in some cases but tends to lag behind top-performing models (e.g., Gemini Ultra and GPT-4), <em>especially in the tool-integrated domain</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lcwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lcwQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 424w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 848w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lcwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png" width="589" height="829.8003838771593" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1468,&quot;width&quot;:1042,&quot;resizeWidth&quot;:589,&quot;bytes&quot;:317556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lcwQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 424w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 848w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!lcwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0164b168-a3fb-457e-b85c-420a419e338d_1042x1468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>RL training with GRPO.</strong> The above table also presents the performance of DeepSeekMath-RL, which undergoes one final RL training phase using GRPO as the underlying optimizer. In fact, GRPO was initially proposed in [1], where authors cite the practicality of GRPO&#8212;<em>specifically</em> <em>its memory efficiency, compute efficiency, and simplicity relative to PPO</em>&#8212;as key design criteria. Although GRPO is usually used in tandem with verifiable rewards, authors in [1] score completions using a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>. Additionally, an outcome reward setting is used, meaning that rewards are assigned at the end of a full completion.</p><blockquote><p><em>&#8220;The group relative way that GRPO&#8230; calculates the advantages aligns well with the comparative nature of rewards models, as reward models are typically trained on datasets of comparisons between outputs on the same question.&#8221;</em> - from [1]</p></blockquote><p><strong>More GRPO details.</strong> DeepSeekMath-7B-Instruct is further trained using GRPO over a subset of data from the instruction tuning set&#8212;<em>some subsets of this data are purposely left out to test generalization capabilities</em>. During training, the objective is regularized via an added KL divergence penalty between the current policy and the SFT model (i.e., DeepSeekMath-7B-Instruct). Interestingly, authors in [1] adopt a <a href="https://huggingface.co/blog/NormalUhr/kl-divergence-estimator-rl-llm">modified estimator</a> of the KL divergence, as shown below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iEi2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iEi2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png" width="555" height="287.7918956043956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:555,&quot;bytes&quot;:201650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iEi2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 424w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 848w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1272w, https://substackcdn.com/image/fetch/$s_!iEi2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30d4720-4453-41e2-ac2e-d1224f9d6837_1539x798.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Different techniques for approximating the KL divergence</figcaption></figure></div><p>Both of these expressions are valid estimators for the KL divergence; see [7] for details.  The estimator for KL divergence that is typically used when training LLMs (top of the figure above) is unbiased but has high-variance. In fact, this estimator can oftentimes be negative in value, whereas the KL divergence is a non-negative metric. In contrast, the estimator used in [1] (bottom of the figure above) is both unbiased and has lower variance&#8212;<em>it is guaranteed to be positive</em>&#8212;which makes it a desirable estimator for the KL divergence. Due to its use in DeepSeekMath, this estimator has also been adopted in public implementations of GRPO (e.g., this estimator is used in the <a href="https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L1831">TRL GRPO trainer</a>).</p><p>Training DeepSeekMath-7B-Instruct with GRPO yields the DeepSeekMath-7B-RL model. During GRPO training, we only perform a single policy update for each batch of data. On the other hand, it is common in PPO to perform 2-4 policy updates over the same batch of data [6]. Additionally, GRPO training uses batch sizes that are quite large&#8212;<em>a total batch size of 1,024 with 16 prompts and a group size of 64 completions</em>. Large batch sizes are characteristic of GRPO and tend to be a practical necessity for the training process to be stable. As mentioned previously, many samples per prompt are needed because we estimate the advantage purely based on other rewards that are observed within a group. </p><p><strong>Impact of RL.</strong> After further RL training, DeepSeekMath-7B-RL is found to outperform all open-source models and the majority of proprietary models. Interestingly, the RL-trained model also outperforms DeepSeekMath-Instruct across all benchmarks, despite the constrained scope of its training data&#8212;<em>only a small subset of the instruction tuning data (i.e., 144K of 776K total examples) is used during RL</em>. This finding suggests that RL training generalizes well and tends to enhance both in-domain and out-of-domain performance.</p><blockquote><p><em>&#8220;Does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.&#8221;</em> - from [1]</p></blockquote><p><strong>Code, math and beyond.</strong> One interesting aspect of the analysis in [1] is the focus upon understanding the interplay between coding and math. As shown in the table below, two training strategies are tested:</p><ul><li><p>A two-stage pipeline that first trains on either code data or general data, then on math data.</p></li><li><p>A one-stage pipeline that (optionally) mixes code data into the math dataset.</p></li></ul><p>In the two-stage pipeline, we see that training the model on coding data&#8212;<em>as opposed to general data</em>&#8212;prior to training on math data benefits the model&#8217;s downstream performance on math benchmarks; see below. This insight motivates initializing DeepSeekMath with the weights of a coding model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5LkF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5LkF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 424w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 848w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5LkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png" width="1456" height="753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197301,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5LkF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 424w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 848w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5LkF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd44d9d9-9e5c-412e-af79-d72e49e14a98_1504x778.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>In the one-stage pipeline, the impact of including code data is mixed. Including code in the data mixture helps to avoid catastrophic forgetting and retain coding abilities. However, this data mixture actually degrades performance on certain math benchmarks&#8212;<em>particularly those that do not permit tool use</em>&#8212;compared to just training on math data. However, this negative result may be due to issues in the data mixture. Namely, the one-stage pipeline uses 150B math tokens and 400B code tokens, which can cause coding capabilities to be prioritized over math. </p><blockquote><p><em>&#8220;We observe the math training also improves model capability on MMLU and BBH benchmarks, indicating it does not only enhance the model&#8217;s mathematical abilities but also amplifies general reasoning capabilities.&#8221;</em> - from [1]</p></blockquote><p>Beyond studying the interplay between code and math, authors in [1] note that math-focused training tends to improve general model capabilities as well. For example, we see that DeepSeekMath models also have improved performance on general benchmarks like <a href="https://huggingface.co/datasets/cais/mmlu">MMLU</a> and <a href="https://github.com/suzgunmirac/BIG-Bench-Hard">BBH</a>, as explained in the quote above. </p><h4><strong><a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a> [8]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jcN8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jcN8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 424w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 848w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jcN8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png" width="1456" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:196660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/177823868?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jcN8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 424w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 848w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!jcN8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51d76a51-4898-4cc2-9c88-3b3d547ab160_1926x1140.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Although GRPO was proposed in [1], the algorithm was more widely popularized by its use in training DeepSeek-R1 [8]. During the early days of LRMs, nearly all high-quality reasoning models&#8212;<em>such as OpenAI&#8217;s o-series models [9]</em>&#8212;were closed-source<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. For this reason, there was a <a href="https://www.interconnects.ai/p/reverse-engineering-openai-o1">lot</a> <a href="https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop">of</a> <a href="https://www.youtube.com/watch?v=6PEJ96k1kiw">speculation</a> outside of top labs about how these models actually worked. <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">DeepSeek-R1</a> [8] was the first open LRM to reach o1-level performance in a transparent way. As detailed in the report, this model is finetuned from DeepSeek-V3 [10]&#8212;<em>a 671 billion parameter <a href="https://cameronrwolfe.substack.com/p/moe-llms">Mixture-of-Experts (MoE) model</a></em>&#8212;using RLVR. The RL training process uses GRPO and is primarily focused on verifiable domains like math and coding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ozKr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ozKr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 424w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 848w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1272w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png" width="1264" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1264,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ozKr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 424w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 848w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1272w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Prior to the popularization of LRMs, the scale of RL training performed with LLMs was (relatively) small&#8212;<em>post-training was a fraction</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a><em> of total LLM training cost</em>. However, a <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">new kind of scaling law</a> emerged with LRMs [8, 9]; see above. Model performance was shown to smoothly improve with respect to:</p><ol><li><p>The amount of compute spent on RL training.</p></li><li><p>The amount of inference-time compute (e.g., by generating multiple outputs or a single output with a longer rationale). </p></li></ol><p>For this reason, the ratio of LLM training cost spent on post-training&#8212;<em>and RL in particular</em>&#8212;has rapidly increased. In [8], we see exactly this, where DeepSeek-R1 undergoes extensive RL training with GRPO to improve its reasoning abilities. </p><p><strong>DeepSeek-R1-Zero</strong> is the first model proposed in [8]. This model is initialized with the weights of DeepSeek-V3 [10] and post-trained with large-scale RL. Unlike a standard post-training procedure, no SFT training is used for training R1-Zero&#8212;<em>the model is trained purely with GRPO</em>. Interestingly, we see in [8] that R1-Zero naturally learns through RL to leverage its reasoning trajectory to solve complex problems. This was the first open research effort to show that reasoning abilities could be developed in an LLM without supervised training.</p><blockquote><p><em>&#8220;DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.&#8221;</em> - from [1]</p></blockquote><p>This model was created by the same authors of DeepSeekMath [1], so R1-Zero also uses GRPO for RL training. Authors cite familiar reasons for this choice:</p><ul><li><p>Reducing the computational cost of RL training.</p></li><li><p>Memory savings from eliminating the critic model. </p></li></ul><p><strong>Verifiable rewards.</strong> Authors in [8] choose to avoid using neural reward models when training R1-Zero due to issues with reward hacking in larger-scale RL training runs. Put simply, <em>if we train the LLM for long enough, it will eventually figure out an exploit for the reward model</em>. To solve this issue, R1-Zero is trained using RLVR&#8212;<em>using only verifiable reward signals makes the RL training process harder to game</em>. More specifically, two types of rewards are used:</p><ol><li><p><em>Accuracy reward</em>: evaluates whether the model&#8217;s response is correct.</p></li><li><p><em>Format reward</em>: enforces a desired format on the model&#8217;s output.</p></li></ol><p>The accuracy reward is computed using task-specific heuristics. For math problems, the model can provide its answer in a specified format, allowing us to verify via basic string matching. Similarly, coding problems can be verified by executing the code produced by the LLM in a sandbox over predefined test cases. In contrast, the format reward simply rewards the model for formatting its output correctly. As shown below, the output format for R1-Zero just uses special tokens to separate the model&#8217;s reasoning process from its final output or answer.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" width="678" height="167.1717032967033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:359,&quot;width&quot;:1456,&quot;resizeWidth&quot;:678,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>Matching o1.</strong> Despite using no SFT, R1-Zero shows clear progress in its reasoning capabilities. The model&#8217;s performance on AIME 2024 is plotted below as RL training progresses. Here, we see that performance improves smoothly with the amount of RL training, eventually reaching parity with o1-preview. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8rFM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:770207,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>A performance comparison between R1-Zero and o1 models from OpenAI is provided below. R1-Zero matches or exceeds the performance of o1-mini in most cases and performs comparably to o1-preview on several tasks. However, R1-Zero is clearly outperformed by o1 models on coding tasks. As we will see, however, this coding issue was fixed in future iterations of the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Xef!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Xef!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 424w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 848w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1272w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png" width="1456" height="367" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:855771,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5Xef!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 424w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 848w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1272w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The beauty of RL.</strong> We might begin to wonder how R1-Zero develops such impressive reasoning capabilities during RL training. Luckily, the model&#8217;s learning process is observable&#8212;<em>we can just monitor the reasoning traces produced by the model over time</em>. By doing this, we see (as shown below) that R1-Zero learns to generate progressively longer chains of thought to improve its reasoning process throughout training. In other words, <em>the model naturally learns that using more inference-time compute is useful for solving difficult reasoning problems</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1809109,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Additionally, R1-Zero learns to do more than just generate a long chain of thought.  Authors in [8] also observe several meaningful behaviors that emerge naturally from RL training. For example, the model develops an ability to reflect upon its own solutions by revisiting and evaluating prior components of its reasoning process. Similarly, the model begins to explicitly test out and explore alternative solutions or approaches while trying to solve a problem.</p><blockquote><p><em>&#8220;The self-evolution of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously.&#8221;</em> - from [8]</p></blockquote><p>Notably, this behavior is not explicitly programmed into the model. Rather, RL allows the model to explore different strategies for arriving at a correct solution. To steer the training process, we reward the model for producing correct answers with proper formatting. From these rewards alone, R1-Zero uses an RL-based &#8220;self-evolution&#8221; process to naturally learn how to solve reasoning problems. <em>We simply create the correct incentives that facilitate the model&#8217;s learning process</em>. </p><p><strong>DeepSeek-R1.</strong> Despite the impressive reasoning abilities of DeepSeek-R1-Zero, the fact that the model is trained purely with RL&#8212;<em>and thus forgoes common best practices for alignment and post-training</em>&#8212;causes it to have some bugs. For example, its readability is poor (e.g., no markdown formatting to make its answers easier to read or parse), and it incorrectly mixes languages together. To solve these issues, authors in [8] train the DeepSeek-R1 model, which uses a multi-stage training process to find a balance between standard LLM capabilities and reasoning.</p><blockquote><p><em>&#8220;To prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor.&#8221;</em> - from [1]</p></blockquote><p><strong>Phase One: SFT Cold Start.</strong> Prior to RL training, R1 is trained via SFT over a small dataset of long CoT examples, which is referred to as &#8220;cold start&#8221; data. This data is collected using a few different approaches:</p><ol><li><p>Prompt a model (e.g., DeepSeek-V3) to produce long CoT data, either with few-shot examples or by instructing the model to generate detailed answers with accompanied reflection and verification.</p></li><li><p>Use the R1-Zero model to generate a large number of long CoT outputs, then ask human annotators to post-process and select the model&#8217;s best outputs.</p></li></ol><p>Authors in [1] combine these approaches to collect &#8220;thousands of cold-start data&#8221; on which DeepSeek-V3 is finetuned directly via SFT. Because we are using long CoT data, <em>this is a reasoning-oriented finetuning process</em>. From this cold start data, the model learns a viable (initial) template for solving reasoning problems. The reasoning-oriented SFT data introduces a human prior into training&#8212;<em>we have control over the style and pattern of data used in this phase</em>. For example, authors in [1] structure the data to include summaries of each long CoT, which teaches the model to summarize its reasoning process prior to its final answer. We are simply setting a stronger seed from which to start the RL self-evolution process<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>. </p><p><strong>Stage Two: Reasoning-Oriented RL.</strong> After SFT, we repeat the large-scale RL training process with GRPO (i.e., the same RL training setup used for R1-Zero) to enhance R1&#8217;s reasoning capabilities. The only change made for R1 is the addition of a language consistency reward&#8212;<em>calculated as the portion of the model&#8217;s output written in the desired target language</em>&#8212;into RLVR. This language consistency reward is shown in [1] to slightly deteriorate the model&#8217;s reasoning capabilities. However, language consistency helps to avoid the language mixing observed in R1-Zero, which makes the model&#8217;s output more fluent and readable.</p><p><strong>Stage Three: Rejection sampling.</strong> After the convergence of reasoning-oriented RL, we use the resulting model to collect a large and diverse SFT dataset. Unlike the initial cold start SFT phase, however, we collect both reasoning-focused and general data, allowing the model to learn from a broader set of domains. The reasoning data for this stage is collected by:</p><ol><li><p>Curating a diverse set of reasoning-based prompts.</p></li><li><p>Generating candidate trajectories using the model from after stage two.</p></li><li><p>Performing rejection sampling (i.e., filtering and selecting the top trajectories based on quality and correctness).</p></li></ol><p>Interestingly, the SFT dataset from this stage includes a substantial ratio of non-reasoning data (e.g., writing or translation examples) that is sourced from the post-training dataset for DeepSeek-V3. To match the style of data used for training R1, this data is augmented by adding a CoT&#8212;<em>generated by another LLM</em>&#8212;to explain the outputs of complex prompts. Simpler prompts are left with no rationale.</p><blockquote><p><em>&#8220;We reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.&#8221;</em> - from [1]</p></blockquote><p>Unlike reasoning-oriented data, we cannot use rule-based verification for general-purpose data. Instead, authors in [8] use DeepSeek-V3 as a <a href="https://arxiv.org/abs/2410.12832">generative reward model</a> or <a href="https://arxiv.org/abs/2408.15240">verifier</a> for this data. After data verification and heuristic filtering (e.g., removing language mixing or long paragraphs), we have a set of 600,000 reasoning examples and 200,000 general-purpose examples, yielding a dataset of 800,000 examples over which we further finetune R1 using SFT. </p><p><strong>Stage Four: RLVR &amp; RLHF.</strong> The final training stage of R1 aligns the model with human preferences while continuing to hone its reasoning abilities. Similarly to the prior stage, we train the model over a combination of reasoning-based data and general-purpose data reused from the training pipeline of DeepSeek-V3. This stage uses RL with two styles of rewards:</p><ul><li><p>Rules-based rewards (same as R1-Zero) for reasoning-based problems.</p></li><li><p>Neural reward models&#8212;<em>trained over human preference pairs, just as in standard RLHF</em>&#8212;for general-purpose data.</p></li></ul><p>DeepSeek-R1 is aligned to be more helpful and harmless&#8212;<em>two <a href="https://arxiv.org/abs/2204.05862">standard alignment criteria</a> for LLMs</em>&#8212;on general data. Each criterion is modeled using a separate <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>. For helpfulness, only the final answer (i.e., excluding the long CoT) from the model is passed into the reward model. On the other hand, harmlessness is predicted by passing the entire reasoning trajectory to the reward model. This combination of verifiable and preference-based (neural) rewards allows R1 to be aligned to human preferences while maintaining strong reasoning abilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Wcf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 424w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 848w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png" width="724" height="412.7197802197802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:573212,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 424w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 848w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>R1 performance.</strong> As shown above, R1 matches or surpasses the performance of OpenAI&#8217;s o1 model on most reasoning tasks. Unlike R1-Zero, R1 also has strong coding abilities and can handle general-purpose tasks due to its hybrid training pipeline. In general, R1 is a capable model that can handle both traditional and reasoning-oriented tasks. However, we should note that differences exist between LRMs and LLMs&#8212;<em>reasoning models are not clearly better in all areas</em>. For example, R1 performs poorly on instruction following benchmarks (e.g., <a href="https://arxiv.org/abs/2311.07911">IF-Eval</a>) compared to standard LLMs. However, this trend is likely to be reversed in the future as the balance between standard LLMs and reasoning continues to be refined.</p><p><strong>Distilled variants of R1.</strong> Given that R1 is a very large model (i.e., 671B parameter MoE), the main R1 model is also <a href="https://cameronrwolfe.substack.com/i/153722335/distilled-models">distilled</a> to create a series of smaller, dense models. A very simple pipeline is adopted for distillation. Beginning with two base models (i.e., <a href="https://arxiv.org/abs/2412.15115">Qwen-2.5</a> and <a href="https://arxiv.org/abs/2407.21783">Llama-3</a>), we simply:</p><ul><li><p>Generate ~800,000 supervised training examples by sampling completions from the full DeepSeek-R1 model.</p></li><li><p>Finetune the base models using SFT over this data.</p></li></ul><p>This is the simplest form of distillation that can be used, which just trains the student on completions from the teacher using SFT. Such an approach is referred to as off-policy distillation [11]. This off-policy distillation procedure works well for the R1 model. In fact, distilling from R1 actually outperforms direct training of smaller models with RL; see below. However, we can usually achieve better performance via logit distillation (i.e., training the student model on the full log probabilities outputted by the teacher for each token) or <a href="https://thinkingmachines.ai/blog/on-policy-distillation/">on-policy distillation</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IhEm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IhEm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 424w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 848w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1272w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png" width="1456" height="240" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:240,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248243,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IhEm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 424w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 848w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1272w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><h2>Conclusion</h2><p>The advent of large reasoning models has completely transformed LLM research, especially the domain of reinforcement learning. For years, research on RL has centered around complex algorithms like PPO that require substantial domain knowledge and extensive compute resources. As a result, much of the research in this area has been confined to a handful of top research labs. This trend has recently changed, however, as open reasoning models and simpler RL algorithms like GRPO have become increasingly popular. Today, there are more public resources than ever before for doing useful research at the intersection of RL and LLMs. Hopefully, the details outlined in this post will contribute to further democratizing research on this important and rapidly evolving topic.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Shao, Zhihong, et al. &#8220;Deepseekmath: Pushing the limits of mathematical reasoning in open language models.&#8221; <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[2] Paster, Keiran, et al. &#8220;Openwebmath: An open dataset of high-quality mathematical web text.&#8221; <em>arXiv preprint arXiv:2310.06786</em> (2023).</p><p>[3] Wei, Jason, et al. &#8220;Chain-of-thought prompting elicits reasoning in large language models.&#8221; <em>Advances in neural information processing systems</em> 35 (2022): 24824-24837.</p><p>[4] Chen, Wenhu, et al. &#8220;Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.&#8221; <em>arXiv preprint arXiv:2211.12588</em> (2022).</p><p>[5] Gou, Zhibin, et al. &#8220;Tora: A tool-integrated reasoning agent for mathematical problem solving.&#8221; <em>arXiv preprint arXiv:2309.17452</em> (2023).</p><p>[6] Lambert, Nathan. &#8220;Reinforcement Learning from Human Feedback.&#8221; Online (2025). </p><p>https://rlhfbook.com</p><p>[7] Schulman, John. &#8220;Approximating KL Divergence.&#8221; Online (2020). <a href="http://joschu.net/blog/kl-approx.html">http://joschu.net/blog/kl-approx.html</a>.</p><p>[8] Guo, Daya, et al. &#8220;Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.&#8221; <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[9] OpenAI et al. &#8220;Learning to Reason with LLMs.&#8221; <em>https://openai.com/index/learning-to-reason-with-llms/</em> (2024).</p><p>[10] Liu, Aixin, et al. &#8220;Deepseek-v3 technical report.&#8221; <em>arXiv preprint arXiv:2412.19437</em> (2024).</p><p>[11] Lu, Kevin et al. &#8220;On-Policy Distillation.&#8221; <a href="https://thinkingmachines.ai/blog/on-policy-distillation/">https://thinkingmachines.ai/blog/on-policy-distillation/</a> (2025).</p><p>[12] Schulman, John, et al. &#8220;Proximal policy optimization algorithms.&#8221; <em>arXiv preprint arXiv:1707.06347</em> (2017).</p><p>[13] Schulman, John, et al. &#8220;High-dimensional continuous control using generalized advantage estimation.&#8221; <em>arXiv preprint arXiv:1506.02438</em> (2015).</p><p>[14] Team, Kimi, et al. &#8220;Kimi k2: Open agentic intelligence.&#8221; <em>arXiv preprint arXiv:2507.20534</em> (2025).</p><p>[15] Khatri, Devvrit, et al. &#8220;The art of scaling reinforcement learning compute for llms.&#8221; <em>arXiv preprint arXiv:2510.13786</em> (2025).</p><p>[16] Ouyang, Long, et al. &#8220;Training language models to follow instructions with human feedback.&#8221; <em>Advances in neural information processing systems</em> 35 (2022): 27730-27744.</p><p>[17] Stiennon, Nisan, et al. &#8220;Learning to summarize with human feedback.&#8221; <em>Advances in neural information processing systems</em> 33 (2020): 3008-3021.</p><p>[18] Bai, Yuntao, et al. &#8220;Training a helpful and harmless assistant with reinforcement learning from human feedback.&#8221; arXiv preprint arXiv:2204.05862 (2022).</p><p>[19] Lambert, Nathan, et al. &#8220;Tulu 3: Pushing frontiers in open language model post-training.&#8221; <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[20] Bespoke Labs et al. &#8220;Scaling up Open Reasoning with OpenThinker-32B.&#8221; <a href="https://www.bespokelabs.ai/blog/scaling-up-open-reasoning-with-openthinker-32b">https://www.bespokelabs.ai/blog/scaling-up-open-reasoning-with-openthinker-32b</a> (2025).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In fact, some researchers argue that the distinction between an LLM and an LRM is an unnecessary gray area&#8212;<em>they are still the same types of models</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Frontier labs <a href="https://arxiv.org/abs/2507.11473">have argued</a> that the LRM&#8217;s chain of thought is a useful artifact for monitoring the model for harmful behavior. To maintain our ability to monitor, the reasoning process is usually kept &#8220;unsafe&#8221;&#8212;<em>we apply no safety post-training to it to ensure that the model does not learn to omit info from its reasoning process for safety purposes</em>. As a result, the reasoning process is potentially unsafe and will be kept that way for monitoring benefits) and cannot be directly exposed to the end user. Alternatively, top labs could be simply omitting the reasoning trajectory to make distilling from their best reasoning models more difficult.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This naming stems from the fact that the surrogate objective is different from the RL training objective. In RL, we aim to maximize cumulative reward. However, directly maximizing this objective can lead to instability. The surrogate is a more stable proxy that can be optimized in place of the true objective.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The critic is very similar to a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>&#8212;<em>both models predict rewards</em>. However, the critic predicts reward per-token, while a reward model usually predicts outcome rewards for an entire completion. Additionally, reward models are usually fixed during RL training while the critic is trained alongside the policy itself.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The bias comes from relying on an approximate value model for this estimate and only using a small amount of exact reward information <code>r_t</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>A commonly used setting for <code>&#955;</code> is ~0.95.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The stop gradient is used here because, when using the GRPO loss function, we are computing the gradient of the loss with respect to our policy. Usually, the policy in the denominator of this expression is the old policy. We consider the output of this policy to be a constant when computing the gradient. When performing only a single policy update per batch of data, the old policy is equal to our current policy, but we still consider this denominator term a constant when computing the gradient. This is accomplished via the stop gradient operation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>For example, hosting Qwen-3-32B in half precision with its full context length (131K tokens) would increase the memory footprint from ~70GB to ~400GB.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This exact number will vary drastically depending on our exact training settings. For example, this calculation assumes that we are using the <a href="https://arxiv.org/abs/1711.05101">AdamW</a> optimizer, which maintains three separate optimizer states for every model parameter at full precision (<a href="https://kaitchup.substack.com/p/fine-tuning-llms-with-32-bit-8-bit">default setting for AdamW parameters and optimizer states</a>). We can reduce memory by using an <a href="https://huggingface.co/docs/bitsandbytes/main/en/optimizers">8-bit AdamW optimizer</a>. Additionally, we can adopt various sharding (e.g., <a href="https://arxiv.org/abs/1910.02054">ZeRO</a>, <a href="https://arxiv.org/abs/2304.11277">FSDP</a>, and more) or <a href="https://docs.pytorch.org/docs/stable/distributed.pipelining.html">pipelining</a> strategies if we have multiple GPUs or nodes available for training to reduce per-GPU memory consumption significantly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The implementation also draws upon code from a prior <a href="https://cameronrwolfe.substack.com/p/ppo-llm">PPO tutorial</a>, as well as the implementation of <a href="https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py">GRPO in TRL</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Some open reasoning models like <a href="https://qwen.ai/blog?id=468238499cc16b40068fbf0cbf9456a66e7624e8">QwQ</a> preceded the release of DeepSeek-R1.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>The cost of training an LLM is dominated by pretraining. However, the cost of post-training can still be expensive, especially when human data annotation is considered; see <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">here</a> for more details. Therefore, the ratio of cost spent on post-training varies, but it would generally be &lt;10% of the total LLM training cost.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>See <a href="https://cameronrwolfe.substack.com/i/153722335/deepseek-r">here</a> for more info on the role of SFT in training reasoning models.</p></div></div>]]></content:encoded></item><item><title><![CDATA[PPO for LLMs: A Guide for Normal People]]></title><description><![CDATA[Understanding the complex RL algorithm that gave us modern LLMs&#8230;]]></description><link>https://cameronrwolfe.substack.com/p/ppo-llm</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/ppo-llm</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 27 Oct 2025 09:33:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/61f107c1-95cb-4438-84b9-8d87c9cdc04f_2502x1408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PJsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PJsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 424w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 848w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PJsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1257053,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PJsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 424w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 848w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!PJsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8db8bc5-f39d-4d1a-be16-26e0c0eb01a7_2502x1398.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4, 5, 8])</figcaption></figure></div><p>Over the last several years, reinforcement learning (RL) has been one of the most impactful areas of research for large language models (LLMs). Early research used RL to align LLMs to human preferences, and this initial work on applying RL to LLMs relied almost exclusively on Proximal Policy Optimization (PPO) [1]. This choice led PPO to become the default RL algorithm in LLM post-training for years&#8212;<em>this is a long reign given the fast pace of LLM research</em>! Only in recent work on LLM reasoning have researchers begun to use alternative algorithms like GRPO.</p><p>Despite its importance, PPO is poorly understood outside of top research labs. This lack of understanding is for good reason. <em>Not only is PPO a complicated algorithm packed with nuanced implementation details</em>, but its high compute and memory overhead make experimentation difficult without extensive compute resources. Successfully leveraging PPO requires both a deep understanding of the algorithm and substantial domain knowledge or practical experience.</p><p>This overview will begin with basic concepts in RL and develop a detailed understanding of PPO step-by-step. Building on this foundation, we will explain key practical considerations for using PPO, including pseudocode for PPO and its various components. Finally, we will tie all of this knowledge together by examining several seminal works that popularized PPO in the LLM domain.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to stay up-to-date with AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Reinforcement Learning (RL) Preliminaries</h2><p>Before learning more about PPO, we need to learn about RL in general. This section will cover basic problem setup and terminology for RL. Additionally, we will derive a simple policy gradient expression, which forms a basis for PPO.</p><h4><strong>Problem Setup and Terminology</strong></h4><p>When running RL training, we have an <strong>agent</strong> that takes <strong>actions</strong> within some <strong>environment</strong>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQCe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQCe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 424w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 848w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1272w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png" width="1456" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139371,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lQCe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 424w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 848w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1272w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Basic problem setup for RL</figcaption></figure></div><p>These actions are predicted by a <strong>policy</strong>&#8212;<em>we can think of the policy as the agent&#8217;s brain</em>&#8212;that is usually parameterized. For example, the policy is the LLM itself in the context of training LLMs. We can model the probability of a given action under our policy as <code>&#960;_&#952;(a_t | s_t)</code>. When the policy outputs an action, the <strong>state</strong> of the environment will be updated according to a <strong>transition function</strong>, which is part of the environment. We will denote our transition function as <code>P(s_t+1 | a_t, s_t)</code>. However, transition functions are less relevant for LLMs because they are typically a pass-through; i.e., we assume <code>s_t = {x, a_1, a_2, &#8230;, a_t}</code>, where <code>x</code> is the prompt.</p><p>Finally, each state visited by the agent receives a <strong>reward</strong> from the environment that may be positive, negative, or zero (i.e., no reward). As shown in the prior figure, our agent acts iteratively and each action (<code>a_t</code>), reward (<code>r_t</code>), and state (<code>s_t</code>) are associated with a time step <code>t</code>. Combining these time steps together yields a <strong>trajectory</strong>; see below. Here, we assume that the agent takes a total of <code>T</code> steps in the environment for this particular trajectory.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cjh1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cjh1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 424w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 848w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1272w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png" width="1456" height="245" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108505,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cjh1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 424w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 848w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1272w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using the chain rule of probabilities, we can also compute the probability of a full trajectory by combining the probabilities of:</p><ul><li><p>Each action <code>a_t</code> given by our policy <code>&#960;_&#952;(a_t | s_t)</code>.</p></li><li><p>Each state <code>s_t+1</code> given by the transition function <code>P(s_t+1 | a_t, s_t)</code>.</p></li></ul><p>The full expression for the probability of a trajectory is provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YCeT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YCeT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 424w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 848w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1272w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png" width="650" height="239.28571428571428" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:1456,&quot;resizeWidth&quot;:650,&quot;bytes&quot;:245378,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YCeT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 424w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 848w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1272w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Computing the probability of a trajectory</figcaption></figure></div><p><strong>RL objective.</strong> When training a model with RL, our goal is to maximize the cumulative reward over the entire trajectory (i.e., the sum of <code>r_t</code>). However, there are a few variations of this objective that commonly appear. Specifically, the reward that we maximize can either be discounted or non-discounted<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>; see below. By incorporating a discount factor <code>&#947;</code>, we reward our policy for achieving rewards sooner rather than later. In other words, <em>money now is better than money later</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8D_n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8D_n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 424w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 848w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1272w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png" width="496" height="290.5824175824176" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:496,&quot;bytes&quot;:158346,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8D_n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 424w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 848w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1272w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our objective is usually expressed as an expected cumulative reward, where the <a href="https://en.wikipedia.org/wiki/Expected_value">expectation</a> is taken over the trajectory. Expanding this expectation yields a sum over trajectories weighted by their probabilities. We can formulate this in a continuous or discrete manner; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!45io!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!45io!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 424w, https://substackcdn.com/image/fetch/$s_!45io!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 848w, https://substackcdn.com/image/fetch/$s_!45io!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1272w, https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png" width="522" height="272.83104395604397" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:235822,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!45io!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 424w, https://substackcdn.com/image/fetch/$s_!45io!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 848w, https://substackcdn.com/image/fetch/$s_!45io!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1272w, https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>State, value, and advantage functions.</strong> Related to RL objective, we can also define the following set of functions:</p><ul><li><p><em>Value Function</em> <code>V(s)</code>: the expected cumulative reward when you start in state <code>s</code> and act according to your current policy <code>&#960;_&#952;</code>.</p></li><li><p><em>Action-Value Function</em> <code>Q(s, a)</code>: the expected cumulative reward when you start in state <code>s</code>, take action <code>a</code>, then act according to your policy <code>&#960;_&#952;</code>.</p></li><li><p><em>Advantage Function</em> <code>A(s, a)</code>: the difference between the action-value and value function; i.e., <code>A(s, a) = Q(s, a) - V(s)</code>.</p></li></ul><p>Intuitively, the advantage function tells us how useful some action <code>a</code> is by taking the difference between the expected reward after taking action <code>a</code> in state <code>s</code> and the general expected reward from state <code>s</code>. The advantage will be positive if the reward from action <code>a</code> is higher than expected and vice versa. Advantage functions play a huge role in RL research&#8212;<em>they are used to compute the gradient for our policy</em>.</p><blockquote><p><em>&#8220;Sometimes in RL, we don&#8217;t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.<strong>&#8221;</strong></em> - <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html">Spinning up in Deep RL</a></p></blockquote><h4>RL Formulation for LLMs</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RBDE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RBDE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RBDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png" width="506" height="426.30312035661217" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1346,&quot;resizeWidth&quot;:506,&quot;bytes&quot;:117379,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RBDE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 424w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 848w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!RBDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b8b6b8-fe96-4b70-87d2-038a3b3511cf_1346x1134.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">RL terminology mapping for LLMs</figcaption></figure></div><p>Now that we understand RL basics, we need to map the terminology that we have learned to the setting of LLM training. We can do this as follows (shown above):</p><ul><li><p>Our <strong>policy</strong> is the LLM itself.</p></li><li><p>Our <strong>initial state</strong> is the prompt.</p></li><li><p>The LLM&#8217;s output&#8212;<em>either each token or the entire completion</em>&#8212;is an <strong>action</strong>.</p></li><li><p>Our <strong>state</strong> is the combination of our prompt with the LLM&#8217;s output.</p></li><li><p>The entire completion from the LLM forms a <strong>trajectory</strong>.</p></li><li><p>The <strong>reward</strong> comes from a verifier or reward model (more details to follow).</p></li></ul><p>Notably, there is no transition function in this setup because the transition function is completely deterministic. If we start with a prompt <code>x</code> and our LLM predicts tokens <code>t_1</code> and <code>t_2</code> given this prompt as input, then our updated state simply becomes <code>s_2 = {x, t_1, t_2}</code>. In other words, <em>our state is just the running completion being generated by the LLM for a given prompt </em><code>x</code>.</p><p><strong>MDP formulation.</strong> For LLMs, there are two key ways in which RL can be formulated that differ in how they model actions:</p><ol><li><p><em>Bandit formulation</em>: the entire completion or response from the LLM is modeled as a single action.</p></li><li><p><em>Markov Decision Process (MDP) formulation</em>: each token within the LLM&#8217;s output is modeled as an individual action.</p></li></ol><p>We outlined the details for both of these formulations in a <a href="https://cameronrwolfe.substack.com/i/173306894/markov-decision-process-mdp-versus-bandit-formulation">prior overview</a>. However, PPO relies upon the MDP formulation, so we will primarily focus upon the MDP formulation here. As we should recall, an LLM generates output via <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a>; i.e., by generating each token in the output completion sequentially. This autoregressive process is depicted below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QUg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QUg4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 424w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 848w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1272w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png" width="682" height="355.0521978021978" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1456,&quot;resizeWidth&quot;:682,&quot;bytes&quot;:144540,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QUg4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 424w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 848w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1272w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Autoregressive next token prediction with an LLM</figcaption></figure></div><p>Next token prediction maps easily to an RL setup&#8212;<em>we can model each token as an action</em>! This setup is called the <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov Decision Process (MDP)</a> formulation. An MDP is a probabilistic framework for modeling decision-making that includes states, actions, transition probabilities and rewards&#8212;<em>this is exactly the setup we have discussed so far for RL</em>! The MDP formulation used for RL is shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KWz-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KWz-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 424w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 848w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1272w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png" width="540" height="337.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:119785,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KWz-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 424w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 848w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1272w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When modeling RL as an MDP for LLMs, our initial state is the prompt and our policy acts by predicting individual tokens. Our LLM forms a (stochastic) policy that predicts a probability distribution over tokens. During generation, actions are taken by selecting a token from this distribution&#8212;<em>each token is its own action</em>. After a token is predicted, it is added to the current state and used by the LLM to predict the next token&#8212;<em>this is just autoregressive next token prediction</em>! Eventually, the LLM predicts a stop token (e.g., <code>&lt;|end_of_text|&gt;</code> or <code>&lt;eos&gt;</code>) to complete the generation process, thus yielding a complete trajectory.</p><h4>Policy Gradient Basics</h4><p>During RL training, we want to maximize our objective&#8212;<em>the cumulative (possibly discounted) reward</em>. To accomplish this, we can just use <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient ascent</a>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!slrY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!slrY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 424w, https://substackcdn.com/image/fetch/$s_!slrY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 848w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1272w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png" width="1456" height="184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3072897-d905-42be-b385-6186c24ae059_2390x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153828,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!slrY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 424w, https://substackcdn.com/image/fetch/$s_!slrY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 848w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1272w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Solving the RL objective with gradient ascent</figcaption></figure></div><p>To put this in the context of LLMs, RL training follows the sequence of steps shown below. We first sample a batch of prompts and generate completions to these prompts with our LLM or policy. Then, we compute the rewards for these completions (more details to follow in later sections) and use these rewards to derive a policy update. <em>This final policy update step is where gradient ascent is used</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yR8D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yR8D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 424w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 848w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 1272w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yR8D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png" width="450" height="299.4080505130229" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1267,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:158014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yR8D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 424w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 848w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 1272w, https://substackcdn.com/image/fetch/$s_!yR8D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7b374-8bee-45fb-b7ee-a26008aa7259_1267x843.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Key steps in RL training for LLMs</figcaption></figure></div><p>To be more specific, we use the completions and rewards to estimate the gradient of the RL training objective with respect to the parameters of our policy&#8212;<em>this is called the &#8220;policy gradient&#8221;</em>. If we can compute this gradient, then we can train our policy using gradient ascent. But, the question is: <em>How do we compute this gradient?</em></p><blockquote><p><em>&#8220;The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. The policy gradient methods target at modeling and optimizing the policy directly.&#8221;</em> - <a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">Lilian Weng</a></p></blockquote><p><strong>Policy gradients.</strong> Nearly all RL optimizers used for LLM training (e.g., PPO [1], <a href="https://arxiv.org/abs/2402.03300">GRPO</a>, and <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a>) are policy gradient algorithms, which operate by <em>i)</em> estimating the policy gradient and <em>ii)</em> performing gradient ascent with this estimate. These algorithms use different approaches for estimating the policy gradient, but the high-level idea behind all of them is quite similar&#8212;<em>we just tweak small details depending on the exact technique being used</em>. To understand policy gradient algorithms more deeply, we will first derive the simplest form of a policy gradient. Then, we will extend this idea to recover more intricate policy gradient algorithms like Trust Region Policy Optimization (TRPO) [6] and PPO [1].</p><p>The <strong>Vanilla Policy Gradient (VPG)</strong> has been extensively covered by many online resources. Other useful explanations of the VPG include:</p><ul><li><p>Intro to Policy Optimization from OpenAI [<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">link</a>]</p></li><li><p>RLHF Book from <a href="https://natolambert.com/">Nathan Lambert</a> [<a href="https://rlhfbook.com/c/11-policy-gradients.html">link</a>]</p></li><li><p>Policy Optimization Algorithms from <a href="https://lilianweng.github.io/">Lilian Weng</a> [<a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">link</a>]</p></li><li><p>Policy Gradient Algorithms from this blog<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> [<a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">link</a>]</p></li></ul><p>However, we will again derive some simple forms of the policy gradient here for completeness. As we already know, our goal in RL is to maximize cumulative rewards. If we try to compute the gradient of this objective with respect to the parameters of our policy <code>&#952;</code>, we can derive the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GetI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GetI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 424w, https://substackcdn.com/image/fetch/$s_!GetI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 848w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1272w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png" width="1456" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GetI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 424w, https://substackcdn.com/image/fetch/$s_!GetI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 848w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1272w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">source</a>)</figcaption></figure></div><p>This derivation starts with the gradient of our RL training objective (cumulative reward) and ends with a basic expression for the policy gradient. The steps used in this derivation are enumerated above. The only complicated steps here are the use of the <a href="https://andrewcharlesjones.github.io/journal/log-derivative.html">log-derivative trick</a> and the final step, which leverages our definition for the probability of a trajectory. In the final step, we substitute in our definition for the probability of a trajectory and observe that the gradients of the initial state probability and transition function with respect to the policy parameters are always zero because neither of them depend on the policy; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rkmm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 424w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 848w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1272w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png" width="620" height="171.1813186813187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 424w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 848w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1272w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">source</a>)</figcaption></figure></div><p><strong>Implementing a basic policy gradient.</strong> The basic policy gradient expression we have derived so far is theoretical&#8212;<em>it involves an expectation</em>. If we want to actually compute this gradient in practice, we must approximate it with a sample mean. In other words, we sample a fixed number of trajectories&#8212;<em>or prompts and completions in the case of an LLM</em>&#8212;and take an average over the policy gradient expression for each of these trajectories. The basic policy gradient expression contains two key quantities that we already know how to compute:</p><ul><li><p>The reward comes directly from a verifier or reward model.</p></li><li><p>Log probabilities of actions can be computed with our LLM (i.e., these are just the token probabilities from the LLM&#8217;s output).</p></li></ul><p>To make the process of computing the basic policy gradient more concrete, a step-by-step implementation in PyTorch pseudocode has been provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PYzF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PYzF!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 424w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 848w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1272w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:&quot;[animate output image]&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!PYzF!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 424w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 848w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1272w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One key detail that we should notice in the above implementation is that we do not compute the policy gradient directly. Rather, we formulate a loss function for which the gradient is equal to the policy gradient then use <a href="https://en.wikipedia.org/wiki/Automatic_differentiation">autodiff</a> in PyTorch to compute the policy gradient&#8212;<em>this happens during </em><code>loss.backward()</code>. The exact loss function used to compute the policy gradient is shown below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TwP0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TwP0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 424w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 848w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 1272w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TwP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png" width="604" height="221.9368131868132" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:535,&quot;width&quot;:1456,&quot;resizeWidth&quot;:604,&quot;bytes&quot;:135252,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TwP0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 424w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 848w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 1272w, https://substackcdn.com/image/fetch/$s_!TwP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4bb2d85-fdea-4cfc-a46b-e6c5f78ff4f4_1613x593.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Creating a loss function for the policy gradient</figcaption></figure></div><p>This distinction is important to understand because we will formulate PPO (and TRPO!) via a loss function rather than a direct expression for the policy gradient. </p><p><strong>Problems with the basic policy gradient.</strong> The basic policy gradient expression is straightforward, but it suffers from several notable issues:</p><ul><li><p><em>High Variance</em>: The gradient estimates can have high variance, making training unstable.</p></li><li><p><em>Unstable Policy Updates</em>: There is no mechanism to prevent large, potentially destabilizing updates to the policy.</p></li></ul><p>Due to the high variance, accurately estimating the policy gradient often requires sampling many trajectories per training iteration, which is computationally expensive. We must generate many completions with the LLM and compute the rewards and token log probabilities for all of these completions. </p><p>Additionally, this high variance increases the risk of training instability&#8212;<em>large and inaccurate updates could potentially cause significant harm to our policy</em>. To solve these issues, most policy gradient algorithms focus on reducing the variance of policy gradient estimates and enforcing a trust region on policy updates (i.e., limiting how much the policy can change in a single update).</p><blockquote><p><em>&#8220;Taking a step with this gradient pushes up the log-probabilities of each action in proportion to </em><code>R(&#120591;)</code><em>, the sum of all rewards ever obtained.&#8221;</em> - <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">Spinning up in Deep RL</a></p></blockquote><p><strong>Reward-to-go.</strong> For example, we see in our basic policy gradient (copied below for reference) that we are increasing the probability of a given action based upon the cumulative reward of a trajectory. Therefore, we may increase the probability of an action due to rewards that were observed before the action even occurred!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ymws!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ymws!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 424w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 848w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ymws!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png" width="499" height="79.17176128093159" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:218,&quot;width&quot;:1374,&quot;resizeWidth&quot;:499,&quot;bytes&quot;:51212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ymws!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 424w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 848w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 1272w, https://substackcdn.com/image/fetch/$s_!Ymws!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b14bade-8617-4bfa-9e4a-59811bbe8de7_1374x218.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Basic policy gradient expression</figcaption></figure></div><p>This simple observation led to the creation of the &#8220;reward-to-go&#8221; policy gradient; see below. This modified policy gradient expression just replaces the cumulative reward with the sum of rewards observed after an action. Using the <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#expected-grad-log-prob-lemma">EGLP lemma</a>, we can show that this reward-to-go formulation is an unbiased estimator of the policy gradient. Additionally, the reward-to-go policy gradient has provably lower variance compared to the basic policy gradient expression from before. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s3m9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s3m9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 424w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 848w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 1272w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s3m9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png" width="1456" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s3m9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 424w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 848w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 1272w, https://substackcdn.com/image/fetch/$s_!s3m9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c4ac85-74ac-4c12-8d51-c6c9b3bf22ba_2216x460.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The reward-to-go policy gradient</figcaption></figure></div><p><strong>Baselines.</strong> To further reduce variance, we can also add a baseline to our policy gradient expression; see below. Similarly to the reward-to-go policy gradient, we can use the EGLP lemma to show that a baselined version of our policy gradient is unbiased and has lower variance. Due to the EGLP lemma, this baseline must only depend upon the current state (i.e., otherwise an assumption of the EGLP lemma is violated and the proofs are no longer valid).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QhBq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QhBq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 424w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 848w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 1272w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QhBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png" width="587" height="113.69093406593407" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:1456,&quot;resizeWidth&quot;:587,&quot;bytes&quot;:105481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QhBq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 424w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 848w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 1272w, https://substackcdn.com/image/fetch/$s_!QhBq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4801db8-b3f3-4ec3-9d3f-624b8ffbd550_1774x344.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Adding a baseline to our policy gradient expression</figcaption></figure></div><p>This expression is nearly identical to the reward-to-go policy gradient&#8212;<em>we just subtract an additional baseline from the reward-to-go term</em>. There are many possible choices for baselines that can be used in policy gradient estimates. One common baseline is the value function. <em>Using the value function as a baseline positively reinforces actions that achieve a cumulative reward that is higher than expected.</em></p><div class="pullquote"><p><em>A common problem with vanilla policy gradient algorithms is the high variance in gradient updates&#8230; In order to alleviate this, various techniques are used to normalize the value estimation, called baselines. Baselines accomplish this in multiple ways, effectively normalizing by the value of the state relative to the downstream action (e.g. in the case of Advantage, which is the difference between the Q value and the value). The simplest baselines are averages over the batch of rewards or a moving average. - <a href="https://rlhfbook.com/c/11-policy-gradients.html">RLHF book</a></em></p></div><p><strong>Generic policy gradient.</strong> In [3], the options for computing the policy gradient were summarized with a more generic policy gradient expression; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vl-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vl-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 424w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 848w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 1272w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vl-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png" width="1456" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:663697,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vl-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 424w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 848w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 1272w, https://substackcdn.com/image/fetch/$s_!Vl-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58aa8bae-6778-4ec0-ac53-3f8b8550390f_2137x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>This expression is nearly identical to expressions we have seen so far. The only difference is that we have changed our reward term <code>R(&#120591;)</code> to a generic <code>&#936;_t</code> term, which can be set equal to several different expressions. For example, we can:</p><ul><li><p>Set <code>&#936;_t = R(&#120591;)</code> to recover our basic policy gradient expression.</p></li><li><p>Set <code>&#936;_t</code> equal to rewards received after time <code>t</code> to recover our reward-to-go variant of the policy gradient.</p></li><li><p>Set <code>&#936;_t</code> equal to a baselined version of the reward; e.g., the difference between cumulative reward <code>R(&#120591;)</code> and the value function <code>V(s_t)</code>.</p></li><li><p>Set <code>&#936;_t</code> equal to the state-action (<code>Q</code>) or advantage function (<code>A</code>).</p></li></ul><p>Despite the many possible formulations, PPO&#8212;<em>and nearly all of the RL optimizers used in the domain of LLMs</em>&#8212;focuses upon setting <code>&#936;_t</code> equal to the advantage function <code>A(s_t, a_t)</code>. <em>This setting is referred to as the vanilla policy gradient (VPG)</em>; see below. In theory, the VPG yields the lowest-variance gradient estimate.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1PL6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1PL6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 424w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 848w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1272w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png" width="482" height="136.3901098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1456,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1PL6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 424w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 848w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1272w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The vanilla policy gradient</figcaption></figure></div><p>Although the VPG has low variance, there is still no mechanism to enforce a trust region in the policy update&#8212;<em>a large and destructive policy update can still destabilize the training process</em>. PPO was created as a solution to this problem. As we will see, PPO resembles the basic policy gradient expressions we have seen but has added mechanisms for enforcing a trust region on the policy update. We will now learn more about PPO and the many practical details involved in its implementation. </p><h2>Proximal Policy Optimization (PPO)</h2><p>Now that we understand RL basics, we will spend the next section learning about Proximal Policy Optimization (PPO) [1]. This explanation will build upon the VPG expression that we derived in the last section, beginning with Trust Region Policy Optimization (TRPO) [6]&#8212;<em>a predecessor to PPO</em>. TRPO is effective at stabilizing training, but it is also relatively complex. PPO was developed as a more practical alternative with similar benefits. To conclude the section, we will also cover Generalized Advantage Estimation (GAE) [3], which is the most common approach for computing the advantage function in PPO. </p><h4><a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization (TRPO)</a> [6]</h4><blockquote><p><em>&#8220;TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of &#946; that performs well across different problems&#8212;or even within a single problem, where the characteristics change over the course of learning.&#8221;</em> - from [1]</p></blockquote><p>Prior to learning about PPO, we need to take a look at its predecessor, Trust Region Policy Optimization (TRPO) [6]. The key motivation behind TRPO is creating an algorithm that is data efficient and does not require too much hyperparameter tuning. To do this, authors in [6] propose the constrained objective below, <em>which is guaranteed to monotonically improve our policy</em>. This objective enforces a trust region on the policy update, thus eliminating the risk of large and destructive policy updates that could destabilize training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x5A5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x5A5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 424w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 848w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 1272w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x5A5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png" width="656" height="345.57142857142856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:656,&quot;bytes&quot;:288550,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x5A5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 424w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 848w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 1272w, https://substackcdn.com/image/fetch/$s_!x5A5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a9c1514-c3dd-4692-bb7a-d63644987d5e_1784x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Surrogate objective for TRPO (from [1])</figcaption></figure></div><p><strong>Surrogate objective.</strong> This objective shown above is called the surrogate objective in TRPO. This naming stems from the fact that the surrogate objective is different from the standard RL training objective. In RL, we aim to maximize cumulative reward, but&#8212;<em>as we have seen in our discussion of the VPG</em>&#8212;directly maximizing this &#8220;true&#8221; objective of RL can lead to training instability. TRPO formulates the surrogate objective to maximize in place of the true objective. </p><p>There are a few noticeable differences between the above expression for TRPO and the VPG:</p><ul><li><p>Action probabilities in the current policy are normalized by the probability of that action in the old policy (i.e., the policy prior to training)&#8212;<em>this forms the policy ratio (also called an importance ratio)</em>. We also use probabilities in this formulation instead of log probabilities. </p></li><li><p>There is a constraint placed on the objective to ensure that the expected KL divergence between the new and old policies is less than a threshold <code>&#948;</code>. </p></li></ul><p>Otherwise, the TRPO loss function shares a similar structure to that of VPG&#8212;<em>it includes the advantage function and a sum over token-level probabilities in a trajectory</em>. </p><p><strong>Policy ratio.</strong> The centerpiece of the TRPO loss function is the policy ratio, defined as shown below. The policy ratio tells us how much more likely a given action is in our current policy relative to the probability of that action before the training process started&#8212;<em>this is denoted as the &#8220;old&#8221; policy</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXsZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png" width="580" height="230.24725274725276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:580,&quot;bytes&quot;:203606,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IXsZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 424w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 848w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1272w, https://substackcdn.com/image/fetch/$s_!IXsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7d1530-a2cc-48c6-9e95-8571b781ba35_1994x792.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The policy (or importance) ratio</figcaption></figure></div><p>This quantity serves the purpose of assigning an importance to different actions within our trajectory. If the new policy assigns a higher probability to an action than the old policy did, this ratio is greater than one, increasing the influence of that action&#8217;s advantage in the objective. Conversely, if the new policy assigns a lower probability, the ratio is less than one, reducing the influence of that action. The policy ratio ensures that the policy update emphasizes actions that the new policy is making more likely&#8212;<em>especially if those actions have high advantage</em>&#8212;while suppressing actions that are becoming less likely under the new policy. By doing this, we ensure that the update is properly weighted according to how the new policy differs from the old, enabling stable and efficient policy improvement. </p><p><strong>Solving the surrogate objective.</strong> Although this objective yields stable policy updates, solving it can be quite involved. By introducing an explicit constraint into our objective, we eliminate the ability to solve this objective with simple gradient ascent<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Instead, we have to solve this objective via the more complex <a href="https://en.wikipedia.org/wiki/Conjugate_gradient_method">conjugate gradient algorithm</a>. Alternatively, we could remove this constraint and instead add the KL divergence as a penalty into our loss function; see below. This unconstrained loss is simpler and can again be solved with basic gradient ascent. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fFIz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fFIz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 424w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 848w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 1272w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fFIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png" width="1456" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fFIz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 424w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 848w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 1272w, https://substackcdn.com/image/fetch/$s_!fFIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F301f1d55-7e7c-4c2f-8138-67a3bc162338_1872x388.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The penalty objective for TRPO</figcaption></figure></div><p><strong>From TRPO to PPO.</strong> Formulating the constraint from TRPO as a penalty allows us to avoid complicated optimization techniques and rely upon basic gradient ascent. However, a new hyperparameter &#946; is introduced to the optimization process that makes tuning difficult. Properly setting the value of &#946; is essential for this objective to perform well, and finding a single value of &#946; that generalizes to many domains is hard. As a result, both of the above objectives have their issues:</p><ul><li><p>The TRPO surrogate objective is too complex to solve in practice.</p></li><li><p>The reformulated penalty objective is sensitive to the setting of &#946;.</p></li></ul><p>We want to develop an algorithm that retains the benefits of TRPO&#8212;<em>such as stability, data efficiency, and reliability</em>&#8212;while avoiding its complexity. Ideally, the algorithm should be broadly applicable and solvable using basic gradient ascent. These goals led to the proposal of PPO, which is largely inspired by TRPO. PPO&#8217;s objective is inspired by the TRPO surrogate objective but replaces the hard KL constraint with a clipping mechanism to enforce a trust region in a simpler way.</p><h4><a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization Algorithms</a> [1]</h4><blockquote><p><em>&#8220;We propose a new family of policy gradient methods for RL, which alternate between sampling data through interaction with the environment, and optimizing a surrogate objective function using stochastic gradient ascent.&#8221;</em> - from [1]</p></blockquote><p>The VPG is simple to compute in practice, but it has poor data efficiency (i.e., the model must be trained over many samples to perform well) and high variance in the policy updates. These problems are largely solved by TRPO but at the cost of significant added complexity. PPO is an algorithm with the data efficiency and reliability benefits of TRPO that is still solvable with gradient ascent. In this way, PPO is a simpler algorithm compared to TRPO. As we will see, however, <em>PPO is still a complex algorithm with many implementation complexities of its own</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1nc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png" width="624" height="216.42857142857142" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:168867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S1nc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 424w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 848w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1272w, https://substackcdn.com/image/fetch/$s_!S1nc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc38f9ea3-d07f-4240-898e-de3c75e66878_2264x786.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Update procedure in PPO (from [1])</figcaption></figure></div><p><strong>Training process.</strong> Similarly to TRPO, PPO focuses upon optimizing a surrogate objective, but the objective in PPO has no constraint and has been slightly modified. As shown in the algorithm above, PPO performs more than a single policy update in each step, instead alternating between:</p><ol><li><p>Sampling new data or trajectories from the policy.</p></li><li><p>Performing several epochs of optimization on the sampled data. </p></li></ol><p><strong>The PPO surrogate objective</strong> is again based upon the policy ratio between the current policy and the old model (i.e., the policy before any training is performed). To match notation in [1], we will denote the policy ratio as <code>r_t(&#952;)</code>, which is similar to the <code>r_t</code> notation used for the reward for time step <code>t</code>. However, <em>the policy ratio is unrelated to the reward</em>! To obtain the PPO objective, we start with the surrogate objective being maximized by TRPO with no KL constraint; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fqSm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fqSm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 424w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 848w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 1272w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fqSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png" width="520" height="178.82014388489208" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1390,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:115075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fqSm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 424w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 848w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 1272w, https://substackcdn.com/image/fetch/$s_!fqSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80447ac5-6fd2-4cbb-b33c-a4e385e7fc2c_1390x478.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The unclipped PPO objective</figcaption></figure></div><p>We will call this formulation the &#8220;unclipped&#8221; objective. Because it does not have a constraint, this objective can be easily computed to derive the policy gradient by <em>i)</em> estimating the advantage and <em>ii)</em> computing the policy ratio. However, if we try to maximize this unconstrained objective, this will potentially lead to large and destructive policy gradient updates that make the training process unstable. To solve this issue, PPO introduces a novel clipping mechanism into the surrogate objective that helps us with maintaining the trust region; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oHJG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png" width="1456" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121736,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oHJG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 424w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 848w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1272w, https://substackcdn.com/image/fetch/$s_!oHJG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f6be9f2-f165-4e48-be0c-e63074454d2a_2003x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The PPO surrogate objective</figcaption></figure></div><p>The main term in the objective is unchanged, but there is an added term with a clipped version of the policy ratio&#8212;<em>the policy ratio must fall in the range </em><code>[1 - &#949;, 1 + &#949;]</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The clipping term disincentivizes the RL training process from moving the policy ratio away from a value of one. The PPO surrogate objective takes the minimum of clipped and unclipped objectives. In this way, <em>the PPO objective is a pessimistic (lower) bound for the original, unclipped objective</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ovlv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png" width="1456" height="605" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:605,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ovlv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 424w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 848w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1272w, https://substackcdn.com/image/fetch/$s_!ovlv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38769a7f-6549-4fed-ab3e-f829185b5069_1544x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Depending upon whether the advantage is positive or negative, the behavior of clipping is slightly different; see above. The use of a minimum in the surrogate objective causes clipping to be applied in only one direction. In particular, we can arbitrarily <em>decrease</em> surrogate objective by moving the policy ratio far away from a value of one, but clipping prevents arbitrarily <em>increasing</em> the objective via the policy ratio. In this way, PPO de-incentivize large policy ratios so that our policy does not deviate too much from the old policy after training updates. </p><blockquote><p><em>&#8220;With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.&#8221;</em> - from [1]</p></blockquote><p>To more deeply understand the clipping logic of PPO, we can consider each of the four possible cases that can arise when optimizing the surrogate objective:</p><ul><li><p>Case #1 [<code>A &gt; 0</code>, <code>r_t(&#952;) &#8804; 1 + &#949;</code>]: advantage is positive&#8212;<em>this is an action that we want to reinforce</em>. Our policy ratio is below <code>1 + &#949;</code>, so we perform a normal policy gradient update to increase the probability of this action.</p></li><li><p>Case #2 [<code>A &gt; 0</code>, <code>r_t(&#952;) &gt; 1 + &#949;</code>]: advantage is positive again, but our policy ratio is greater than <code>1 + &#949;</code>. This means that this action is already more likely in the new policy relative to the old policy. The objective gets clipped, and the gradient with respect to further increases in the policy ratio is zero. This prevents the policy from making the action even more likely</p></li><li><p>Case #3 [<code>A &lt; 0</code>, <code>r_t(&#952;) &#8805; 1 - &#949;</code>]: advantage is negative&#8212;<em>this is an action we want to negatively reinforce (i.e., decrease probability)</em>. Our policy ratio is above <code>1 - &#949;</code>, so we perform a normal policy gradient update to decrease the probability of this action. </p></li><li><p>Case #4 [<code>A &lt; 0</code>, <code>r_t(&#952;) &lt; 1 - &#949;</code>]: advantage is negative again, but our policy ratio is less than <code>1 - &#949;</code>. This means that this action is already less likely in the new policy relative to the old policy. The objective gets clipped, and the gradient with respect to further decreases in the policy ratio is zero. This prevents the policy from making the action even less likely.</p></li></ul><p>The policy ratio is computed between the current and old policies. The old policy is updated to match the current policy each time new data is sampled in PPO. In the context of LLMs, we perform 2-4 gradient updates (or sometimes more) [2] for each batch of data, <em>so</em> <em>the old model is updated frequently</em>. The clipping operation in PPO, therefore, maintains a trust region for a particular batch of data.</p><p><strong>KL divergence.</strong> When training LLMs with PPO, we usually incorporate the KL divergence between the current policy and a reference policy&#8212;<em>usually some policy from before RL training begins (e.g., the SFT model)</em>&#8212;into the training process. This added KL divergence term penalizes the policy from becoming too different from the reference policy, which has a regularizing effect. We compute KL divergence per token by comparing the token probability distributions outputted by the two LLMs for each token within the sequence. Details on how exactly the KL divergence is computed in practice can be found <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MMrI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png" width="587" height="122.9635989010989" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:305,&quot;width&quot;:1456,&quot;resizeWidth&quot;:587,&quot;bytes&quot;:188292,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MMrI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 424w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 848w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1272w, https://substackcdn.com/image/fetch/$s_!MMrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc3d5004-2390-489f-995a-e0245c174535_2534x530.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Incorporating KL divergence into the reward</figcaption></figure></div><p>There are two common ways of adding the KL divergence into PPO training. First, we can directly subtract the KL divergence from the reward in RL; see above. Alternatively, we can add the KL divergence as a penalty term to the RL training objective as shown below. In both cases, we simply want to maximize rewards without making our new policy too different from the reference. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kyeM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" width="657" height="118.67513736263736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1456,&quot;resizeWidth&quot;:657,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Incorporating a KL penalty into the RL training objective</figcaption></figure></div><p>Such a KL divergence term is almost universally used in RL training for LLMs, though the exact implementation varies. Both of the approaches outlined above have been used successfully. However, capturing the KL divergence via a penalty term in the training objective is probably more common (and a bit simpler). </p><p><strong>The critic.</strong> Recall that the advantage function is defined as the difference between the state-action value function and the value function. In PPO, we estimate the state-action value function&#8212;<em>the expected reward for taking a specific action in a given state</em>&#8212;by using the actual reward observed for a trajectory. The value function, in contrast, is typically estimated using a learned model; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!noKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!noKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 424w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 848w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 1272w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!noKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png" width="494" height="175.41071428571428" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:517,&quot;width&quot;:1456,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:168163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!noKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 424w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 848w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 1272w, https://substackcdn.com/image/fetch/$s_!noKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55141cda-9010-48ea-ba62-5cd56e9bd814_1772x629.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For example, we can create a separate copy of our policy, or&#8212;<em>for better parameter efficiency</em>&#8212;add a dedicated value head that shares weights with the policy to predict the value function. This learned value function is often referred to as a value model or critic. Taking a partial response as input, the critic predicts the expected final reward for every token position within the sequence; see below.</p><p><strong>Critic versus reward model.</strong> In the context of LLMs, we predict the reward with a reward model. Additionally, most LLMs are trained using outcome supervision, meaning that a reward is only assigned after the model has generated a complete response (i.e., after the <code>&lt;eos&gt;</code> token has been outputted). The critic and reward model are similar in that they are both learned models&#8212;<em>usually another copy of our LLM policy</em>&#8212;that predict rewards. However, the critic predicts expected rewards given a partial completion as input, while the reward model typically predicts the reward received by an entire response; see below. Going further, the reward model is fixed throughout RL training, while the critic is continually updated. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fXOv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png" width="1456" height="479" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fXOv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 424w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 848w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1272w, https://substackcdn.com/image/fetch/$s_!fXOv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb8133ba-f772-44f5-bfbc-19e800a842cc_1732x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Value model versus reward model</figcaption></figure></div><p><strong>Critic training.</strong> The value function is on-policy&#8212;<em>it is dependent upon the current parameters of our policy</em>. Unlike <a href="https://cameronrwolfe.substack.com/p/reward-models">reward models</a> which are fixed at the beginning of RL training, the critic is trained alongside the LLM in each policy update to ensure that its predictions remain on-policy&#8212;<em>this is called an actor-critic setup</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. This is accomplished by adding an extra <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean-squared error (MSE) loss</a>&#8212;<em>between the rewards predicted by the critic and actual rewards</em>&#8212;to the surrogate loss. </p><p><strong>PPO implementation.</strong> To make each of these ideas more complete, we have implemented PPO in PyTorch pseudocode below. In this implementation, we see several of the key ideas we have discussed so far, such as:</p><ul><li><p>Computing the KL divergence between the current policy and a reference model, then directly subtracting this KL divergence from our reward.</p></li><li><p>Using a learned critic to compute the advantage (and training this critic via an MSE loss alongside the policy itself). </p></li><li><p>Computing the policy ratio with respect to the old model. The script below performs a single policy update, but PPO usually performs several (i.e., 2-4 in the case of LLMs [2]) policy updates for each batch of data. The &#8220;old&#8221; model in the policy ratio is the model from before the first update for a batch. </p></li><li><p>Computing the full (clipped) PPO loss. We take the negative of this loss because PyTorch performs gradient descent (not ascent) by default. </p></li><li><p>Aggregating or averaging the token-level PPO loss across a batch of sequences. There are many ways to aggregate the loss in a batch, and the approach used can significantly impact results [2]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p></li></ul><p>One interesting detail we see here is that&#8212;<em>despite the PPO loss using token probabilities and not log probabilities</em>&#8212;we choose to work with token log probabilities and exponentiate them instead of using raw probabilities when computing the policy ratio. This is a commonly-used numerical stability trick. </p><pre><code><code>import torch
import torch.nn.functional as F

# constants
kl_beta = 0.1
critic_weight = 0.5
ppo_eps = 0.2

# sample prompt completions and rewards
with torch.no_grad():
    completions = LLM.generate(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G, 1)

# create a padding mask from lengths of completions in batch
completion_mask = &lt;... mask out padding tokens ...&gt;

# compute value function / critic output
values = CRITIC(completions)  # (B*G, L) - predicted reward per token!

# get policy logprobs for each action
llm_out = LLM(completions)
per_token_logps = F.log_softmax(llm_out, dim=-1)  # (B*G, L)

# get reference logprobs for each action
ref_out = REF(completions)
ref_per_token_logps = F.log_softmax(ref_out, dim=-1)  # (B*G, L)

# compute KL divergence between policy and reference policy
kl_div = per_token_logps - ref_per_token_logps

# directly subtract KL divergence from rewards
# NOTE: KL div is per token, so reward becomes per token and reward
# for all tokens (besides last token) is just kl divergence.
# Reward for last token is sum of outcome reward and KL div.
rewards -= kl_beta * kl_div # (B*G, L)

# compute the advantage - simple approach
advantage = rewards - values.detach()  # (B*G, L)

# compute the policy ratio
# NOTE: old_per_token_logps must be persisted during first policy
# update for this batch of data and re-used in each subsequent update
policy_ratio = torch.exp(
    per_token_logps - old_per_token_logps,
)  # (B*G, L)
clip_policy_ratio = torch.clamp(
    policy_ratio,
    min=1.0 - ppo_eps,
    max=1.0 + ppo_eps,
)

# compute the ppo loss
ppo_loss = torch.min(
    advantage * policy_ratio,
    advantage * clip_policy_ratio,
)  # (B*G, L)
ppo_loss = -ppo_loss

# combine ppo loss and critic mse loss
critic_loss = ((rewards - values) ** 2)  # (B*G, L)
loss = ppo_loss + critic_weight * critic_loss

# aggregate the loss across tokens (many options exist here)
loss = ((loss * completion_mask).sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# perform policy gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()</code></code></pre><p><strong>Experiments.</strong> The LLM setting is not considered in [1], as PPO was proposed during the heyday of <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html">DeepRL</a>&#8212;<em>well before the proliferation of LLMs</em>. Understanding the experimental results in [1] is nonetheless useful for gaining intuition on the mechanics of PPO. In these experiments, PPO is used to train fully-connected <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multi-layer perceptrons</a> (MLPs)  from scratch on a variety of robotics and video game tasks. The policy and critic are kept separate (i.e., no parameter sharing). </p><p>First, authors use several simulated robotics tasks from the <a href="https://github.com/Farama-Foundation/Gymnasium">OpenAI Gym</a> to test different formulations of the surrogate loss in PPO:</p><ul><li><p>The clipped objective (standard for PPO).</p></li><li><p>The unclipped objective.</p></li><li><p>The unclipped objective with (adaptive<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>) KL divergence.</p></li></ul><p>Unlike the typical RL training setup for LLMs, these experiments compute the KL divergence between the current policy and the old model, with the goal of testing whether this approach works better than the standard PPO clipping mechanism. Ordinarily, when training LLMs with PPO, the KL divergence is computed between the current policy and a reference model (e.g., the SFT model), not the old model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. However, in these experiments, using a reference model for the KL divergence is not possible because we are training models from scratch&#8212;<em>there is no pretrained model to serve as a reference</em>. </p><p>The results from testing these different objectives are outlined below&#8212;<em>the clipped objective for PPO stabilizes training and clearly outperforms the other options</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CHQh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CHQh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 424w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 848w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 1272w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CHQh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png" width="1456" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CHQh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 424w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 848w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 1272w, https://substackcdn.com/image/fetch/$s_!CHQh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cc9a21-11e9-4c34-8d72-0576cde83e94_2086x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>PPO is also tested on 49 games in the <a href="https://arxiv.org/abs/1207.4708">Atari gameplay domain</a> and compared to strong baseline RL algorithms like <a href="https://arxiv.org/abs/1602.01783">A2C</a> and <a href="https://arxiv.org/abs/1611.01224">ACER</a>. Performance is measured based on two metrics:</p><ol><li><p>Average reward throughout training (favors faster learning).</p></li><li><p>Average reward over the last 100 training steps (favors final quality / reward). </p></li></ol><p>For each of these metrics, we compute a &#8220;win rate&#8221;, which captures the number of times each algorithm achieves the top score across all Atari games. The results of these experiments are shown below, where we see that baseline algorithms like ACER perform similarly to or better than PPO but learn much slower. <em>PPO stabilizes training, performs well, and yields an improvement in sample complexity</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SgN4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SgN4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 424w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 848w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 1272w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SgN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png" width="1456" height="401" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:401,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:163361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SgN4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 424w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 848w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 1272w, https://substackcdn.com/image/fetch/$s_!SgN4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79fdf5d-6d9e-4f9c-b87e-885fe063de66_1814x499.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation (GAE)</a> [3]</h4><p>The advantage tells us how much better a given action is compared to the average action in a given state: <code>A(s_t, a_t) = Q(s_t, a_t) - V(s_t)</code>. The value function in this formulation is estimated by our critic, but we have not yet discussed in detail how the advantage function can be computed. In PPO, the advantage function is estimated on a per-token (or action) basis. There are two main approaches that can be used to compute the advantage, and these approaches form the basis for most other techniques.</p><p><strong>(1) Monte Carlo (MC). </strong>An MC estimate of the advantage relies upon the actual reward observed for the full trajectory. Namely, the advantage is computed as the difference between the cumulative reward for the full trajectory <code>R(s_t)</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> and the value function for the current state <code>V(s_t)</code>, as predicted by the critic.</p><p>So far, our discussions of PPO have assumed an MC approach for estimating the advantage. The MC estimate has low bias because it relies on the actual reward observed for the trajectory (exact information), but MC estimates also have high variance. Therefore, we need to take many samples and make a sufficient number of observations to yield an accurate advantage estimate&#8212;<em>this can be expensive</em>.</p><p><strong>(2) Temporal Difference (TD).</strong> The TD residual uses per-token value predictions from the critic to form a one-step estimate of the advantage, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A4K-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A4K-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 424w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 848w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1272w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png" width="509" height="264.63804945054943" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:509,&quot;bytes&quot;:168566,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A4K-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 424w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 848w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1272w, https://substackcdn.com/image/fetch/$s_!A4K-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c1e98c7-da70-4da6-a365-3b2fe9cd2230_1723x896.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Temporal difference (TD) residual</figcaption></figure></div><p>This TD residual analyzes how much the expected reward changes after predicting a single token and observing the actual reward for that action<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. We subtract the value for the current state <code>V(s_t)</code> from the sum of:</p><ol><li><p>The observed reward for the current state <code>r_t</code>.</p></li><li><p>The (discounted) value of the next state <code>V(s_{t+1})</code>.</p></li></ol><p>Similarly to <code>V(s_t)</code>, the sum of these two terms captures the expected return at state <code>s_t</code>. However, the reward for the current state is captured via the actual observed reward <code>r_t</code> rather than being estimated by the critic. Therefore, the difference between these terms is capturing how much better the actual reward observed at state <code>s_t</code> is than expected&#8212;<em>this is the advantage</em>!</p><p>By using the actual reward <code>r_t</code>, we incorporate some exact information into our advantage estimate&#8212;<em>the terms in the estimate come partly from our critic and partly from real rewards</em>. Using such token-level rewards to estimate the advantage lowers the variance of the policy gradient. If our value function were exact, then the TD residual would also form an unbiased advantage estimate. Unfortunately, we do not have access to the ground truth value function, so we train a critic to estimate the value function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>. Because accurately anticipating final rewards from a partial response is difficult, <em>the TD residual is biased.</em></p><p><strong>N-step estimators. </strong>The TD residual analyzes the difference between actual and expected reward for a single step. However, we can generalize this idea to capture any number of steps. As shown below, an <code>N</code>-step advantage estimator has a similar structure to the TD residual, but it incorporates real rewards for <code>N</code> states, where <code>N</code> can be greater than one.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_U8s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_U8s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 424w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 848w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1272w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png" width="696" height="217.02197802197801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:1456,&quot;resizeWidth&quot;:696,&quot;bytes&quot;:249508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_U8s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 424w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 848w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1272w, https://substackcdn.com/image/fetch/$s_!_U8s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18ae75ed-997b-4654-b383-dda56a8d9b2e_2298x716.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><code>N</code>-step advantage estimators</figcaption></figure></div><p>Similarly to the single-step TD residual, advantage estimators with lower values of <code>N</code> have low variance but high bias. As we increase the value of <code>N</code>, however, we are incorporating more exact reward information into the advantage estimate, thus lowering the bias (and, in turn, increasing variance).</p><p>Taking this further, we can even recover an MC estimate by setting <code>N</code> equal to the total number of steps in the trajectory! This setting of <code>N</code> simply yields the difference between cumulative reward and the value of the current state <code>V(s_t)</code>. Therefore, different settings of <code>N</code> yield different tradeoffs in bias and variance, spanning all the way from the single-step TD residual (high bias, low variance) to an MC estimate (high variance, low bias).</p><div class="pullquote"><p><em>&#8220;GAE is an alternate method to compute the advantage for policy gradient algorithms that better balances the bias-variance tradeoff. Traditional single-step advantage estimates can introduce too much bias, while using complete trajectories often suffer from high variance. GAE works by combining two ideas &#8211; multi-step prediction and weighted running average (or just one of these).&#8221; - from [2]</em></p></div><p><strong>Generalized Advantage Estimation (GAE)</strong>, which is the most commonly-used approach for estimating the advantage with PPO, makes use of <code>N</code>-step advantage estimates. Instead of choosing a single value of <code>N</code>, however, GAE uses all values of <code>N</code> by taking an average of <code>N</code>-step advantage estimates with different values of <code>N</code>. This is done by introducing a mixing parameter <code>&#955;</code> for GAE as shown below<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v3wn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v3wn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 424w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 848w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1272w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png" width="1456" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:272434,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v3wn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 424w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 848w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1272w, https://substackcdn.com/image/fetch/$s_!v3wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff11ed641-c3be-442a-ad17-b41072a721a8_2015x843.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GAE formulation</figcaption></figure></div><p>In this formulation, setting <code>&#955; = 0</code> yields a single-step TD residual because only the first term in the sum receives a non-zero weight. Additionally, a setting of <code>&#955; = 1</code> recovers the MC estimate. To see this, we can expand the definition of each TD residual in the sum, yielding the difference in cumulative discounted rewards and the value function of the current state <code>V(s_t)</code>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DRfY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DRfY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 424w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 848w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 1272w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DRfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png" width="1456" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153117,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DRfY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 424w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 848w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 1272w, https://substackcdn.com/image/fetch/$s_!DRfY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdc295ca-a904-4885-85b2-59968c744cc0_2872x674.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The benefit of GAE is that the value of <code>&#955; &#8712; [0, 1]</code> controls the bias variance tradeoff. As we increase the value of <code>&#955;</code>, more exact reward information is used in the advantage estimate, thus lowering the bias (but increasing variance). Similarly, we can use lower values of <code>&#955;</code> to reduce variance at the cost of higher bias.</p><p><strong>Outcome rewards. </strong>When we are working with LLMs, we usually use an outcome reward setup, which simplifies GAE. The reward is always zero<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>, unless we are at the final step of the trajectory. In this scenario, most of the TD residual terms in our GAE summation are simply the difference in (discounted) value functions between two time steps <code>&#947;V(s_{t + 1}) - V(s_t)</code>. The final term in the summation contains the actual outcome reward observed for the trajectory.</p><p><strong>GAE implementation.</strong> To make the concept of GAE more concrete, let&#8217;s examine a real-world example adapted from AI2&#8217;s <a href="https://github.com/allenai/open-instruct">OpenInstruct</a> library. The full PPO training script, available <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/ppo2.py">here</a>, is a great resource for learning the details of PPO in a production-grade training setting. The GAE component of this script is shown below with some additional comments for clarity. We can efficiently compute the GAE recursion by iterating through the sequence in reverse order.</p><pre><code>import torch

# store advantages in reverse order while iterating thru sequence
advantages_reversed = []

# iterate backward to compute GAE recursion
lastgaelam = 0
gen_length = responses.shape[1]
for t in reversed(range(gen_length)):
    if t &lt; gen_length - 1:
        # get value model prediction for time t + 1
        nextvalues = values[:, t + 1]
    else:
        # no values predicted beyond end of sequence
        nextvalues = 0.0

    # compute TD residual at time t    
    delta = rewards[:, t] + gamma * nextvalues - values[:, t]

    # add to the discounted sum of TD residuals for GAE    
    lastgaelam = delta + gamma * lam * lastgaelam

    # store the advantage for step t in our list
    advantages_reversed.append(lastgaelam)

# put the list of advantages in the correct order
advantages = torch.stack(advantages_reversed[::-1], axis=1)</code></pre><h2>Using PPO for LLMs</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJn6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>There are two different types of RL training that are commonly used to train LLMs (shown above):</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a human preference <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.</p></li></ul><p>These RL training techniques differ mainly in how they derive the reward for training, but other details of the algorithms are mostly similar. As depicted below, they both operate by generating completions over a set of prompts, computing the reward for these completions, and using the rewards to derive a <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">policy update</a>&#8212;<em>or an update to the LLM&#8217;s parameters</em>&#8212;with an RL optimizer (e.g., PPO). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:&quot;[animate output image]&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Visual walkthrough of RL training for LLMs</figcaption></figure></div><p>RLHF was the original form of RL explored by LLMs like InstructGPT [8], the predecessor to ChatGPT. Early research on RLHF for LLMs used PPO as the default RL optimizer, which ultimately made PPO a standard choice for training LLMs with RL. RLVR was introduced <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">more recently</a>, and most works in this space use <a href="https://arxiv.org/abs/2402.03300">GRPO</a> as the underlying RL optimizer instead of PPO. </p><blockquote><p><em>&#8220;PPO has been positioned as the canonical method for RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning.&#8221;</em> - from [9]</p></blockquote><p><strong>Downsides of PPO.</strong> Though it quickly became the default RL optimizer for RLHF, PPO is a complex actor-critic algorithm with high compute and memory overhead, as well as many low-level implementation complexities. The memory overhead of PPO is high because we keep four copies of the LLM in memory:</p><ol><li><p>The policy.</p></li><li><p>The reference policy.</p></li><li><p>The critic.</p></li><li><p>The reward model (if we are using a reward model).</p></li></ol><p>Additionally, we are updating the parameters of our critic alongside the policy itself and running inference for all of these models simultaneously, leading to high compute costs. Beyond memory and compute overhead, there are also many implementation details that we must carefully consider during PPO training:</p><ul><li><p>How do we initialize the critic and reward model? What training settings should we adopt for these models?</p></li><li><p>What value of <code>&#949;</code> should we use for clipping in PPO? </p></li><li><p>Which model should we use as our reference model for the KL divergence? </p></li><li><p>How many policy updates should we perform for a batch of data?</p></li><li><p>Do we add the KL divergence as a penalty to the loss or directly incorporate it into the reward function? What scaling factor <code>&#946;</code> should we use?</p></li><li><p>How should we weight the critic&#8217;s loss relative to the main PPO loss?</p></li><li><p>Should we use GAE? What setting should we use for <code>&#955;</code>?</p></li></ul><p>Each of these choices may impact the results of RL training! PPO is a sensitive algorithm that is prone to instability&#8212;<em>we may spend a lot of compute and time on training a model that ultimately performs poorly due to an incorrect hyperparameter setting</em>. For these reasons, simpler RL algorithms like <a href="https://cameronrwolfe.substack.com/p/reinforce">REINFORCE</a> and <a href="https://arxiv.org/abs/2402.03300">GRPO</a>&#8212;<em>or even RL-free techniques like <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">DPO</a></em>&#8212;have become popular alternatives to PPO. </p><p><strong>PPO for LLMs.</strong> In this final section, we will take what we have learned and study PPO specifically in the context of LLM training. We will focus particularly on the foundational works that were the first to use PPO for training LLMs [5, 8]&#8212;<em>this research laid the groundwork for the modern LLM boom shortly after</em>. While studying these papers, we will emphasize implementation details and practical lessons that are necessary to obtain a working PPO implementation.</p><h4><strong><a href="https://arxiv.org/abs/2009.01325">Learning to Summarize from Human Feedback</a> [5]</strong></h4><p>Abstractive summarization&#8212;<em>or using models to create a human-readable, concise summary of a piece of text&#8212;</em>has been studied for a long time. Prior to the rise of LLMs and RLHF, most papers on this topic trained language models using a <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised learning</a> approach with human-written reference summaries and evaluated these models using traditional metrics like the <a href="https://cameronrwolfe.substack.com/i/138218863/evaluating-language-models-and-the-rouge-score">ROUGE score</a>. </p><p>These approaches can work well, but supervised learning and ROUGE are both proxies for what is actually desired&#8212;<em>a model that writes high-quality summaries</em>. In [5], authors solve this problem by replacing supervised learning with RLHF. Such an approach allows us to finetune language models to produce better summaries by directly using human feedback on model outputs as a training signal. </p><p><strong>PPO for summarization.</strong> Authors in [5] are commonly credited with proposing the first RLHF framework for LLM finetuning. The proposed approach allows us to optimize an LLM based on the quality of its responses, as assessed by human annotators. Beginning with a pretrained LLM, we can iteratively:</p><ol><li><p>Collect human <a href="https://cameronrwolfe.substack.com/i/166169560/the-bradley-terry-model-of-preference">preference data</a>.</p></li><li><p>Train a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> over this preference data.</p></li><li><p>Finetune our LLM with RL using this reward model. </p></li></ol><p>Notably, authors in [5] adopt PPO as their underlying RL optimizer, which led PPO to become the common choice in subsequent RLHF research. With this RL training strategy, we can train an LLM to produce summaries that surpass the quality of human summaries and are even better than those produced by larger LLMs trained with a supervised learning approach; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bjdU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bjdU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 424w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 848w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 1272w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bjdU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png" width="656" height="394.68131868131866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:656,&quot;bytes&quot;:189610,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bjdU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 424w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 848w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 1272w, https://substackcdn.com/image/fetch/$s_!bjdU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F377524f4-cff7-44f9-b717-ed1e842b50bb_1612x970.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>SFT stage. </strong>In [5], the LLM is first trained using <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning</a> over human reference summaries for a single epoch, producing a supervised baseline that is later finetuned via RLHF. The methodology for RLHF proposed in [5]&#8212;<em>as illustrated in the figure shown below</em>&#8212;is tailored to the summarization task. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oeIY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oeIY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 424w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 848w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oeIY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png" width="1456" height="897" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:897,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280506,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oeIY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 424w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 848w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!oeIY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc713702e-ca1c-4759-bff4-b1dedfdf1bbf_1650x1016.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Preferences and reward models.</strong> In [5], a preference dataset is constructed by:</p><ul><li><p>Grabbing a textual input to summarize&#8212;<em>this is our prompt</em>. </p></li><li><p>Producing many summaries of the input using several different policies&#8212;<em>these are different responses to the same prompt</em>. </p></li><li><p>Sampling two summaries or responses for the prompt.</p></li><li><p>Asking a human annotator to identify the better of the two summaries.</p></li></ul><p>Authors in [5] collect this preference data in large batches. Once we have finished collecting a new batch of preference data, we train a reward model on the data such that it accurately predicts human preference scores given an LLM-generated summary. Then, we use this reward model to finetune our policy with PPO.</p><p><strong>A</strong> <strong>KL divergence</strong> term is used for PPO in [5] to minimize divergence from the SFT model. Interestingly, authors in [5] were not the first to use this strategy&#8212;<em>it was actually adopted from <a href="https://arxiv.org/abs/1907.00456">prior work</a>. </em>The KL divergence is directly subtracted from the rewards instead of being added to the PPO loss as a penalty term. We see in [5] that adding the KL divergence into RL training helps to prevent the model&#8217;s summaries from becoming too different from those seen during training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZjlA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZjlA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 424w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 848w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 1272w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZjlA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png" width="620" height="293.81868131868134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:690,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:255931,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZjlA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 424w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 848w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 1272w, https://substackcdn.com/image/fetch/$s_!ZjlA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc088796c-52eb-45e5-afbc-195116ec5d1f_1612x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Experiments. </strong>In [5], large pretrained models matching the style of GPT-3 with 1.3B to 6.7B parameters are finetuned over the <a href="https://huggingface.co/datasets/openai/summarize_from_feedback">TL;DR dataset</a>. This dataset, which contains over three million posts from Reddit with author-written summaries, is filtered to only 120K high-quality examples; see above. Models are first trained using SFT&#8212;<em>these supervised models are also used as baselines across experiments</em>&#8212;and then further finetuned with RLHF. Given that summary length can impact the resulting quality score, the authors in [5] constrain generated summaries to 48 tokens and finetune the model accordingly.</p><p>Finetuning language models with human feedback outperforms a variety of strong English summarization baselines. Notably, the 1.3B summarization model outperforms a 10&#215; larger model trained with SFT, and the 6.7B summarization model performs even better than the 1.3B model, revealing that summarization quality improves with model scale. Furthermore, we see that summarization models trained via RLHF generalize better to new domains. In particular, the models in [5] are applied to summarizing news articles&#8212;<em>a domain outside of the training data</em>&#8212;and found to perform well without further finetuning; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HYOl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HYOl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 424w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 848w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 1272w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HYOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png" width="1456" height="757" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261660,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HYOl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 424w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 848w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 1272w, https://substackcdn.com/image/fetch/$s_!HYOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0d4ac2-cee0-464b-ba5d-3b278f1b1b9c_1628x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>From here, summarization models are evaluated in terms of:</p><ul><li><p><em>Coverage</em>: the summary covers all information from the original post.</p></li><li><p><em>Accuracy</em>: statements in the summary are accurate.</p></li><li><p><em>Coherence</em>: the summary is easy to read on its own.</p></li><li><p><em>Quality</em>: the overall quality of the summary is good.</p></li></ul><p>When evaluated in this manner, we see that summarization models trained via RLHF benefit the most in terms of coverage, while coherence and accuracy are only slightly improved compared to supervised baseline models; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d5Qe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d5Qe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 424w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 848w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 1272w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d5Qe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png" width="286" height="290.3202416918429" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:662,&quot;resizeWidth&quot;:286,&quot;bytes&quot;:71869,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!d5Qe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 424w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 848w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 1272w, https://substackcdn.com/image/fetch/$s_!d5Qe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f3213a-8fd2-4703-8987-b2cfcbc5880a_662x672.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Beyond summarization. </strong>Although RLHF was explored only in the context of summarization in [5], the authors of this paper had an incredible amount of foresight about what was to come. The approach proposed in [5] later became a standard part of LLM post-training, as we will soon see with InstructGPT [8].</p><blockquote><p><em>&#8220;The methods we present in this paper are motivated in part by longer-term concerns about the misalignment of AI systems with what humans want them to do. When misaligned summarization models make up facts, their mistakes are fairly low-risk and easy to spot. However, as AI systems become more powerful and are given increasingly important tasks, the mistakes they make will likely become more subtle and safety-critical, making this an important area for further research.&#8221;</em> - from [1] </p></blockquote><p>Interestingly, authors in [5] explicitly state their intent to leverage the proposed methodology to better align LLMs to human desires in the long term. This statement was made over two years prior to the proposal of ChatGPT! Work in [5] was a building block for major advancements in AI that were yet to come.</p><h4><strong><a href="https://arxiv.org/abs/2403.17031">The N+ Implementation Details of RLHF with PPO</a> [4]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Om25!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Om25!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 424w, https://substackcdn.com/image/fetch/$s_!Om25!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 848w, https://substackcdn.com/image/fetch/$s_!Om25!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!Om25!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Om25!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png" width="1456" height="950" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:282620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Om25!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 424w, https://substackcdn.com/image/fetch/$s_!Om25!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 848w, https://substackcdn.com/image/fetch/$s_!Om25!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 1272w, https://substackcdn.com/image/fetch/$s_!Om25!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdf3dce4-738f-47c5-a5e3-f12c75887538_1864x1216.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>There are many moving parts in PPO training, including multiple copies of the LLM (i.e., policy, reference, critic, and reward model) and various hyperparameter settings that must be carefully tuned to ensure stable training. For these reasons&#8212;<em>and due to computational expense</em>&#8212;reproducing RL training results is difficult.</p><blockquote><p><em>&#8220;It has proven challenging to reproduce OpenAI&#8217;s RLHF pipeline&#8230; for several reasons: 1) RL and RLHF have many subtle implementation details that can significantly impact training stability, 2) the models are challenging to evaluate&#8230; 3) they take a long time to train and iterate.&#8221; </em>- from [4]</p></blockquote><p>As a starting point for democratizing understanding of RL, authors in [4] focus on a simple setup&#8212;<em>OpenAI&#8217;s prior work on RLHF for summarization</em> [5]. Though many details are already provided in the original work, authors in [4] fully reproduce these results while enumerating all implementation details needed to arrive at a working PPO implementation. The TL;DR summarization task is simple relative to most modern RLHF pipelines. However, this study&#8212;<em>based on Pythia models [10] with 1B, 2.8B, and 6.8B parameters</em>&#8212;provides a clear and comprehensive view of key practical considerations when training an LLM with PPO. </p><p><strong>Dataset considerations.</strong> Authors in [4] enumerate around 20 practical details needed to obtain a working RLHF pipeline with PPO. Nearly half of these details are not related to PPO&#8212;<em>they focus on the training data</em>. For those who have worked with LLMs, this data emphasis should not come as a surprise: <em>data quality is the key determinant of success in all forms of LLM training, including RL</em>.</p><p>All experiments in [4] use the <a href="https://huggingface.co/datasets/CarperAI/openai_summarize_tldr">TL;DR summarization dataset</a> from OpenAI, which contains both an SFT and preference dataset. Some notable remarks about the data used for PPO in [4] include:</p><ul><li><p>There is a misalignment in completion lengths between the SFT and preference portion of the TL;DR dataset&#8212;<em>the preference data tends to have longer completions</em>.</p></li><li><p>Data must occasionally be truncated to fit within the fixed sequence length used in [4], but the authors choose to truncate at paragraph boundaries&#8212;<em>determined by newline characters</em>&#8212;instead of performing a hard truncation at the maximum sequence length.</p></li><li><p>All completions are followed by an <code>&lt;EOS&gt;</code> token. Authors in [4] emphasize that this <code>&lt;EOS&gt;</code> token must be different than the padding token used by the LLM. Otherwise, the loss for the <code>&lt;EOS&gt;</code> token will be masked with the other padding tokens, preventing the model from learning to properly complete each sequence with an <code>&lt;EOS&gt;</code> token.</p></li></ul><p><strong>Reward model.</strong> Several choices exist for initializing the reward model in RLHF. In [4], we initialize with the weights of the SFT model, which matches settings used in [5]. A randomly-initialized linear head that is used to predict the reward is then added to the reward model&#8217;s architecture before the model is trained for a single epoch over the available preference data.</p><p>An outcome reward setting is used in [4]. To extract the reward, a forward pass is performed on the full sequence, and we extract the reward prediction from the <code>&lt;EOS&gt;</code> token only. To teach the policy to consistently output sequences of reasonable length with a corresponding <code>&lt;EOS&gt;</code> token, the <strong>EOS trick</strong> is used, which assigns a reward of -1 to any sequence with no <code>&lt;EOS&gt;</code> token.</p><blockquote><p><em>&#8220;If the padding token does not exist, the extracted reward will then be logits corresponding to the last token of the sequence &#8211; if that token is not the EOS token, its reward won&#8217;t be used for PPO training&#8221;</em> - from [4]</p></blockquote><p>After the reward model is trained, authors follow the recommendation in [5] of <strong>normalizing rewards</strong> outputted by the model. Specifically, the reward model is used to predict rewards for the entire SFT dataset. Then, we compute the mean reward across this dataset and use this mean to center the average reward. In other words, this mean is subtracted as a bias from the reward model&#8217;s output, ensuring that rewards predicted over the SFT dataset have an average of zero. Normalizing the reward model&#8217;s output benefits training stability for PPO. </p><p><strong>Critic settings.</strong> We must also choose how to initialize the critic. In [4], the critic is initialized with the weights of the reward model at the beginning of PPO training. After all, <em>the value model is effectively a reward model that predicts the reward on a per-token basis</em>. Authors observe in [4] that the reward model&#8217;s predictions are usually negative for all tokens except the <code>&lt;EOS&gt;</code> token; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fBTb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fBTb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 424w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 848w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 1272w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fBTb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png" width="1456" height="418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:196664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fBTb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 424w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 848w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 1272w, https://substackcdn.com/image/fetch/$s_!fBTb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4cd7447-83f7-4f34-921a-41672d4c391c_1866x536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Therefore, the value estimated by the critic is negative for nearly every token at the start of PPO training. However, we see in [4] that warm starting the critic in this way helps to improve the initial stability of gradients during training.</p><p><strong>Reward and advantage whitening.</strong> In addition to normalizing rewards after training the reward model, many PPO implementations perform reward and advantage <a href="https://joelouismarino.github.io/posts/2017/08/statistical_whitening/">whitening</a>. An example implementation of the whitening operation is shown below, where the values can be a list of either rewards or advantages.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XoxA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XoxA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 424w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 848w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 1272w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XoxA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png" width="1456" height="284" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:284,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83121,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XoxA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 424w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 848w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 1272w, https://substackcdn.com/image/fetch/$s_!XoxA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9646db42-a84e-4dca-99a2-e585c053143c_1722x336.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>When whitening rewards, we usually do not shift the mean (i.e., <code>shift_mean = False</code> in the above code) so that we can retain the magnitude and sign of the rewards. However, the mean is usually shifted when whitening advantages. Based on results in [4], <em>whitening rewards and advantages does not seem to have a huge positive or negative performance impact on the resulting policy</em>. However, whitening is a common implementation detail in PPO. Usually, whitening is applied over the set of rewards or advantages within a batch of data.</p><blockquote><p><em>&#8220;Where normalization bounds all the values from the RM to be between 0 and 1, which can help with learning stability, whitening the rewards or the advantage estimates&#8230; can provide an even stronger boost to stability.&#8221;</em> - from [2]</p></blockquote><p><strong>Beware of dropout.</strong> We must also be sure to avoid using dropout in PPO. Dropout adds noise to the model&#8217;s forward pass, making the computation of policy ratios and KL divergence unreliable. This implementation detail can cause optimization issues and tends to be impactful&#8212;<em>dropout is a perfect example of small but important practical details in PPO</em>. For example, the <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/ppo2.py">OpenInstruct PPO script</a> explicitly disables dropout in the policy, critic, reference, and reward models. </p><p><strong>Final results. </strong>After enumerating various practical choices and hyperparameter settings, the policies in [4] successfully replicate the original results of [5]. PPO models outperform those trained with SFT, and there are clear scaling trends that can be observed (i.e., larger models achieve better performance metrics) for SFT models, reward models, and the final RL policies. Additionally, the preference rate of the RL policies over human reference summaries&#8212;<em>as predicted by a GPT-3.5-based LLM judge</em>&#8212;scales predictably with model size; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y_F0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y_F0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 424w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 848w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 1272w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:451177,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/175107358?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y_F0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 424w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 848w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 1272w, https://substackcdn.com/image/fetch/$s_!y_F0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63af44b0-f8ab-4b8a-9872-276a6d78726f_2462x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2203.02155">Training language models to follow instructions with human feedback</a> [8]</strong></h4><p>Going beyond the summarization domain, authors in [8] explore the use of RLHF for language model <a href="https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-imitation">alignment</a> by directly learning from human feedback. The resulting model, called InstructGPT, is the sister model and predecessor to ChatGPT. Since this model is outlined and explained in detail in [8], the work provides significant insight into how early LLMs at OpenAI were trained.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZdHw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZdHw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 424w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 848w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 1272w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZdHw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png" width="1456" height="882" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:882,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195101,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZdHw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 424w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 848w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 1272w, https://substackcdn.com/image/fetch/$s_!ZdHw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45180b88-a11e-42e8-8910-ceca2c3b447a_1618x980.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Following an approach similar to [5], we start with a set of prompts that are either written by human annotators or collected from OpenAI&#8217;s API. We can then have annotators write responses to these prompts and finetune a pretrained LLM&#8212;<em><a href="https://cameronrwolfe.substack.com/i/88082618/language-models-are-few-shot-learners">GPT-3</a> in particular</em>&#8212;over these examples using SFT. Using this model, we can then collect comparison data by asking humans to select their preferred outputs from the LLM and apply the same RLHF process outlined in [5] for finetuning. As shown above, the resulting model is heavily preferred by humans and much better at following detailed instructions provided within the prompt.</p><blockquote><p><em>&#8220;Making language models bigger does not inherently make them better at following a user&#8217;s intent.&#8221;</em> - from [8]</p></blockquote><p><strong>The alignment process. </strong>Pretrained LLMs have a number of undesirable properties that we want to fix during post-training; e.g., hallucinations or an inability to follow detailed instructions. To fix these issues, we align the LLM in [8] according to the following set of criteria:</p><ul><li><p><em>Helpful</em>: follows the user&#8217;s instructions and infers intention from <a href="https://cameronrwolfe.substack.com/i/117151147/few-shot-learning">few-shot prompts</a> or other patterns.</p></li><li><p><em>Honest</em>: makes correct factual statements about the world.</p></li><li><p><em>Harmless</em>: avoids harmful outputs, such as those that denigrate a protected class or contain sexual/violent content.</p></li></ul><p>Using RLHF, we can teach an LLM to reflect each of these qualities within its output. Specifically, this is done by constructing preference pairs where the preferred responses are chosen based upon adherence to these criteria.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ddkD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ddkD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 424w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 848w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 1272w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ddkD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png" width="1456" height="782" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381494,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ddkD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 424w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 848w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 1272w, https://substackcdn.com/image/fetch/$s_!ddkD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ee233ce-ea11-4928-bcbc-131c5fdc2f2f_1732x930.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>More on RLHF. </strong>Authors in [8] curate a team of 40 human annotators, who are screened with a test to judge their annotation quality, to collect preference data for the LLM. The approach for RLHF used in [8] matches the approach used in [5] almost completely. Using a pretrained LLM and a set of prompts for finetuning, the alignment process proceeds according to the following steps:</p><ol><li><p>Collect human demonstrations of responses for each prompt.</p></li><li><p>Train the model in a supervised fashion over human demonstrations.</p></li><li><p>Collect preference data.</p></li><li><p>Train a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>.</p></li><li><p>Optimize the underlying LLM or policy with PPO.</p></li><li><p>Repeat steps 3-5.</p></li></ol><p>The distribution of prompts used for finetuning in [8] is outlined in the table below. For SFT, a dataset of over 13K prompt and response pairs is constructed. The reward model is trained over 33K prompts, while a dataset of size 31K is used for finetuning with PPO. Unlike [5], human annotators are shown 4-9 responses to a prompt (i.e., instead of two) when collecting comparison data, allowing them to quickly rank responses and generate larger amounts of comparison data more efficiently. However, <em>later work on RLHF largely abandoned this approach in favor of binary preferences</em>. The dataset used in [8] is also 96% English.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xMFU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xMFU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 424w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 848w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 1272w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xMFU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png" width="1456" height="635" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180037,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xMFU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 424w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 848w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 1272w, https://substackcdn.com/image/fetch/$s_!xMFU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9b979ad-bd64-47c4-bfe7-64890b661ba9_1660x724.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Similarly to [5], a KL divergence term between the policy and the SFT model is directly subtracted from the reward to avoid drifting too far away from the SFT model. Additionally, extra pretraining updates are &#8220;mixed in&#8221; to the RLHF optimization process, which authors find to help with maintaining the model&#8217;s performance across various benchmarks. These pretraining updates, which use a supervised loss, are simply added to the PPO loss used during RL. </p><blockquote><p><em>&#8220;We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax&#8212;an additional cost for aligning the model.&#8221;</em> - from [2]</p></blockquote><p><strong>Experimental findings.</strong> In [8], authors train three models with 1.3B, 6B, and 175B (i.e., same as <a href="https://cameronrwolfe.substack.com/p/language-model-scaling-laws-and-gpt">GPT-3</a>) parameters. From these experiments, we learn that human annotators prefer InstructGPT outputs over those of GPT-3, even for models with 10&#215; fewer parameters; see below. This result is similar to observations in [5], where finetuning via RLHF enables much smaller models to outperform larger models trained in a supervised manner.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BTzq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BTzq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 424w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 848w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 1272w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BTzq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png" width="588" height="587.1288888888889" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1348,&quot;width&quot;:1350,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:271168,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!BTzq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 424w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 848w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 1272w, https://substackcdn.com/image/fetch/$s_!BTzq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08415ad7-db55-4f46-8415-2fb3da1c9ab6_1350x1348.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Notably, outputs from InstructGPT-1.3B are preferred to those of GPT-3, which has 100&#215; more parameters. Additionally, we see that InstructGPT-175B produces outputs that are preferred to GPT-3 85% of the time. Going further, InstructGPT models are found to more reliably follow explicit constraints and instructions provided by a human user within the model&#8217;s prompt; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JB4X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JB4X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 424w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 848w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 1272w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JB4X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231677,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JB4X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 424w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 848w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 1272w, https://substackcdn.com/image/fetch/$s_!JB4X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc9280f9-a159-4e81-ab17-86faf28f47ba_1876x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Compared to pretrained and supervised models, InstructGPT is also found to be:</p><ul><li><p>More truthful.</p></li><li><p>Slightly less toxic.</p></li><li><p>Generalizable to instructions beyond the training dataset.</p></li></ul><p>For example, InstructGPT can answer questions about code and handle prompts written in different languages, despite the finetuning dataset lacking sufficient data within this distribution. Although the model did not receive as much recognition as ChatGPT, InstructGPT was a major step forward in AI that introduced many core concepts used for training modern LLMs. </p><h2>Conclusion</h2><p>PPO is one of the most widely used RL algorithms for LLMs that has&#8212;<em>through its key role in RLHF pipelines</em>&#8212;directly contributed to fundamental advancements in AI. As we learned, research on PPO was an important factor in the creation of models like InstructGPT and ChatGPT. These influential models catalyzed the ongoing boom in LLM research in which we currently find ourselves.</p><p>We cannot overstate the impact of PPO on LLM research, and PPO continues to play an important role in LLM post-training pipelines today. However, the barrier to entry for PPO is high due to its memory and compute overhead. Additionally, the results of PPO can vary based on a wide variety of practical implementation details and hyperparameter settings. For these reasons, most research on PPO has been centralized within top frontier labs. Only a small number of groups have sufficient compute resources to empirically tune and obtain a working PPO implementation at scale.</p><p>Nonetheless, understanding PPO is essential due to its fundamental role in AI research. The cost and complexity of PPO remains high, but RL researchers have recently expanded and improved upon ideas proposed by PPO. For example, REINFORCE and GRPO are simpler (and more stable) policy gradient algorithms that can be used to train LLMs, which use less memory than PPO by avoiding the critic. A working understanding of PPO makes understanding these new algorithms&#8212;<em>or even developing our own</em>&#8212;much simpler!</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Schulman, John, et al. &#8220;Proximal policy optimization algorithms.&#8221; <em>arXiv preprint arXiv:1707.06347</em> (2017).</p><p>[2] Lambert, Nathan. &#8220;Reinforcement Learning from Human Feedback.&#8221; Online (2025). https://rlhfbook.com</p><p>[3] Schulman, John, et al. &#8220;High-dimensional continuous control using generalized advantage estimation.&#8221; <em>arXiv preprint arXiv:1506.02438</em> (2015).</p><p>[4] Huang, Shengyi, et al. &#8220;The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization.&#8221; <em>arXiv preprint arXiv:2403.17031</em> (2024).</p><p>[5] Stiennon, Nisan, et al. &#8220;Learning to summarize with human feedback.&#8221; <em>Advances in neural information processing systems</em> 33 (2020): 3008-3021.</p><p>[6] Schulman, John, et al. &#8220;Trust region policy optimization.&#8221; <em>International conference on machine learning</em>. PMLR, 2015.</p><p>[7] Lambert, Nathan, et al. &#8220;Tulu 3: Pushing frontiers in open language model post-training.&#8221; <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[8] Ouyang, Long, et al. &#8220;Training language models to follow instructions with human feedback.&#8221; <em>Advances in neural information processing systems</em> 35 (2022): 27730-27744.</p><p>[9] Ahmadian, Arash, et al. &#8220;Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.&#8221; <em>arXiv preprint arXiv:2402.14740</em> (2024).</p><p>[10] Biderman, Stella, et al. &#8220;Pythia: A suite for analyzing large language models across training and scaling.&#8221; <em>International Conference on Machine Learning</em>. PMLR, 2023.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>As we can see, the discounted reward has an infinite horizon in this case. In other words, the total number of steps in the trajectory is infinite <code>T = &#8734;</code>. This is known as the infinite-horizon discounted return.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The VPG was also partially covered in my overview of REINFORCE that was released a few weeks ago; see <a href="https://cameronrwolfe.substack.com/p/reinforce">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Specifically, if we wanted to solve a constrained optimization problem like this with gradient ascent, we would have to use constrained gradient ascent. However, this method requires that we project our solution into the space of valid solutions that satisfy the constraint after every optimization step, which would be computationally intractable for neural network parameters. The KL divergence is a very complex constraint for which to perform this projection!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>More specifically, if the policy ratio is greater than <code>1 + &#949;</code>, we set it equal to <code>1 + &#949;</code>. If the policy ratio is less than <code>1 - &#949;</code>, we set it to <code>1 - &#949;</code>. Otherwise, we keep the value of the policy ratio unchanged. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The clipped objective will always be less than or equal to the unclipped objective due to the fact that we are taking the minimum of the unclipped and clipped objectives. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The &#8220;actor&#8221; refers to the LLM&#8212;<em>or the model that is taking actions</em>&#8212;and the &#8220;critic&#8221; refers to the value model. The value model is called a critic due to the fact that it is predicting the reward associated with each action (i.e., effectively critiquing the action).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>For more details on loss aggregation in RL, see <a href="https://rlhfbook.com/c/11-policy-gradients.html#loss-aggregation">this section</a> of the RLHF book, which provides concrete examples of different aggregation strategies and their impact. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The adaptive KL divergence is explained in Section 4 of [1]. Instead of setting a fixed scaling factor for the KL divergence, authors propose dynamically adjusting this factor throughout training such that the KL divergence stays close to a target KL divergence <code>d_targ</code>. Put differently, instead of choosing the scaling factor, <em>we specify what we want our KL divergence to be and dynamically adjust the scaling factor throughout training to keep the KL divergence in this range</em>. This approach is not commonly used for recent LLMs, and it is much more common to set a fixed <code>&#946;</code> coefficient for the KL divergence. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>The reference and old models are different models in PPO! The reference model is the policy parameters before any RL training is performed. For LLMs, the SFT model is usually the reference model. We usually perform multiple updates over a batch of data in PPO, <em>and the old model is the model before the first update</em>. The old model is updated each time a new batch of data is sampled, whereas the reference model is fixed. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>This means that less data is required to achieve a given level of performance (i.e., the learning process is faster). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Specifically, we would use the cumulative reward after state <code>s_t</code>. However, for LLMs this distinction does not usually matter due to the use of outcome rewards.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>In fact, this is where the name for the TD residual comes from. We are computing the difference in value between two time steps. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>The critic is just a model that imperfectly estimates of the value function. The bias in the TD residual comes from the fact that the critic makes mistakes in estimating the value.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>To derive this expression, we begin with the original formula for the GAE showed in the top line, expand the definitions of the <code>N</code>-step advantage estimates, rearrange the terms, then use the <a href="https://en.wikipedia.org/wiki/Geometric_series">geometric series formula</a> to derive the final expression.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>This statement assumes that the KL divergence is added to the loss and not directly incorporated into the reward.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[REINFORCE: Easy Online RL for LLMs]]></title><description><![CDATA[How to get the benefits of online RL without the complexity of PPO...]]></description><link>https://cameronrwolfe.substack.com/p/reinforce</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/reinforce</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 29 Sep 2025 09:33:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a93eba44-8bc9-40ed-b91a-9e71797aea35_2484x1402.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WNsD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WNsD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 424w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 848w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WNsD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png" width="1456" height="787" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:848077,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WNsD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 424w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 848w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!WNsD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3143b6-367a-47aa-aae6-11c921d2f0be_2360x1276.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Reinforcement learning (RL) is playing an increasingly important role in research on large language models (LLMs). Initially, RL was used to power LLM alignment via approaches like Reinforcement Learning from Human Feedback (RLHF). More recently, it has become foundational for training powerful large reasoning models (LRMs). When training LLMs with RL, online algorithms such as Proximal Policy Optimization (PPO) are often used by default. However, these algorithms are expensive and complex compared to alternatives like <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> or <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">direct preference optimization (DPO)</a>:</p><ul><li><p>Four different copies of the LLM must be kept in memory.</p></li><li><p>The online training process is difficult to orchestrate and can be unstable.</p></li><li><p>There are many training hyperparameters that must be tuned properly.</p></li></ul><p>The complexity of PPO arises from the need to stabilize the online training process. This algorithm was developed in an earlier generation of research, which focused on training neural networks from scratch to solve tasks like robotic locomotion and Atari gameplay. The RL setting for LLMs is much different&#8212;<em>we are fine-tuning pretrained models that already have a powerful prior</em>.</p><blockquote><p><em>&#8220;PPO has been positioned as the canonical method for RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance.&#8221;</em> - from [3]</p></blockquote><p>Many practitioners avoid the use of online RL when training LLMs due to cost and complexity. In this overview, we will learn that online RL does not have to be so difficult! Due to the unique properties of the LLM domain, we can use simpler algorithms&#8212;<em>like REINFORCE or REINFORCE leave-one-out (RLOO)</em>&#8212;and still achieve performance similar to that of PPO. Therefore, instead of avoiding online RL in favor of simpler RL-free or offline alternatives, <em>we can just use algorithms that provide the benefits of online RL without the unnecessary complexity</em>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to stay up-to-date with AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Basics of RL for LLMs</h2><p>We will begin by covering the basics of reinforcement learning (RL). To start, we will explore the problem setup and terminology commonly used in RL, as well as how these formalisms can be translated to the LLM domain. After covering RL fundamentals and how RL is applied in the context of LLMs, we will spend the majority of this section focusing on policy optimization by deriving the standard policy gradient expression frequently used in RL and outlining concrete implementations for the most basic forms of these training algorithms. </p><h4>Problem Setup and Terminology for RL</h4><p>When running RL training, we have an <strong>agent</strong> that takes <strong>actions</strong> within some <strong>environment</strong>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQCe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQCe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 424w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 848w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1272w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png" width="1456" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lQCe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 424w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 848w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1272w, https://substackcdn.com/image/fetch/$s_!lQCe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7117e42-c6ab-43c4-8878-5a88cb99c9ae_2203x870.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Basic problem setup for RL</figcaption></figure></div><p>These actions are predicted by a <strong>policy</strong>&#8212;<em>we can think of the policy as the agent&#8217;s brain</em>&#8212;that is usually parameterized (e.g., the policy is the LLM itself in the context of training LLMs). Our policy can either be deterministic or stochastic, but in this overview we will assume the policy is stochastic<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. We can model the probability of a given action under our policy as <code>&#960;_&#952;(a_t | s_t)</code>. </p><p>When the policy outputs an action, the <strong>state</strong> of the environment will be updated according to a <strong>transition function</strong>, which is part of the environment. We will denote our transition function as <code>P(s_t+1 | a_t, s_t)</code>.  However, transition functions are less relevant for LLMs because they are typically a pass-through; i.e., we assume <code>s_t = {x, a_1, a_2, &#8230;, a_t}</code>, where <code>x</code> is the prompt. </p><p>Finally, each state visited by the agent receives a <strong>reward</strong> from the environment that may be positive, negative, or zero (i.e., no reward). As shown in the prior figure, our agent acts iteratively and each action (<code>a_t</code>), reward (<code>r_t</code>), and state (<code>s_t</code>) are associated with a time step <code>t</code>. Combining these time steps together yields a <strong>trajectory</strong>; see below. Here, we assume that the agent takes a total of <code>T</code> steps in the environment for this particular trajectory.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cjh1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cjh1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 424w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 848w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1272w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png" width="1456" height="245" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108505,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cjh1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 424w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 848w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1272w, https://substackcdn.com/image/fetch/$s_!cjh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee11fdb-dee8-4d4e-8819-b97642a17129_2008x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using the chain rule of probabilities, we can also compute the probability of a full trajectory by combining the probabilities of:</p><ul><li><p>Each action <code>a_t</code> given by our policy <code>&#960;_&#952;(a_t | s_t)</code>.</p></li><li><p>Each state <code>s_t+1</code> given by the transition function <code>P(s_t+1 | a_t, s_t)</code>.</p></li></ul><p>The full expression for the probability of a trajectory is provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YCeT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YCeT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 424w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 848w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1272w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png" width="650" height="239.28571428571428" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:1456,&quot;resizeWidth&quot;:650,&quot;bytes&quot;:245378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YCeT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 424w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 848w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1272w, https://substackcdn.com/image/fetch/$s_!YCeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52061751-cc8a-4f3e-a889-5d4e542b21bf_2092x770.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Computing the probability of a trajectory</figcaption></figure></div><p><strong>RL objective.</strong> When training a model with RL, our goal is to maximize the cumulative reward over the entire trajectory (i.e., the sum of <code>r_t</code>). However, there are a few variations of this objective that commonly appear. Specifically, the reward that we maximize can either be discounted or non-discounted<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>; see below. By incorporating a discount factor, we reward our policy for achieving rewards sooner rather than later. In other words, <em>money now is better than money later</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8D_n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8D_n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 424w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 848w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1272w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png" width="496" height="290.5824175824176" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:496,&quot;bytes&quot;:158346,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8D_n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 424w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 848w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1272w, https://substackcdn.com/image/fetch/$s_!8D_n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbfd6da8-2406-4197-b9d0-d3a1ec301b39_1496x876.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Our objective is usually expressed as an expected cumulative reward, where the <a href="https://en.wikipedia.org/wiki/Expected_value">expectation</a> is taken over the trajectory. Expanding this expectation yields a weighted sum of rewards for each trajectory&#8212;<em>the weight is just the trajectory&#8217;s probability</em>. We can formulate this in a continuous or discrete manner; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!45io!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!45io!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 424w, https://substackcdn.com/image/fetch/$s_!45io!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 848w, https://substackcdn.com/image/fetch/$s_!45io!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1272w, https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png" width="522" height="272.83104395604397" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:235822,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!45io!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 424w, https://substackcdn.com/image/fetch/$s_!45io!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 848w, https://substackcdn.com/image/fetch/$s_!45io!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1272w, https://substackcdn.com/image/fetch/$s_!45io!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523baab0-10b4-438e-85d7-e7c5c0681209_1692x884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We want to maximize this objective during training, which can be accomplished via <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient ascent</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>; see below. Given this setup, the lingering question that we have to answer is: <em>How do we compute this gradient?</em> As we will see, much of the research on RL focuses on answering this question, and many techniques exist.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!slrY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!slrY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 424w, https://substackcdn.com/image/fetch/$s_!slrY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 848w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1272w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png" width="1456" height="184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3072897-d905-42be-b385-6186c24ae059_2390x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!slrY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 424w, https://substackcdn.com/image/fetch/$s_!slrY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 848w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1272w, https://substackcdn.com/image/fetch/$s_!slrY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3072897-d905-42be-b385-6186c24ae059_2390x302.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Solving the RL objective with gradient ascent</figcaption></figure></div><p><strong>State, value, and advantage functions.</strong> Related to RL objective, we can also define the following set of functions:</p><ul><li><p><em>Value Function</em> <code>V(s)</code>: the expected cumulative reward when you start in state <code>s</code> and act according to your current policy <code>&#960;_&#952;</code>.</p></li><li><p><em>Action-Value Function</em> <code>Q(s, a)</code>: the expected cumulative reward when you start in state <code>s</code>, take action <code>a</code>, then act according to your policy <code>&#960;_&#952;</code>.</p></li><li><p><em>Advantage Function</em> <code>A(s, a)</code>: the difference between the action-value and value function; i.e., <code>A(s, a) = Q(s, a) - V(s)</code>.</p></li></ul><p>Intuitively, the advantage function tells us how useful some action <code>a</code> is by taking the difference between the expected reward after taking action <code>a</code> in state <code>s</code> and the general expected reward from state <code>s</code>. The advantage will be positive if the reward from action <code>a</code> is higher than expected and vice versa. Advantage functions play a huge role in RL research&#8212;<em>they are used to compute the gradient for our policy</em>.</p><blockquote><p><em>&#8220;Sometimes in RL, we don&#8217;t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.<strong>&#8221;</strong></em> - <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html">Spinning up in Deep RL</a></p></blockquote><h4>Markov Decision Process (MDP) versus Bandit Formulation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j0Id!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j0Id!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 424w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 848w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 1272w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j0Id!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png" width="474" height="341.65811965811963" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1012,&quot;width&quot;:1404,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:121391,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j0Id!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 424w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 848w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 1272w, https://substackcdn.com/image/fetch/$s_!j0Id!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00d609c2-0276-4102-914e-7de5d6a5326e_1404x1012.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">RL terminology mapping for LLMs</figcaption></figure></div><p>Now that we understand RL basics, we need to map the terminology that we have learned to the setting of LLM training. We can do this as follows (shown above):</p><ul><li><p>Our <strong>policy</strong> is the LLM itself.</p></li><li><p>Our <strong>initial state</strong> is the prompt. </p></li><li><p>The LLM&#8217;s output&#8212;<em>either each token or the entire completion</em>&#8212;is an <strong>action</strong>.</p></li><li><p>Our <strong>state</strong> is the combination of our prompt with the LLM&#8217;s output.</p></li><li><p>The entire completion from the LLM forms a <strong>trajectory</strong>. </p></li></ul><p>Notably, there is no transition function in this setup because the transition function is completely deterministic. If we start with a prompt <code>x</code> and our LLM predicts tokens <code>t_1</code> and <code>t_2</code> given this prompt as input, then our updated state simply becomes <code>s_2 = {x, t_1, t_2}</code>. In other words, <em>our state is just the running completion being generated by the LLM for a given prompt </em><code>x</code>. </p><p><strong>Markov decision process (MDP) formulation.</strong> For LLMs, there are two key ways in which RL can be formulated that differ in how they model actions. We should recall that an LLM generates output via <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a>; i.e., by generating each token in the output completion sequentially. This autoregressive process is depicted below. As we can see, the next token prediction process maps very easily to an RL setup&#8212;<em>we can just model each token as an individual action</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QUg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QUg4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 424w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 848w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1272w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png" width="1456" height="758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144540,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QUg4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 424w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 848w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1272w, https://substackcdn.com/image/fetch/$s_!QUg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1a8412-5cfb-481f-bd50-473f0a6fd9b5_1992x1037.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The approach of modeling each token in the LLM&#8217;s output as an individual action is called the <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov Decision Process (MDP)</a> formulation. An MDP is simply a probabilistic framework for modeling decision-making that includes states, actions, transition probabilities and rewards&#8212;<em>this is exactly the setup we have discussed so far for RL</em>! The MDP formulation used for RL is shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KWz-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KWz-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 424w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 848w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1272w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png" width="540" height="337.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:119785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KWz-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 424w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 848w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1272w, https://substackcdn.com/image/fetch/$s_!KWz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52f4f8de-4456-4cbd-935c-a945968b704d_1466x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When modeling RL as an MDP for LLMs, our initial state is the prompt and our policy acts by predicting individual tokens. Our LLM forms a stochastic policy that predicts a distribution over tokens. During generation, actions are taken by selecting a token from this distribution&#8212;<em>each token is its own action</em>. After a token is predicted, it is added to the current state and used by the LLM to predict the next token&#8212;<em>this is just autoregressive next token prediction</em>! Eventually, the LLM predicts a stop token (e.g., <code>&lt;|end_of_text|&gt;</code> or <code>&lt;eos&gt;</code>) to complete the generation process, thus yielding a complete trajectory.</p><p><strong>Bandit formulation.</strong> In the above depiction of an MDP, we assume that a reward is provided for every time step, but the reward mechanism for an LLM is usually a bit different from this. Most LLMs are trained using outcome supervision<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, meaning that a reward is only assigned after the model has generated a complete response (i.e., after the <code>&lt;eos&gt;</code> token has been outputted). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCyt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCyt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 424w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 848w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZCyt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png" width="1456" height="382" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:382,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCyt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 424w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 848w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCyt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1f754f7-353f-4d3e-a79f-a85e5decbc73_1868x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Outcome versus process supervision for LLMs</figcaption></figure></div><p>In an outcome supervision setting, we may begin to question the utility of modeling each token as its own action. <em>How will we know whether any single action is helpful or not in this scenario?</em> As an alternative, we could model the entire response as a single action that receives an outcome reward. This is the key idea behind the bandit formulation for RL training with LLMs; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nAQM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nAQM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 424w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 848w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 1272w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nAQM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png" width="1456" height="220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:79475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nAQM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 424w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 848w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 1272w, https://substackcdn.com/image/fetch/$s_!nAQM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3080d828-b154-42d7-9f6f-7ac24b2be0f4_2234x338.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This name comes from the idea of a <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">contextual bandit</a> in probability theory. The bandit setup is simple: <em>our agent chooses an action, receives a reward and the episode ends</em>. Our complete trajectory is a single action and reward! For LLMs, our action is the full completion generated for a prompt, which receives an outcome reward. </p><p><strong>Which formulation should we use?</strong> In the context of LLMs, we already know how to compute the probability of both individual tokens and the full completion for a prompt. Therefore, we have the ability to model RL using either an MDP or bandit formulation. Given that LLMs usually only receive outcome rewards, however, the bandit formulation&#8212;<em>despite being very simple</em>&#8212;is quite fitting for LLMs. As we will learn, both REINFORCE and RLOO adopt the bandit formulation, while algorithms like PPO use a per-token MDP formulation. In other words, <em>both RL formulations are viable and used for training LLMs</em>. </p><h4>RL Training for LLMs</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJn6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Given the terminology and setup explained so far, we can now discuss how RL is actually used to train LLMs. There are two broad categories of RL training that are commonly used for LLMs today:</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a human preference <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.</p></li></ul><p>These RL training techniques differ mainly in how they derive the reward for training, but other details of the algorithms are mostly similar. As depicted below, they both operate by generating completions over a set of prompts, computing the reward for these completions, and using the rewards to derive a <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">policy update</a>&#8212;<em>or an update to the LLM&#8217;s parameters</em>&#8212;with an RL optimizer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Visual depiction of RL for LLMs</figcaption></figure></div><p>The last step of this process is a gradient ascent step on the RL objective, just as we saw before. However, the actual objective used in RL training goes beyond maximizing cumulative reward. We try to maximize the reward while minimizing <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a> of our policy with respect to a reference policy&#8212;<em>usually an LLM checkpoint from the start of RL training</em>. We want to maximize reward without making our new model significantly different from the reference; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kyeM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" width="1456" height="263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">RL training objective with KL divergence</figcaption></figure></div><p>Computing the gradient of this objective with respect to the policy&#8217;s parameters is where most of the complexity lies in understanding RL. In the context of LLMs, we use policy gradient algorithms (e.g., PPO, GRPO, and REINFORCE) to compute this gradient. This overview will primarily focus on REINFORCE and its variants, but to learn how these algorithms work we need to first understand the simplest form of a policy gradient&#8212;<em>the vanilla policy gradient (VPG)</em>.</p><h4>Deriving the Vanilla Policy Gradient (VPG)</h4><p>We will cover the full derivation of the vanilla policy gradient (VPG) here for completeness. However, there are many existing overviews that explain VPG very well. A few great resources for further learning are as follows:</p><ul><li><p>Intro to Policy Optimization from OpenAI [<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">link</a>]</p></li><li><p>RLHF Book from <a href="https://natolambert.com/">Nathan Lambert</a> [<a href="https://rlhfbook.com/c/11-policy-gradients.html">link</a>]</p></li><li><p>Policy Optimization Algorithms from <a href="https://lilianweng.github.io/">Lilian Weng</a> [<a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">link</a>]</p></li></ul><p>Additionally, the prior breakdown of VPG and policy optimization from this newsletter is linked below for easy reference. Our discussion in this section will largely be sampled from this more detailed exposition of policy gradients. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;63a640d1-843f-4a2b-baff-d5c0e879e35f&quot;,&quot;caption&quot;:&quot;A deep dive into policy gradients, how they are applied to training neural networks, and their derivation in the simplest-possible form. &quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Policy Gradients: The Foundation of RLHF&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-10-02T09:22:08.195Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65814056-1bac-4066-8da6-4c323e676060_2408x1352.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:137421286,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:31,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1092659,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>A basic policy gradient.</strong> Our goal in policy optimization is to compute the policy gradient, or the gradient of our RL objective&#8212;<em>here we will assume our objective is cumulative reward</em>&#8212;with respect to the parameters of our policy. As a first step in computing the policy gradient, we can perform the derivation shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GetI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GetI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 424w, https://substackcdn.com/image/fetch/$s_!GetI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 848w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1272w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png" width="1456" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:396498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GetI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 424w, https://substackcdn.com/image/fetch/$s_!GetI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 848w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1272w, https://substackcdn.com/image/fetch/$s_!GetI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1685ea69-1b2c-438c-87ed-dba51c4bee65_2406x1065.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">source</a>)</figcaption></figure></div><p>This derivation starts with the gradient of our RL training objective (cumulative reward) and ends with a basic expression for the policy gradient. To arrive at the policy gradient, we use mostly simple steps like <em>i)</em> the definition of an expectation over a continuous random variable and <em>ii)</em> the <a href="https://andrewcharlesjones.github.io/journal/log-derivative.html">log-derivative trick</a>.</p><p>The most complicated step of this derivation is the final step, which transforms the gradient of the log probability of a trajectory into a sum over the gradients of log probabilities of actions. This step uses our prior expression for the probability of a trajectory, converts the product into a sum (i.e., because we are working with <a href="http://cuemath.com/algebra/properties-of-logarithms/">log probabilities</a>), and observes that the gradients of the initial state probability and transition function with respect to the policy parameters are always zero because neither of these components depend on the policy; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rkmm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 424w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 848w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1272w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png" width="604" height="166.76373626373626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:1456,&quot;resizeWidth&quot;:604,&quot;bytes&quot;:59832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rkmm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 424w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 848w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1272w, https://substackcdn.com/image/fetch/$s_!Rkmm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0f526be-55f2-4eae-abd8-fa4382d8335a_1564x432.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">source</a>)</figcaption></figure></div><p><strong>Implementing a basic policy gradient.</strong> The basic policy gradient expression that we derived above is actually pretty easy to compute. Specifically, this expression contains two key quantities that we already know how to compute:</p><ul><li><p>The reward comes directly from a verifier or reward model.</p></li><li><p>Log probabilities of actions can be computed with our LLM (i.e., these are just the token probabilities from the LLM&#8217;s output).</p></li></ul><p>To make the process of computing the basic policy gradient more concrete, a step-by-step implementation in PyTorch pseudocode has been provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PYzF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PYzF!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 424w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 848w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1272w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!PYzF!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 424w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 848w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1272w, https://substackcdn.com/image/fetch/$s_!PYzF!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e4bdafe-cd71-48b7-8a10-abdc895432f7_1920x1076.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The core intuition behind the structure of this basic policy gradient is that we are increasing the probability of actions from trajectories with high rewards.</p><blockquote><p><em>&#8220;Taking a step with this gradient pushes up the log-probabilities of each action in proportion to </em><code>R(&#120591;)</code><em>, the sum of all rewards ever obtained.&#8221;</em> - <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html">Spinning up in Deep RL</a></p></blockquote><p>This form of the policy gradient is simple, but it still appears in practice! For example, Cursor uses this exact expression in their <a href="https://cursor.com/blog/tab-rl">recent blog on online RL</a>. However, the expression in their blog assumes a bandit formulation, which causes the sum in the expression to be removed (i.e., because there is only one action). </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yMfv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yMfv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 424w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 848w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 1272w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yMfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png" width="610" height="169.69253294289896" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:1366,&quot;resizeWidth&quot;:610,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-06-13 at 10.21.06&#8239;AM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-06-13 at 10.21.06&#8239;AM.png" srcset="https://substackcdn.com/image/fetch/$s_!yMfv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 424w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 848w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 1272w, https://substackcdn.com/image/fetch/$s_!yMfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653838ee-1b3c-4740-be48-e51e04192c99_1366x380.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#baselines-in-policy-gradients">source</a>)</figcaption></figure></div><p><strong>Reducing variance.</strong> Our current policy gradient expression is simple, but it suffers from a few notable issues:</p><ul><li><p>The gradients can have high variance.</p></li><li><p>There is no protection against large, unstable policy updates.</p></li></ul><p>Most subsequent policy gradient algorithms aim to solve these problems by reducing variance of the policy gradient and enforcing a trust region on policy updates&#8212;<em>or, in other words, restricting how much we can change the model in a single update</em>. To do this, we usually replace the reward term in our policy gradient with a slightly different term; see below for some of the common options.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EZ-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EZ-T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 424w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 848w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 1272w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EZ-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png" width="727" height="288.60302197802196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-06-13 at 11.04.17&#8239;AM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-06-13 at 11.04.17&#8239;AM.png" srcset="https://substackcdn.com/image/fetch/$s_!EZ-T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 424w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 848w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 1272w, https://substackcdn.com/image/fetch/$s_!EZ-T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc797ee60-f90a-4c41-ae9b-b9da6d68096f_1804x716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>As we can see, this expression is nearly identical to what we saw before. The only difference is that we have switched <code>R(&#120591;)</code> with the generic <code>&#936;_t</code> term, which can be set equal to a couple of different things. For example, we can:</p><ul><li><p>Set <code>&#936;_t = R(&#120591;)</code> to recover our basic policy gradient expression.</p></li><li><p>Set <code>&#936;_t</code> equal to rewards received after time <code>t</code> (i.e., the reward-to-go policy gradient) to avoid crediting actions with rewards that came before them.</p></li><li><p>Set <code>&#936;_t</code> to a <a href="https://cameronrwolfe.substack.com/i/137421286/variants-of-the-basic-policy-gradient">baselined</a> version of the reward.</p></li><li><p>Set <code>&#936;_t</code> equal to the state-action (<code>Q</code>) or advantage function (<code>A</code>).</p></li></ul><p>A full overview of these choices and how they are derived can be found <a href="https://cameronrwolfe.substack.com/i/137421286/variants-of-the-basic-policy-gradient">here</a>. A common theme among these algorithms is the use of baselines, or extra terms&#8212;<em>which must only depend on the state </em><code>s_t</code>&#8212;that we subtract from the reward as shown below. Baselines serve the purpose of normalizing the reward (or value) for a state and can be shown to reduce the variance of policy gradients<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.  </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LPFt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LPFt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 424w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 848w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 1272w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LPFt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png" width="628" height="165.19505494505495" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:1456,&quot;resizeWidth&quot;:628,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LPFt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 424w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 848w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 1272w, https://substackcdn.com/image/fetch/$s_!LPFt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd202d1cb-9ee8-4540-9360-be8dea93a14b_2046x538.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Adding a baseline to rewards in the policy gradient</figcaption></figure></div><div class="pullquote"><p>A common problem with vanilla policy gradient algorithms is the high variance in gradient updates&#8230; In order to alleviate this, various techniques are used to normalize the value estimation, called <em>baselines</em>. Baselines accomplish this in multiple ways, effectively normalizing by the value of the state relative to the downstream action (e.g. in the case of Advantage, which is the difference between the Q value and the value). The simplest baselines are averages over the batch of rewards or a moving average. - <a href="https://rlhfbook.com/c/11-policy-gradients.html">RLHF book</a></p></div><p>Most of the algorithms we will see focus on setting <code>&#936;_t</code> equal to the advantage function&#8212;<em>this is known as the vanilla policy gradient (VPG) algorithm</em>. The advantage function is commonly used because it yields the lowest-variance policy gradient. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1PL6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1PL6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 424w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 848w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1272w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png" width="482" height="136.3901098901099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1456,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1PL6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 424w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 848w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1272w, https://substackcdn.com/image/fetch/$s_!1PL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbd6ad6-4d9e-4085-b4a7-849b29789350_1662x470.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The vanilla policy gradient</figcaption></figure></div><p><strong>Actor-critic.</strong> We should recall that the advantage function is the difference between the state-action value function and the value function. In other words, <em>the VPG algorithm effectively uses the value function as a baseline in the policy gradient</em>. The value function is on-policy, meaning that it depends on the exact parameters of our policy in the current training iteration. Usually, we estimate the value function with a neural network. For LLMs, the value function is approximated with a separate value head<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> (or model) that is initialized from the weights of the LLM and trained to predict the value function. </p><p>The LLM used to estimate the value function is referred to as a value model or critic. The critic predicts the value function&#8212;<em>or the expected reward starting from a given token or state</em>&#8212;for every token within a sequence. During RL training, the critic is actively updated alongside the LLM for each policy update&#8212;<em>this is referred to as an actor-critic setup</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Unlike <a href="https://cameronrwolfe.substack.com/p/reward-models">reward models</a> which are fixed at the beginning of RL training, the critic is dependent upon the current parameters of the policy. Therefore, to remain on-policy and avoid its predictions becoming stale, the critic must be updated along with the LLM itself. PPO is a notable example of a policy gradient algorithm that adopts such an actor-critic setup. </p><p>The critic is usually updated using a <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean-squared error (MSE) loss</a> between the predicted and actual rewards. A pseudocode implementation of an actor-critic algorithm is provided below. Although this is a common setup, the use of a value model can be quite expensive&#8212;<em>this requires keeping an entire additional copy of the LLM in memory</em>! In fact, using a critic is part of the reason why PPO has high computational overhead. Next, we will learn about algorithms that adopt simpler and more efficient approaches for estimating the value function. </p><pre><code>import torch
import torch.nn.functional as F

# sample prompt completions and rewards
with torch.no_grad():
    completions = LLM(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G, 1)

# compute value function / critic output
values = CRITIC(completions)  # (B*G, L) - per token!
advantage = rewards - values.detach()

# get logprobs for each action
completion_mask = &lt;... mask out padding tokens ...&gt;
llm_out = LLM(completions)
token_logp = F.log_softmax(llm_out, dim=-1)

# loss includes a weighted combination of the policy gradient
# loss and the MSE loss for the critic
loss = (- token_logp * advantage) * completion_mask
loss += _beta * (0.5 * (values - rewards)**2)

# aggregate the loss (many options exist here)
loss = (loss.sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()</code></pre><h2>REINFORCE and RLOO for LLMs</h2><p>So far, we have learned about basic concepts in policy optimization and RL for LLMs. The basic policy gradient that we derived is easy to compute practically, but such a formulation leads to high-variance policy gradients and unstable training. To reduce variance, we need an RL optimizer that incorporates an advantage estimate into the policy gradient. However, popular algorithms like PPO accomplish this with a complicated actor-critic framework that introduces substantial overhead. Given this added complexity, we might wonder: <em>Should we just avoid online RL techniques altogether when training LLMs?</em> </p><blockquote><p><em>&#8220;Recent works propose RL-free methods such as DPO or iterative fine-tuning approaches to LLM preference training. However, these works fail to question whether a simpler solution within an RL paradigm exists.&#8221;</em> - from [3]</p></blockquote><p>Although many <a href="https://cameronrwolfe.substack.com/p/online-rl">offline and RL-free training alternatives</a> exist, there are also simple online RL algorithms that can be used to train LLMs. In this section, we will learn about REINFORCE and a slightly modified version of this algorithm called REINFORCE leave one out (RLOO). These online RL algorithms eliminate the need for a critic by estimating the value function with the average of rewards observed throughout training. In theory, such an approach yields higher-variance policy gradients compared to actor-critic algorithms like PPO. However, recent research [3, 5] has found that this increase in variance does not impact LLM training, <em>yielding easy-to-use and highly-performant options for online RL training</em>.</p><h4><a href="https://link.springer.com/article/10.1007/BF00992696">REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE)</a> [1]</h4><p>REINFORCE is a particular implementation of VPG that has low overhead, is simple to understand, and tends to be effective for training LLMs. The structure of the policy gradient used by REINFORCE is similar to the baselined policy gradient estimate we covered before. However, REINFORCE specifically uses the average of rewards observed during RL training as a baseline. This average can be computed in a few different ways; e.g., a moving average of rewards throughout training or an average of rewards present in the current batch.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yDWw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yDWw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 424w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 848w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 1272w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yDWw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png" width="1456" height="330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156949,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yDWw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 424w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 848w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 1272w, https://substackcdn.com/image/fetch/$s_!yDWw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8725dec-4bf1-4790-89cd-4287d7fbbf33_1978x448.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The expression for the policy gradient in REINFORCE is shown above. To compute a gradient update over a batch, we perform the following steps:</p><ul><li><p>Generate a completion for each prompt using the current policy.</p></li><li><p>Store the log probabilities for the tokens in each completion</p></li><li><p>Assign a reward to each completion (usually with a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>).</p></li><li><p>Obtain a baseline by taking an average of rewards.</p></li><li><p>Compute the advantage by subtracting the baseline from the reward.</p></li><li><p>Compute the sum of log probabilities multiplied by the advantage for each completion, then average over the batch to form a <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo</a> estimate.</p></li></ul><p><strong>What does the acronym mean?</strong> The REINFORCE acronym is composed of three key components:</p><ol><li><p>Reward Increment.</p></li><li><p>Non-negative factor.</p></li><li><p>Offset reinforcement.</p></li><li><p>Characteristic eligibility.</p></li></ol><p>The first component is simply our update&#8212;<em>or increment</em>&#8212;to the policy&#8217;s parameters (i.e, the policy gradient), which is a product of the three other components. The manner in which these components are combined to form the policy gradient is shown below (top term). To clarify the meaning of each term, we also map the components of REINFORCE to the more familiar expression for a policy gradient. As we can see, these are the same terms we have learned about before (e.g., log probabilities, reward, and baseline)! Additionally, REINFORCE includes the learning rate&#8212;<em>a &#8220;non-negative factor&#8221; because we are performing gradient ascent and trying to maximize rewards</em>&#8212;within its expression.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VIDL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VIDL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 424w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 848w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 1272w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VIDL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png" width="547" height="289.654532967033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:771,&quot;width&quot;:1456,&quot;resizeWidth&quot;:547,&quot;bytes&quot;:217483,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!VIDL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 424w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 848w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 1272w, https://substackcdn.com/image/fetch/$s_!VIDL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bf32dae-2f08-46f4-ac67-0f9fbafd0a53_1730x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Mapping REINFORCE components to a familiar policy gradient expression</figcaption></figure></div><p>The term &#8220;offset reinforcement&#8221; is straightforward to understand. The baseline is directly subtracted from the reward in our policy gradient expression. In other words, the baseline is used to offset the reward, which is the reinforcement signal in RL (i.e., the reward determines whether actions are good or bad). <em>The baseline is, therefore, an offset to the reinforcement signal</em>. Unpacking the term &#8220;characteristic eligibility&#8221; requires a slightly deeper understanding of RL terminology. </p><blockquote><p><em>&#8220;Characteristic Eligibility: This is how the learning becomes attributed per token. It can be a general value, per parameter, but is often log probabilities of the policy in modern equations.&#8221;</em> - <a href="https://rlhfbook.com/c/11-policy-gradients">RLHF book</a></p></blockquote><p>&#8220;Eligibility&#8221; is a jargon term in RL related to the <a href="https://courses.csail.mit.edu/6.803/pdf/steps.pdf">credit assignment problem</a>&#8212;<em>or the problem of determining which specific actions contributed to the reward received by the policy</em>. Specifically, eligibility refers to whether a particular action taken by the LLM is actually responsible for a given reward. In the policy gradient expression, credit assignment is handled by the log probabilities of actions under the policy.</p><p><strong>Incorporating KL divergence.</strong> As with most other RL training algorithms, we also incorporate the <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">Kullback-Leibler (KL) Divergence</a> with respect to a reference policy&#8212;<em>usually a prior <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">SFT</a>-trained checkpoint of our model</em>&#8212;into REINFORCE. We have several different approaches for <a href="http://joschu.net/blog/kl-approx.html">approximating KL divergence</a>. A common approach is to approximate KL divergence as the difference in log probabilities between the policy and reference policy. Once we&#8217;ve made this approximation, the KL divergence is directly incorporated into the reward as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8KwS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8KwS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 424w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 848w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 1272w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8KwS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png" width="1456" height="377" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8KwS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 424w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 848w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 1272w, https://substackcdn.com/image/fetch/$s_!8KwS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F193a7b06-64ee-4be1-b8b2-02849548b5bf_2186x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This approach of subtracting the KL penalty from the reward varies depending on the RL training algorithm or implementation. For ex ample, <a href="https://arxiv.org/abs/2402.03300">GRPO</a> incorporates the KL divergence into the loss function rather than directly into the reward. Adding the KL divergence into RL regularizes the training process and allows us to ensure that our policy does not deviate significantly from the reference policy.</p><p><strong>Efficiency &amp; overhead.</strong> Compared to algorithms like PPO, REINFORCE has reduced overhead, as it does not require the use of a value (or critic) model to compute the advantage estimate&#8212;<em>the average of rewards is used in place of the critic</em>. Therefore, there are only three LLM involved in the training process (i.e., policy, reference policy, and reward model), rather than four; see below. The downside of estimating the advantage in this way is higher variance. As we will see, however, the high variance of REINFORCE is not always a problem in the domain of finetuning LLMs&#8212;<em>this simple algorithm is actually quite effective in practice</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1owh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1owh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 424w, https://substackcdn.com/image/fetch/$s_!1owh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 848w, https://substackcdn.com/image/fetch/$s_!1owh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 1272w, https://substackcdn.com/image/fetch/$s_!1owh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1owh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png" width="1456" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1owh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 424w, https://substackcdn.com/image/fetch/$s_!1owh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 848w, https://substackcdn.com/image/fetch/$s_!1owh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 1272w, https://substackcdn.com/image/fetch/$s_!1owh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01ef1923-61fa-4a60-8ab3-79aaca2573a4_2146x824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Key models involved in training with REINFORCE</figcaption></figure></div><p><strong>Modeling full completions.</strong> There is one final detail missing from the image above: <em>How do we aggregate the log probabilities, KL divergences, and rewards to form the policy gradient update?</em> One of the key distinguishing aspects of REINFORCE is that is uses a bandit formulation. The policy is trained by considering the full completion, rather than each token in the completion, as a single action.</p><blockquote><p><em>&#8220;[REINFORCE] treats the entire model completion as a single action, whereas regular PPO treats <strong>each completion token</strong> as individual actions. Typically, only the EOS token gets a true reward, which is very sparse. Regular PPO would attribute a reward to the EOS token, whereas [REINFORCE] would attribute that EOS reward to the entire completion.&#8221;</em> - from [5]</p></blockquote><p>As we&#8217;ve learned, most LLMs are trained using an outcome reward setting, meaning that only the final <code>&lt;eos&gt;</code> token generated by the LLM is assigned a reward. However, the KL divergence is computed on the per-token basis, and&#8212;<em>as mentioned before</em>&#8212;the KL divergence is directly subtracted from the reward in REINFORCE. Therefore, we end up with a setup where the reward for all tokens in the completion is just the KL divergence, but the final token in the completion receives an additional reward from the reward model; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QIGq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QIGq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 424w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 848w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 1272w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QIGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png" width="578" height="398.9629120879121" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1005,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:268570,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QIGq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 424w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 848w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 1272w, https://substackcdn.com/image/fetch/$s_!QIGq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92525dd7-7c62-47e9-9131-3522e0d61864_1864x1286.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Bandit formulation in REINFORCE</figcaption></figure></div><p>We create a completion-level (bandit formulation) reward by summing per-token KL divergences and rewards over the sequence. Similarly, we can sum token-level log probabilities to get the log probability of the completion (or trajectory)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. As shown above, we can then use these completion-level components to compute the policy gradient similarly to before:</p><ol><li><p>Subtract the baseline (average reward) from the completion-level reward.</p></li><li><p>Multiply this difference by the completion log probability.</p></li><li><p>Run a backward pass to compute the final policy gradient<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p></li></ol><p>This process computes the policy gradient for a single prompt and completion pair, but we generally average this gradient over a batch of completions.</p><p><strong>Pseudocode.</strong> As a final step, we will make this discussion more concrete by implementing the computation of the policy gradient for REINFORCE in basic PyTorch<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. We will assume that the baseline is computed by taking an average of rewards in the batch (i.e., rather than by using a moving average) so that the entire gradient update can be outlined within a single script; see below.</p><pre><code>import torch

# constants
kl_beta = 0.1

# batch of two completions with three tokens each
per_token_logprobs = torch.tensor(
    [
        [-12.3, -8.3, -2.3],
        [-10.0, -7.0, -3.0],
    ],
    requires_grad=True,
)
reference_per_token_logprobs = torch.tensor([
    [-11.3, -8.4, -2.0],
    [-9.5, -7.2, -2.8],
])

# compute KL divergence approximation
kl_div = per_token_logprobs - reference_per_token_logprobs
kl_div = -kl_beta * kl_div

# get reward for each completion (e.g., from reward model)
score_from_rm = torch.tensor([1.0, 0.5])

# reward is attributed to final &lt;eos&gt; token
per_token_reward = kl_div.clone()
per_token_reward[range(per_token_reward.size(0)), -1] += score_from_rm

# compute REINFORCE update over full sequence
entire_completion_reward = per_token_reward.sum(dim=1)
baseline = entire_completion_reward.mean().detach()

# compute advantage
advantage = entire_completion_reward - baseline

# compute loss and gradient update
reinforce_loss = -per_token_logprobs.sum(dim=1) * advantage
reinforce_loss.mean().backward()</code></pre><h4><a href="https://openreview.net/forum?id=r1lgTGL5DE">REINFORCE Leave One Out (RLOO)</a> [2]</h4><p>In REINFORCE, we generate a single on-policy completion per prompt during training and use the rewards from these completions to form our baseline via a moving average or an average of rewards in the batch. REINFORCE leave-one-out (RLOO) [2] changes this approach by:</p><ol><li><p>Sampling multiple (<code>K</code>) completions per prompt.</p></li><li><p>Using these multiple completions to compute the reward average separately for each individual prompt. </p></li></ol><p>Given <code>K</code> completions <code>{y_1, y_2, &#8230;, y_K}</code> to a prompt <code>x</code>, RLOO defines the baseline for completion <code>y_i</code> as shown below, which is simply an average over all rewards for completions to prompt <code>x</code> excluding the completion itself <code>y_i</code>. We &#8220;leave out&#8221; the reward of the completion for which the policy gradient is being computed and average over the rewards of other completions to the same prompt.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QwPe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QwPe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 424w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 848w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 1272w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QwPe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png" width="448" height="236.17033773861968" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1362,&quot;resizeWidth&quot;:448,&quot;bytes&quot;:146352,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QwPe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 424w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 848w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 1272w, https://substackcdn.com/image/fetch/$s_!QwPe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c5886bd-c7b0-461c-928b-d81e070fa7a2_1362x718.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Computing the baseline for RLOO</figcaption></figure></div><p>From here, we can compute the advantage estimate in RLOO by <em>i)</em> computing this baseline for every completion in the batch and <em>ii)</em> subtracting the baseline from the reward received by the completion; see below (first equation). To efficiently compute the baseline for RLOO, we can first compute a fixed average reward over the <code>K</code> completions and reformulate the advantage as in the second equation below. This approach allows us to compute the average reward once and avoid re-computing the leave one out average for all <code>K</code> completions to the prompt <code>x</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B9wg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B9wg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 424w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 848w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 1272w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B9wg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png" width="530" height="291.9368131868132" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:802,&quot;width&quot;:1456,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:278681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B9wg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 424w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 848w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 1272w, https://substackcdn.com/image/fetch/$s_!B9wg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e7191e-5a8f-4db8-ad1d-da6be1a66c54_1808x996.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Advantage estimate in RLOO</figcaption></figure></div><p>This modified advantage estimate can be plugged into the same policy gradient expression used by REINFORCE. Similarly to REINFORCE, RLOO uses a per-completion&#8212;<em>as opposed to per-token</em>&#8212;loss, and we have no learned value model. However, the leave one out baseline used by RLOO lowers variance relative to the standard REINFORCE algorithm by using multiple samples per prompt to derive the policy gradient estimate. Compared to a single-sample approach, taking multiple samples per prompt benefits training stability, speed, and performance.</p><blockquote><p><em>&#8220;The common case of sampling one prediction per datapoint is data-inefficient. We show that by drawing multiple samples per datapoint, we can learn with significantly less data, as we freely obtain a REINFORCE baseline to reduce variance.&#8221;</em> - from [2]</p></blockquote><p><strong>Practical usage.</strong> After the popularization of RLOO for LLMs, a great blog on this topic was published by HuggingFace [5] exploring the implementation and practical performance of RLOO. This analysis extends authors&#8217; prior work on correctly implementing and tuning PPO-based RLHF on summarization tasks [6]&#8212;<em>OpenAI&#8217;s <a href="https://huggingface.co/datasets/openai/summarize_from_feedback">TL;DR summarization dataset</a> in particular.</em> In [5], these results are extended by training Pythia 1B and 6.9B models with RLOO, starting from the same SFT checkpoints and reward models from [6]. Models are evaluated by comparing their output to a reference summary with a GPT-4 judge; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fkqw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fkqw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 424w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 848w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 1272w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fkqw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png" width="1456" height="536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198793,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fkqw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 424w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 848w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 1272w, https://substackcdn.com/image/fetch/$s_!Fkqw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43a3270c-bd5c-4a96-9e87-019ea3d70082_1520x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>As we can see, RLOO uses 50-70% less memory than PPO and runs 2-3&#215; faster. These savings increase with the size of the model. In addition to these gains in efficiency, RLOO performs competitively to PPO and consistently outperforms offline algorithms like DPO. These results demonstrate the key value proposition of RLOO (and REINFORCE)&#8212;<em>these algorithms maintain the performance benefits of online RL algorithms while being simpler to implement and less costly to run</em>. </p><p><strong>Pseudocode. </strong>To implement RLOO, we can modify our original REINFORCE example as shown below. Here, we assume that three completions are sampled per prompt (i.e., <code>K = 3</code>) and that our batch is composed of three prompts. For more production-ready code, both REINFORCE and RLOO are also supported within the volcano engine reinforcement learning (verl) library [7]; see <a href="https://github.com/volcengine/verl">here</a>.</p><pre><code>import torch

# constants
K = 3  # completions per prompt
kl_beta = 0.1

# batch of three prompts with three completions each
per_token_logprobs = torch.tensor(
    [
        # prompt 1
        [
            [-12.3, -8.3, -2.3], # completion 1
            [-10.0, -7.0, -3.0], # completion 2
            [-10.5, -12.2, -9.1], # completion 3
        ],

        # prompt 2
        [
            [-11.0, -10.3, -1.3],
            [-11.1, -11.1, -0.8],   
            [-8.2, -11.9, -0.1],        

        ],
        
        # prompt 3
        [
            [-1.8, -2.1, -0.2],
            [-0.7, -3.5, -0.1],
            [-1.0, -2.2, -1.1],
        ],
    ],
    requires_grad=True,
)
reference_per_token_logprobs = torch.tensor([
    [
        [-11.8, -8.4, -2.3], 
        [-10.1, -7.2, -3.1],
        [-10.3, -12.9, -9.1],
    ],
    [
        [-11.8, -9.7, -1.3],
        [-12.3, -11.9, -0.2],
        [-8.1, -12.0, -0.5],
    ],
    [
        [-2.7, -2.0, -1.2],
        [-0.7, -3.6, -0.2],
        [-0.7, -1.2, -0.9],
    ],
])

# compute KL divergence approximation
kl_div = per_token_logprobs - reference_per_token_logprobs
kl_div = -kl_beta * kl_div

# reward for each completion (grouped by prompt)
score_from_rm = torch.tensor([
    [1, 2, 3], # rewards for completions to prompt 1
    [2, 3, 4], # rewards for completions to prompt 2
    [3, 4, 5], # rewards for completions to prompt 3
]).float()

# reward attributed to final &lt;eos&gt; token
per_token_reward = kl_div.clone()
per_token_reward[:, :, -1] += score_from_rm

# compute full sequence reward
entire_completion_reward = per_token_reward.sum(dim=-1)

# compute RLOO baseline in vectorized fashion
baseline = (
    entire_completion_reward.sum(dim=-1)[:, None]
    - entire_completion_reward
) / (K - 1)
baseline = baseline.detach()

# compute advantage and loss
advantage = entire_completion_reward - baseline
rloo_loss = -per_token_logprobs.sum(dim=-1) * advantage
rloo_loss.mean().backward()</code></pre><h4><strong><a href="https://arxiv.org/abs/2402.14740">Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs</a> [3]</strong></h4><blockquote><p><em>&#8220;We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance.&#8221;</em> - from [3]</p></blockquote><p>Although PPO is the de facto RL optimizer for RLHF, authors in [3] argue that the original motivations for PPO (i.e., avoiding large and unstable policy updates) are less relevant in the context of LLMs. Instead, we can use simpler RL optimizers&#8212;<em>REINFORCE in particular</em>&#8212;to save on compute and memory costs without sacrificing performance. In particular, we learn that aligning LLMS with a basic REINFORCE algorithm can achieve results that match or exceed those of PPO-based RLHF, as well as other algorithms like <a href="https://cameronrwolfe.substack.com/p/direct-preference-optimization">DPO</a> and <a href="https://arxiv.org/abs/2304.06767">RAFT</a>. This paper was a key contribution that popularized the use of simpler RL optimizers for LLMs.</p><p><strong>LLMs versus DeepRL.</strong> The crux of the argument in [3] revolves around the fact that LLM finetuning is a unique setting for RL that differs significantly from the <a href="https://spinningup.openai.com/en/latest/">traditional DeepRL setting</a> in which algorithms like PPO were proposed. The most notable difference between these two settings is that LLMs are not trained with RL from scratch. Rather, <em>we are finetuning an LLM that has already undergone extensive pretraining</em>. This difference has two key implications:</p><ul><li><p>The risk of policy updates with catastrophically large variance is lower in LLM finetuning relative to the traditional DeepRL setting.</p></li><li><p>The LLM finetuning setting has less of a need for regularizing the learning process relative to the traditional DeepRL setting.</p></li></ul><p>We can concretely test this hypothesis by tweaking the settings of PPO. Namely, most implementations of PPO use <a href="https://danieltakeshi.github.io/2017/04/02/notes-on-the-generalized-advantage-estimation-paper/">Generalized Advantage Estimation (GAE)</a> [4] to estimate the advantage function. The details of GAE are beyond the scope of this post. However, GAE contains the <code>&#955; &#8712; [0.0, 1.0]</code> hyperparameter that can be used to control the tradeoff between bias and variance in the advantage estimate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TXwn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TXwn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 424w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 848w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 1272w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TXwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png" width="1266" height="654" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1266,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198158,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TXwn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 424w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 848w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 1272w, https://substackcdn.com/image/fetch/$s_!TXwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7224cd65-a3fd-4593-8a48-0abcfeed30bf_1266x654.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Lowering <code>&#955;</code> reduces variance at the cost of increased bias, but this is a worthwhile tradeoff for domains&#8212;<em>like DeepRL</em>&#8212;with excessive variance in policy updates. As shown above, optimal performance in LLM alignment is achieved with a setting of <code>&#955; = 1.0</code>, which induces maximum possible variance in the policy gradient. Such a finding indicates that the level of variance in policy updates observed for LLM alignment is not detrimental to the LLM&#8217;s learning process.</p><blockquote><p><em>&#8220;Large off-policy updates in our optimization regime are rare and do not have catastrophic effects on learning as they do in traditional DeepRL.&#8221;</em> - from [3]</p></blockquote><p><strong>Effective action space.</strong> In addition to high variance, one complicating factor of RL training is the presence of a large action space. If there are many possible actions for the policy to take and rewards from these actions are noisy, learning a high-quality policy is difficult. Theoretically, the action space of an LLM is very large&#8212;<em>it includes all completions that the LLM can generate for a given prompt</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7xYY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7xYY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 424w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 848w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1272w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png" width="1274" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1274,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7xYY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 424w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 848w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1272w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Practically speaking, however, the effective action space of an LLM&#8212;<em>the set of completions that the model is likely to generate</em>&#8212;is actually quite small. When an LLM is performing generation, this process is conditioned upon the prompt provided to the LLM, which is shown in [3] to be a strong conditioning. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7xYY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7xYY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 424w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 848w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1272w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png" width="1274" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1274,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214107,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7xYY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 424w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 848w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1272w, https://substackcdn.com/image/fetch/$s_!7xYY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc836c3b8-7c81-4c3c-bf67-3682aad92a86_1274x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>More specifically, we see in the figure above that probability mass in an LLM&#8217;s completions is highly-concentrated amongst a small number of tokens after the first step of the generation process (i.e., the first token that is outputted). Such an observation demonstrates that an LLM&#8217;s prompt provides strong conditioning for the generation process, <em>which makes the mode&#8217;s effective action space quite small</em>. </p><p><strong>From PPO to REINFORCE.</strong> Given that variance is less of a concern for LLMs, authors in [3] perform RLHF experiments that use much simpler REINFORCE and RLOO algorithms as the RL optimizer in place of PPO. REINFORCE and RLOO make significant changes to the RL formulation used in PPO. Namely, PPO uses a per-token MDP formulation, while both REINFORCE and RLOO adopt a bandit formulation&#8212;<em>the entire completion is modeled as a single action</em>. </p><blockquote><p><em>&#8220;We show that the modeling of partial sequences is unnecessary in this setting where rewards are only attributed to full generations&#8230; it is more appropriate and efficient to model the entire generation as a single action with the initial state determined by the prompt.&#8221;</em> - from [3]</p></blockquote><p>In addition to being simpler than the MDP formulation, modeling the full generation as a single action preserves the LLM&#8217;s performance and even speeds up learning, <em>indicating that formulating each token as its own action is an unnecessary complexity in an outcome reward setting</em>. </p><p><strong>Experimental setup. </strong>Experiments in [3] are conducted on the <a href="https://huggingface.co/datasets/CarperAI/openai_summarize_tldr">TL;DR summarize</a> and <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf">Anthropic HH</a> datasets with <a href="https://huggingface.co/EleutherAI/pythia-6.9b">Pythia-6.9b</a> and <a href="https://huggingface.co/meta-llama/Llama-2-7b">Llama-7b</a> models. Both reward models and policies are initialized using a model checkpoint obtained by running SFT on a curated dataset of high-quality completions for each respective dataset. During RL, training prompts are sampled from the SFT dataset. For evaluation, authors report each model&#8217;s average reward&#8212;<em>from the fixed reward model used for RL training</em>&#8212;on a hold out test set, as well as win-rates against GPT-4 using the <a href="https://arxiv.org/abs/2305.14387">AlpacaFarm framework</a> (i.e., open-ended evaluation on chat-style prompts). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dMtu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dMtu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 424w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 848w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 1272w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dMtu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png" width="1456" height="823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dMtu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 424w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 848w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 1272w, https://substackcdn.com/image/fetch/$s_!dMtu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d5c4571-e684-4930-ad3f-e82ff9454f76_1582x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Is REINFORCE effective?</strong> As shown above, both REINFORCE and RLOO&#8212;<em>in addition to being less memory intensive due to their lack of a learned critic model</em>&#8212;consistently outperform PPO, confirming that modeling partial sequences is unnecessary for the RLHF setting in [3]. RLOO is also found to be more sample efficient than the <a href="https://arxiv.org/abs/2304.06767">RAFT algorithm</a> [9]&#8212;<em>given the same number of on-policy samples generated during training, RLOO tends to achieve better performance</em>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEoc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEoc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 424w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 848w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 1272w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEoc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png" width="1456" height="650" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:650,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yEoc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 424w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 848w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 1272w, https://substackcdn.com/image/fetch/$s_!yEoc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdead7897-21dd-45ac-a63d-e8e25b922989_1666x744.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>This finding holds true for all models and data tested in [3]. The superior sample efficiency of RLOO makes intuitive sense given that all samples&#8212;<em>even those with poor or negative reward</em>&#8212;are used during training. In contrast, RAFT filters samples based on their reward and only trains on those with the best rewards. </p><p>When we evaluate models in terms of simulated win-rates on AlpacaFarm, many of the results above continue to be true, but we can compare the performance of each technique in a more human-understandable manner. As shown below, the best performance is consistently achieved with RLOO, and both REINFORCE and RLOO consistently outperform PPO. Notably, RLOO&#8212;<em>with four on-policy samples per prompt</em>&#8212;outperforms PPO by an absolute increase in win-rate of 10.4% and 14.5% for TL;DR and HH datasets. When used to align Llama, RLOO sees an even larger absolute win-rate improvement of 32.1% over PPO.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EFho!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EFho!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 424w, https://substackcdn.com/image/fetch/$s_!EFho!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 848w, https://substackcdn.com/image/fetch/$s_!EFho!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 1272w, https://substackcdn.com/image/fetch/$s_!EFho!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EFho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png" width="1456" height="471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:388079,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EFho!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 424w, https://substackcdn.com/image/fetch/$s_!EFho!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 848w, https://substackcdn.com/image/fetch/$s_!EFho!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 1272w, https://substackcdn.com/image/fetch/$s_!EFho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe75831a0-6467-4f19-9f48-793b78cdf1ee_2472x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Improved robustness.</strong> Authors in [3] conclude by studying the robustness of RLOO relative to RAFT in two areas:</p><ul><li><p>How does increasing the &#946; term for KL divergence impact performance?</p></li><li><p>How does adding noise to the reward estimate impact performance?</p></li></ul><p>Interestingly, RLOO is found to be noticeably more robust to noise relative to RAFT; see below. When increasing &#946;, RAFT performs worse than RLOO and produces a policy with a larger KL divergence relative to the reference policy. Additionally, the performance of RAFT sees a larger negative impact from noisy reward estimates relative to RLOO. Such degraded robustness to noise is caused by the fact that RAFT only trains on the highest-reward completions, <em>leading any perturbation to reward estimates to significantly impact training</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YWO-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YWO-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 424w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 848w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 1272w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YWO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png" width="1250" height="1204" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1204,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:458912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/173306894?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YWO-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 424w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 848w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 1272w, https://substackcdn.com/image/fetch/$s_!YWO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74163b82-ab60-4220-a908-1dfa70f2268b_1250x1204.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h2>Conclusion</h2><p>We now have a foundational understanding of RL for LLMs that spans from basic terminology to functional implementations of online RL algorithms. Most work on RL training for LLMs uses actor-critic algorithms like PPO as the underlying optimizer. But, these algorithms introduce complexity and overhead to reduce the variance of policy gradients. In the context of LLMs, we have learned that much simpler online RL algorithms are available! REINFORCE and RLOO adopt a completion-level bandit setup for RL and normalize rewards using either:</p><ul><li><p>The average of rewards during training (for REINFORCE), or</p></li><li><p>The average of rewards for other completions to a prompt (for RLOO).</p></li></ul><p>Because they estimate the value function in this way, neither REINFORCE or RLOO require a learned critic, which reduces memory overhead and speeds up the training process. If we want to avoid the complexity of algorithms like PPO, these simpler online RL algorithms offer an effective alternative, rather than immediately turning to approaches that are completely offline or RL-free.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Williams, Ronald J. &#8220;Simple statistical gradient-following algorithms for connectionist reinforcement learning.&#8221; <em>Machine learning</em> 8.3 (1992): 229-256.</p><p>[2] Kool, Wouter, Herke van Hoof, and Max Welling. &#8220;Buy 4 reinforce samples, get a baseline for free!.&#8221; (2019).</p><p>[3] Ahmadian, Arash, et al. &#8220;Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.&#8221; <em>arXiv preprint arXiv:2402.14740</em> (2024).</p><p>[4] Schulman, John, et al. &#8220;High-dimensional continuous control using generalized advantage estimation.&#8221; <em>arXiv preprint arXiv:1506.02438</em> (2015).</p><p>[5] Costa Huang, Shengyi, et al. &#8220;Putting RL back in RLHF&#8221; <a href="https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo">https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo</a> (2024).</p><p>[6] Huang, Shengyi, et al. &#8220;The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization.&#8221; <em>arXiv preprint arXiv:2403.17031</em> (2024).<br>[7] Sheng, Guangming, et al. &#8220;Hybridflow: A flexible and efficient rlhf framework.&#8221; <em>Proceedings of the Twentieth European Conference on Computer Systems</em>. 2025.</p><p>[8] Lightman, Hunter, et al. &#8220;Let&#8217;s verify step by step.&#8221; <em>The Twelfth International Conference on Learning Representations</em>. 2023.</p><p>[9] Dong, Hanze, et al. &#8220;Raft: Reward ranked finetuning for generative foundation model alignment.&#8221; <em>arXiv preprint arXiv:2304.06767</em> (2023).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In other words, the output of our policy is not just a discrete action. Rather, it is a probability distribution over a set of possible actions. For example, LLMs output a probability distribution over the set of potential next tokens.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Additionally, we can have a finite or infinite-horizon setup in this return. However, in the context of LLMs, we usually assume a finite-horizon setup (i.e., the LLM does not continue generating tokens forever). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here, we use gradient ascent (as opposed to descent) because we are trying to maximize a function. However, gradient ascent and descent are nearly identical. The only change is whether we subtract&#8212;<em>if minimizing a function in gradient descent</em>&#8212;or add&#8212;<em>if maximizing a function in gradient ascent</em>&#8212;the gradient to our model&#8217;s parameters. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Process supervision is possible and has been explored in research on large reasoning models (LRMs), but it is less common than the outcome reward setting.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Additionally, adding baselines to the policy gradient does not bias our gradient estimate. This fact can be proven by using the <a href="https://cameronrwolfe.substack.com/i/137421286/variants-of-the-basic-policy-gradient">EGLP lemma</a>, which also mandates that the baseline must only depend on the state <code>s_t</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>By &#8220;head&#8221;, we mean an extra small layer added to the end of the LLM that is trainable.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The &#8220;actor&#8221; refers to the LLM&#8212;<em>or the model that is taking actions</em>&#8212;and the &#8220;critic&#8221; refers to the value model. The value model is called a critic due to the fact that it is predicting the reward associated with each action (i.e., effectively critiquing the action).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>This stems from basic concepts in language modeling. Namely, we can take the product of probabilities for all tokens in a completion (or the sum of log probabilities) to get the probability of the full completion. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Our policy gradient term contains the gradient of log probabilities, but we have access to log probabilities (not the gradient of log probabilities) in our example. So, we need to take the gradient of these log probabilities&#8212;<em>usually by running </em><code>loss.backward()</code><em> in PyTorch</em>&#8212;to get the final policy gradient. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>This implementation, as well as our later implementation of RLOO, is just a modified version of the code from <a href="https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo">this blog post</a>. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Online versus Offline RL for LLMs]]></title><description><![CDATA[A deep dive into the online-offline performance gap in LLM alignment...]]></description><link>https://cameronrwolfe.substack.com/p/online-rl</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/online-rl</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 08 Sep 2025 09:33:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/496476e6-d878-4cb2-9e11-948ac7e2e443_2240x1258.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1FDy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1FDy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 424w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 848w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1FDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1724317,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1FDy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 424w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 848w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!1FDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b982657-5755-490c-9097-9fc68c9199c9_2484x1394.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2, 5, 7, 9, 10])</figcaption></figure></div><p>The alignment process teaches large language models (LLMs) how to generate completions that receive high human preference scores. The traditional strategy for alignment includes supervised finetuning and proximal policy optimization (PPO)-based reinforcement learning from human feedback (RLHF). Although this approach works well, PPO-based RLHF is an online RL training algorithm that is complex to implement for a variety of reasons:</p><ul><li><p>PPO actively runs inference to generate samples with the current LLM&#8212;<em>known as &#8220;on-policy&#8221; samples</em>&#8212;during training. The real-time generation of on-policy data is what makes PPO an online algorithm.</p></li><li><p>Online RL training is difficult to efficiently orchestrate&#8212;<em>especially in <a href="https://rlhfbook.com/c/11-policy-gradients.html#asynchronicity">synchronous</a> training setups</em>&#8212;and often suffers from stability issues</p></li><li><p>PPO requires storing multiple copies of the LLM during training, leading to significant memory overhead and high hardware requirements.</p></li><li><p>PPO involves a wide range of training settings and design decisions that must be managed for successful training [21].</p></li></ul><p>We can try to avoid the complexities of online RL by <em>i)</em> using lower-overhead online RL algorithms, <em>ii)</em> developing offline algorithms, or even <em>iii)</em> eliminating RL from the alignment process altogether. However, online RL is highly performant, and simpler alignment algorithms tend to come at cost in performance. </p><blockquote><p><em>&#8220;Some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient.&#8221;</em> - from [5]<em> </em> </p></blockquote><p>In this overview, we will explore alternatives to online, PPO-based reinforcement learning from human feedback for LLM alignment. In particular, our focus will be on analyzing the performance gap between online algorithms that perform on-policy sampling and offline algorithms that train the LLM over a fixed dataset. By studying papers in this area, we will answer the following questions:</p><ul><li><p>Is reinforcement learning needed for high-quality LLM alignment?</p></li><li><p>Is sampling on-policy training data important for alignment?</p></li></ul><p>As we will see, on-policy sampling provides a clear performance advantage, creating a gap between online and offline alignment algorithms. However, offline (or RL-free) approaches can still be effective despite this online-offline gap. In particular, enhancing offline algorithms with on-policy data can form semi-online algorithms that are effective and easier to implement relative to full online RL.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to stay up-to-date with AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Alignment Algorithms for LLMs</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OuS0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" width="1456" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To begin, we will quickly delve into the role of alignment in LLM training and outline the many variants of online and offline alignment algorithms that currently exist. Modern LLMs are trained in several stages, as depicted in the figure above. The key training stages for an LLM are as follows:</p><ol><li><p><strong>Pretraining</strong> is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> training objective; see <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">here</a>.</p></li><li><p><strong>Supervised finetuning (SFT)</strong> or <strong>instruction finetuning (IFT)</strong> also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate; see <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">here</a>.</p></li><li><p><strong>Reinforcement learning from human feedback (RLHF)</strong> or <strong>preference finetuning (PreFT)</strong> uses <a href="https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning">reinforcement learning (RL)</a> to train the LLM over human preference data; see <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">here</a>.</p></li><li><p><strong>Reinforcement learning from verifiable rewards (RLVR)</strong> or <strong>reinforcement finetuning (RFT) </strong>trains the LLM with RL on <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">verifiable tasks</a>, where a reward can be derived deterministically from rules or heuristics.</p></li></ol><p>We can group the training strategies outlined above into distinct stages; see below. The pretraining (and <a href="https://arxiv.org/abs/2506.20512">midtraining</a>) process focuses on building the core knowledge base of the LLM, while alignment teaches the LLM correct formatting and style for maximizing human preference scores. Reasoning training is a final step that yields an additional boost in performance on verifiable tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_ctK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_ctK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 424w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 848w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_ctK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png" width="1456" height="629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:737479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_ctK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 424w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 848w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!_ctK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bc3b656-3e0c-4211-bc8b-e19fd692b0e2_2364x1022.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Grouping training steps into distinct stages</figcaption></figure></div><p>This overview focuses on LLM alignment and the many algorithms&#8212;<em>including SFT and many forms of RL-based and RL-free RLHF</em>&#8212;that have been proposed. We will focus especially on the role and necessity of online RL&#8212;<em>as opposed to using simpler, offline alignment algorithms</em>&#8212;in the RLHF training process. In this section, we will kick off this discussion by explaining the many options that exist for alignment algorithms, including both online and offline algorithms. </p><h4>Supervised Finetuning (SFT)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rAN6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rAN6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 424w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 848w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rAN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png" width="544" height="463.73770491803276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1342,&quot;resizeWidth&quot;:544,&quot;bytes&quot;:203045,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rAN6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 424w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 848w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!rAN6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc13900bb-c2f1-4188-96f1-8acc377a8692_1342x1144.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Next-token prediction training objective</figcaption></figure></div><p>One of the simplest LLM alignment strategies is supervised finetuning (SFT), which adopts the same <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction training objective</a> used during pretraining. We train the LLM to predict the next token in a sequence given all prior tokens as context (shown above)&#8212;<em>this is a <a href="https://cameronrwolfe.substack.com/i/76273144/self-supervised-learning">self-supervised</a> training objective that can be applied efficiently to large volumes of raw text data</em>. A basic implementation of the next token prediction training objective is provided below for reference.</p><pre><code>import torch
import torch.nn.functional as F

# token_indices: (batch_size, seq_length)
logits = LLM(token_indices)  # (batch_size, seq_length, vocab_size)

# shift to predict next token at each position
logits = logits[:, :-1, :]  # (batch_size, seq_length - 1, vocab_size)
targets = token_indices[:, 1:]  # (batch_size, seq_length - 1)

# resize tensors for cross-entropy loss
logits = logits.reshape(-1, logits.size(-1))
targets = targets.reshape(-1)

# compute cross-entropy loss
loss = F.cross_entropy(logits, targets)</code></pre><p>During pretraining, this training objective is applied over a <a href="https://cameronrwolfe.substack.com/i/152758713/scaling-and-the-age-of-pretraining">massive corpus of text</a> scraped from the internet. In contrast, SFT focuses upon curating a smaller set of high-quality prompt-response pairs for aligning the LLM. For example, <a href="https://arxiv.org/abs/2305.11206">LIMA</a> is a popular paper that aligned an LLM using SFT with a curated dataset of only 1K examples. Recent LLMs use a larger number of samples in the SFT dataset; e.g., <a href="https://arxiv.org/abs/2411.15124">Tulu-3</a> is trained with <a href="https://huggingface.co/datasets/allenai/tulu-3-sft-mixture">~1M SFT examples</a>. Put simply, <em>SFT aligns an LLM by training the model over concrete demonstrations of preferable responses</em>. </p><p>In most cases, we can achieve better performance by using a completion-only loss in SFT, meaning that the cross-entropy loss is masked for all prompt tokens and only applied to tokens within the response or completion<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. For a more detailed exposition of SFT, please see my prior overview on this topic linked below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b76ea1fd-2ddc-4a08-8348-04eb38931acb&quot;,&quot;caption&quot;:&quot;One of the most widely-used alignment techniques for LLMs is supervised fine-tuning (SFT), which trains the model over a curated dataset of high-quality demonstrations using a standard language modeling objective. SFT is simple / cheap to use and is a useful tool for aligning LLMs.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding and Using Supervised Fine-Tuning (SFT) for Language Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-09-11T09:02:08.451Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68686a01-2b31-4694-8c04-a562ffd725ad_2210x1244.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/understanding-and-using-supervised&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136815345,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:64,&quot;comment_count&quot;:5,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Rejection sampling</strong> is an online variant of SFT that is an extremely effective and easy to use. The standard formulation for SFT is offline&#8212;<em>we train the model over a fixed dataset of prompt-response pairs.</em> <a href="https://rlhfbook.com/c/10-rejection-sampling.html">Rejection sampling</a> changes this setup by:</p><ul><li><p>Starting with a dataset of prompts.</p></li><li><p>Generating completions for each prompt with the current LLM.</p></li><li><p>Scoring all of these completions using a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> or <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a>.</p></li><li><p>Selecting (or filtering) the top-scoring prompt-completion pairs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p></li><li><p>Performing SFT over these top examples. </p></li></ul><p>The rejection sampling process is depicted below. This approach trains the LLM in a similar fashion to SFT, <em>but the difference lies in the data</em>. We are using the LLM itself to sample SFT training data in a semi-online fashion. The reward model is used to ensure that we are training over the highest-quality completions. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9417!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9417!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 424w, https://substackcdn.com/image/fetch/$s_!9417!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 848w, https://substackcdn.com/image/fetch/$s_!9417!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 1272w, https://substackcdn.com/image/fetch/$s_!9417!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9417!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png" width="1456" height="344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 1: Rejection sampling overview.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 1: Rejection sampling overview." title="Figure 1: Rejection sampling overview." srcset="https://substackcdn.com/image/fetch/$s_!9417!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 424w, https://substackcdn.com/image/fetch/$s_!9417!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 848w, https://substackcdn.com/image/fetch/$s_!9417!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 1272w, https://substackcdn.com/image/fetch/$s_!9417!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bef5ae4-50d1-4c2d-a97b-a482dc7fe583_2856x675.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from <a href="https://rlhfbook.com/c/10-rejection-sampling.html">RLHF book</a>, <a href="https://github.com/natolambert/rlhf-book/blob/main/LICENSE-Content.md">license</a>)</figcaption></figure></div><p>We typically perform rejection sampling iteratively. For example, the <a href="https://arxiv.org/abs/2307.09288">Llama-2</a> alignment process uses four rounds of rejection sampling before RL-based RLHF. </p><p>In the discussion above, we described rejection sampling as a variant of SFT, since both use the same training objective. However, rejection sampling is actually a preference tuning technique and is most often used as a simpler alternative to RLHF&#8212;<em>not as an alternative to SFT</em>. In practice, rejection sampling is usually applied after SFT, rather than in place of it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!taX3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!taX3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 424w, https://substackcdn.com/image/fetch/$s_!taX3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 848w, https://substackcdn.com/image/fetch/$s_!taX3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 1272w, https://substackcdn.com/image/fetch/$s_!taX3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!taX3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!taX3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 424w, https://substackcdn.com/image/fetch/$s_!taX3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 848w, https://substackcdn.com/image/fetch/$s_!taX3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 1272w, https://substackcdn.com/image/fetch/$s_!taX3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ee77ee0-b4bf-4222-99c1-b0ec75b54bac_1566x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p><strong>SFT variants.</strong> Beyond rejection sampling (also called Best-of-<code>N</code> sampling), there are several online or iterative variants of SFT that have been proposed. Some notable examples that we will encounter in this overview include:</p><ul><li><p><em><a href="https://arxiv.org/abs/2310.16763">Supervised Iterative Learning from Human Feedback (SuperHF)</a></em> [13] is an online learning technique that samples a batch of on-policy outputs from a model, filters these outputs with a reward model, and optimizes the model using a supervised objective under a <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a> constraint; see above. </p></li><li><p><em><a href="https://arxiv.org/abs/2308.08998">Reinforced Self-Training (ReST)</a></em> [14] uses the rejection sampling formulation outlined above, in which we iteratively sample on-policy data from the LLM, score each sample with a reward model, and train on the best samples.</p></li><li><p><em><a href="https://arxiv.org/abs/2308.12050">Reward-Weighted Regression (RWR)</a></em> [15] similarly uses the LLM to generate on-policy samples that are scored with a reward model. But, these scores are used to weight each sample in the training loss instead of for filtering.</p></li><li><p><em><a href="https://arxiv.org/abs/2304.06767">Reward Ranked Finetuning (RAFT)</a></em> [16] again adopts the standard rejection sampling setup that samples online completions from the LLM and filters these completions for use in SFT with scores from a reward model.</p></li></ul><h4>Reinforcement Learning (RL) Training</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJn6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [16])</figcaption></figure></div><p>There are two different types of reinforcement learning (RL) training that are commonly used to train LLMs (shown above):</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em> trains the LLM using RL with rewards derived from a human preference <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">Reinforcement Learning with Verifiable Rewards (RLVR)</a></em> trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.</p></li></ul><p>These RL training techniques differ mainly in how they derive the reward for training, but other details of the algorithms are mostly similar. As depicted below, they both operate by generating completions over a set of prompts, computing the reward for these completions, and using the rewards to derive a <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">policy update</a>&#8212;<em>or an update to the LLM&#8217;s parameters</em>&#8212;with an RL optimizer. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uPv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;[animate output image]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="[animate output image]" title="[animate output image]" srcset="https://substackcdn.com/image/fetch/$s_!uPv8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 424w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 848w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1272w, https://substackcdn.com/image/fetch/$s_!uPv8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56eba05c-359c-400d-920f-38a36dd4690a_1920x1078.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Visual walkthrough of RL training for LLMs</figcaption></figure></div><p>When we are optimizing an LLM with RL, we are trying to solve the objective shown below. This objective maximizes the reward received by the LLM&#8217;s completions while minimizing the <a href="https://cameronrwolfe.substack.com/i/167254905/kullback-leibler-kl-divergence">KL divergence</a> of the model with respect to a reference model&#8212;<em>usually an LLM checkpoint from the start of RL training</em>. Put simply, this means that we want to maximize reward without making our new model significantly different from the original (reference) model.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kyeM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" width="657" height="118.67513736263736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1456,&quot;resizeWidth&quot;:657,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">RL training objective</figcaption></figure></div><p><strong>On-policy sampling.</strong> As shown above, we perform on-policy sampling when training an LLM with RL. By &#8220;on-policy&#8221; sampling, we mean that completions used to train our LLM in the core RL training loop are generated in real-time by the LLM itself&#8212;<em>the completions are not generated by another model or stored in an offline, pre-computed dataset</em>. In the context of LLMs, training algorithms that use on-policy sampling are typically referred to as &#8220;online&#8221; training algorithms. On-policy sampling is not only used within the context of RL training; e.g., we learned about several online variants of SFT in the prior section. </p><p><strong>More on RLHF.</strong> This overview is focused upon LLM alignment, so we will mostly encounter RLHF-style training. Early approaches to LLM alignment used the three-stage technique (shown below) that combines SFT with RLHF. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dtl3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" width="1456" height="887" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:887,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>In RLHF, we begin by collecting a dataset of preference pairs, where each preference pair contains:</p><ul><li><p>A prompt.</p></li><li><p>A chosen (or winning) completion.</p></li><li><p>A rejected (or losing) completion.</p></li></ul><p>We then train a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> over the preference dataset and optimize our LLM with the RL training loop described above. The completions in this preference dataset can come from a variety of sources; e.g., the reference model, prior model checkpoints, or even completely different models. The preference annotation&#8212;<em>or selection of the chosen and rejected completion in the pair</em>&#8212;is usually provided either by a human annotator or LLM judge (i.e., <a href="https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from">AI feedback</a>). Notably, the preference data and reward model are fixed at the beginning of RL training. Making this a bit more formal, LLMs are trained with a variant of offline <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html">model-based RL</a>.</p><p><strong>RL optimizers.</strong> There is one detail missing from the above explanation of RL training: <em>how do we compute the policy update?</em> We will briefly address this question here, but interested readers should see <a href="https://rlhfbook.com/c/11-policy-gradients.html">this in-depth overview</a> for full details. Usually, a <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">policy gradient</a>-based RL optimizer (e.g., <a href="https://arxiv.org/abs/2402.14740">REINFORCE</a>, <a href="https://arxiv.org/abs/1707.06347">PPO</a>, or <a href="https://arxiv.org/abs/2402.03300">GRPO</a>) is used.  PPO-based RLHF has been the de facto choice in the past, but PPO is computationally expensive due to estimating the value function with an LLM. In fact, PPO-based RLHF stores four different copies of the LLM during training (i.e., the policy, reference policy, value model, and reward model). </p><p>To reduce overhead, REINFORCE derives a monte carlo estimate of the policy gradient by approximating the value function with an average of rewards received by the model throughout training (i.e., instead of with an LLM). In a similar vein, GRPO approximates the value function with an average of rewards from multiple completions to the same prompt&#8212;<em>referred to as a group</em>. Because GRPO is the most common RL optimizer for RLVR, it is also commonly used without a reward model. In this case, we only store two copies of the LLM&#8212;<em>the policy and reference policy</em>&#8212;for RL training. However, the lack of a reward model is a byproduct of RLVR (i.e., GRPO can be used with or without a reward model).</p><h4>Direct Alignment Techniques</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vj3B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vj3B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 424w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 848w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1272w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png" width="1456" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vj3B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 424w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 848w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1272w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>Because online RL training is so expensive, researchers have also proposed offline alignment techniques like direct preference optimization (DPO) [18]. Compared to PPO-based RLHF, DPO avoids training an explicit reward model and instead derives a reward signal implicitly from the LLM itself. Using this implicit reward, the LLM is trained with the <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a> objective shown below, which can be optimized using standard gradient descent (i.e., without any RL training). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yQz2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" width="1456" height="776" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DPO training loss (from [18])</figcaption></figure></div><p>Intuitively, this contrastive loss increases the probability margin between chosen and rejected responses in a preference dataset. The LLM is trained on a fixed preference dataset&#8212;<em>the same data that is used to train the reward model in RLHF</em>. For this reason, DPO is characterized as an offline&#8212;<em>meaning the training data is fixed and there is no on-policy sampling</em>&#8212;<a href="https://rlhfbook.com/c/12-direct-alignment.html">direct alignment algorithm</a>. Compared to RL-based alignment algorithms, DPO requires much less computational overhead, is easier to tune, and still tends to perform well; see below for more details.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3a352534-39f3-4329-867d-2f495b41cda6&quot;,&quot;caption&quot;:&quot;Alignment techniques like RLHF led to massive improvements in LLM quality, but they are computationally expensive and hard to use. This overview covers a simpler approach to LLM alignment, called DPO, that avoids these complexities by aligning LLMs with an objective that can be optimized with gradient descent.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Direct Preference Optimization (DPO)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-07-28T09:33:20.635Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdfcbd2e-ac10-4767-8a84-d54b07eeed2b_2488x1402.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/direct-preference-optimization&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:167254905,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:97,&quot;comment_count&quot;:17,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Variants of DPO.</strong> Because DPO was so much simpler to use relative to PPO-based RLHF, this technique quickly became popular within LLM research. As a result, many variants of DPO were proposed, such as Identity Preference Optimization (IPO) [8],  Kahneman-Tversky Optimization (KTO) [19], or Contrastive Preference Optimization (CPO) [20]. Many of these techniques make slight modifications to DPO that yield a <a href="https://huggingface.co/blog/pref-tuning">mild boost in performance</a>, but the core idea behind them&#8212;<em>in terms of using direct alignment with a contrastive objective</em>&#8212;is similar. Some of these techniques, however, are meaningfully different from DPO; e.g., KTO formulates a DPO-style loss than can be applied to a single completion with a binary (good or bad) rating as opposed to a preference pair.</p><p><strong>Online or iterative DPO.</strong> In its standard formulation, DPO is a completely offline alignment algorithm. The preference dataset is fixed throughout DPO training, but we can create online (or semi-online) DPO variants by introducing on-policy samples into the training process. As depicted below, one example of this idea is self-rewarding language models [10]. In this framework, we periodically sample fresh data for DPO training as follows:</p><ol><li><p>Start with a set of prompts.</p></li><li><p>Sample multiple completions to these prompts with the current LLM.</p></li><li><p>Rank these completions (e.g., using an LLM judge or a reward model) to create a preference dataset.</p></li><li><p>Train the LLM over this data using DPO.</p></li><li><p>Return to step one and repeat for several rounds. </p></li></ol><p>In this process, we iteratively train the model with DPO, but the training data is periodically re-sampled from the current policy&#8212;<em>this is a semi-online training setup</em>. We can make this approach more on-policy by sampling completions from the current policy more regularly. In fact, we can even create a fully-online DPO variant by sampling on-policy completions for every batch of training data!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UQAm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UQAm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 424w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 848w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 1272w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UQAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png" width="1076" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1076,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UQAm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 424w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 848w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 1272w, https://substackcdn.com/image/fetch/$s_!UQAm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1b71d60-3997-44e1-b186-ef6511b97599_1076x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><h2>The Online-Offline Performance Gap</h2><p>Although PPO-based RLHF was the standard choice for LLM alignment for some time, this approach is expensive, complex, and difficult to replicate outside of top LLM labs. As a result, researchers have developed a variety of simpler alignment algorithms based on offline and RL-free training strategies. In this section, we aim to answer the following question: <em>Does using offline alignment techniques come at a cost in performance?</em> To address this, we will review a range of papers that study the impact of offline training, the use of on-policy samples, contrastive training objectives, and other factors on LLM performance.</p><h4><a href="https://arxiv.org/abs/2404.10719">Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study</a> [6]</h4><blockquote><p><em>&#8220;Experiment results demonstrate that PPO is able to surpass other alignment methods in all&#8230; Particularly, in the most challenging code competition tasks, PPO achieves state-of-the-art results.&#8221;</em> - from [6]</p></blockquote><p>We see several different avenues of comparing PPO-based RLHF and (offline) DPO in [6], including theoretical analysis, synthetic experiments and practical training of LLMs. The goal of this work is to find and explain the limitations of DPO in LLM alignment. First, authors confirm there is a performance gap between DPO and PPO-based RLHF. Then, they provide analysis that uncovers the key reason for this trend&#8212;<em>the performance of DPO is significantly impacted by the presence of out-of-distribution examples in its underlying preference dataset.</em></p><p><strong>Reward hacking.</strong> When training an LLM with PPO-based RLHF, we generate completions to prompts in our prompt dataset in an online fashion and score them with a reward model. Given that our reward model is an LLM that is trained over a fixed (and biased) preference dataset, this model is an imperfect proxy for the actual, ground-truth reward&#8212;<em>it can make mistakes in the scores that it provides</em>! Going further, the LLM being trained by PPO can also learn to exploit these mistakes by finding a way to erroneously maximize rewards provided by the reward model without actually meeting human preference expectations. </p><p>This phenomenon&#8212;<em>commonly referred to as &#8220;reward hacking&#8221;</em>&#8212;has a <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">long history of study</a> within the RL literature. However, we see in [6] that similar issues can occur even when using RL-free, offline alignment algorithms like DPO. In particular, authors make the statement quoted below, which tells us that:</p><ul><li><p>Any solution found by PPO also minimizes the training objective for DPO (i.e., the set of solutions to PPO is a subset of the solutions to DPO).</p></li><li><p>It is possible for PPO to find erroneous (or reward-hacked) solutions.</p></li><li><p>Therefore, <em>the same erroneous solutions can also be discovered with DPO</em>.</p></li></ul><div class="pullquote"><p>Given a ground-truth reward <code>r</code> and a preference dataset <code>D</code>, let <code>&#928;_PPO</code> be the class of policies induced by training reward model <code>R_&#934;</code> over <code>D</code> and running PPO. Let <code>&#928;_DPO</code> be the class of policies induced by running DPO. We have the following conclusion: <strong>&#928;_PPO is a proper subset of &#928;_DPO.</strong> - from [6]</p></div><p>Due to not using an explicit reward model, DPO cannot be reward hacked in a similar manner to PPO. However, DPO still suffers from similar issues with out-of-distribution data in a different manner. Specifically, DPO learns a bias towards unseen&#8212;<em>or out-of-distribution</em>&#8212;completions as explained below.</p><blockquote><p>&#8220;<em>DPO can develop a biased distribution favoring unseen responses, directly impacting quality of the learned policy&#8230; DPO is prone to generating a biased policy that favors out-of-distribution responses, leading to unpredictable behaviors.&#8221;</em> - from [6]</p></blockquote><p>This bias is most pronounced when there is a large distribution shift between the reference model used in DPO and the model used to generate completions within the preference dataset. Ideally, these completions should be generated with the reference model used in DPO. While online algorithms like PPO generate on-policy completions during training, offline algorithms like DPO are trained over a fixed preference dataset, where completions can come from an arbitrary LLM. </p><p><strong>Synthetic example.</strong> To validate DPO&#8217;s issues with out-of-distribution data, a simple synthetic training example is constructed in [6]. In this setup, the policy is a basic <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multi-layer perceptron</a> that takes a one-hot vector as input (i.e., the prompt) and produces an eight-dimensional categorical distribution as output<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. We assume that the optimal policy is diagonal as illustrated in the plots below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fCpF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fCpF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 424w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 848w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 1272w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fCpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png" width="376" height="629.7168141592921" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0877e98a-369b-4694-8032-eca9015252a1_904x1514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1514,&quot;width&quot;:904,&quot;resizeWidth&quot;:376,&quot;bytes&quot;:890026,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fCpF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 424w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 848w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 1272w, https://substackcdn.com/image/fetch/$s_!fCpF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0877e98a-369b-4694-8032-eca9015252a1_904x1514.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>Using this toy setup, we can create synthetic preference datasets that purposely omit certain preference pairs from the training data, thus testing the behavior of both DPO and PPO in handling out-of-distribution data. As shown above, PPO handles this coverage issue correctly and recovers the optimal policy. In contrast, <em>DPO incorrectly learns to assign high probability to data that is out-of-distribution, </em>which validates&#8212;<em>at least at a small scale</em>&#8212;the argument in [6] that DPO develops an erroneous bias towards out-of-distribution data in the preference dataset.</p><p><strong>Practical experiments.</strong> Following this synthetic test, larger-scale preference tuning experiments are performed with various Llama-2-derived LLMs on the <a href="https://arxiv.org/abs/2310.12773">SafeRLHF</a> dataset. Experiments begin with an SFT model trained on the <a href="https://huggingface.co/datasets/tatsu-lab/alpaca">Alpaca dataset</a>, creating a distribution shift between the SFT model and the preference data&#8212;<em>completions in the Alpaca dataset are much different than those of SafeRLHF</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C_ku!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C_ku!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 424w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 848w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 1272w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C_ku!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png" width="1456" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C_ku!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 424w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 848w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 1272w, https://substackcdn.com/image/fetch/$s_!C_ku!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84ab1ae9-7323-4111-b2d7-b1fa3bc21989_1490x422.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>As shown above, using the Alpaca SFT model directly as the starting point for DPO performs poorly, but performance improves drastically when we first finetune the Alpaca SFT model over preferred completions in the SafeRLHF dataset prior to performing DPO training. <em>These results indicate that a distribution shift between the reference model and preference data in DPO is indeed detrimental to LLM performance in practical alignment scenarios</em>. Notably, the approach of running additional SFT over preferred completions in the preference dataset prior to DPO was also recommended in the original DPO paper [1]!</p><blockquote><p><em>&#8220;We generate new responses with SFT (Safe) and use a learned reward model for preference labeling. We further repeat this process and iteratively set the reference model as the latest DPO model in the last iteration.&#8221;</em> - from [6]</p></blockquote><p>A new approach for avoiding out-of-distribution data via iterative DPO is also proposed in [6]. We can run several rounds of DPO, where at each round we use the current reference policy to generate fresh completions that are automatically scored by reward model to create a preference dataset. After each round, our current policy becomes the new reference policy, and we repeat this process, <em>thus ensuring there is no distribution shift between our reference policy and the preference dataset</em>. Using this approach, we can train a model with comparable safety (but not helpfulness) ratings to those obtained with PPO, thus narrowing the performance gap between online and offline alignment algorithms.</p><h4><strong><a href="https://arxiv.org/abs/2404.14367">Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data</a> [7]</strong></h4><p>By conducting a comprehensive study that covers nearly every possible alignment strategy for an LLM, authors in [7] discover two key characteristics that create a successful alignment algorithm:</p><ol><li><p>The use of on-policy sampling.</p></li><li><p>The presence of a &#8220;negative gradient&#8221; that decreases probability of bad responses; i.e., <em>instead of only increasing the probability of good responses</em>. </p></li></ol><p>For example, SFT purely trains the LLM using a <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood</a> objective over a set of high-quality completions, while DPO leverages a contrastive objective that both <em>i)</em> increases the probability of the chosen response and <em>ii)</em> decreases the probability of the rejected response. However, the training data is fixed for both of these strategies&#8212;<em>they perform no on-policy sampling</em>. We can fix these issues by using an online RL algorithm like PPO or adopting an iterative DPO strategy that periodically samples new data from the current policy.</p><blockquote><p><em>&#8220;Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses outperform offline and maximum likelihood objectives.&#8221; </em>- from [7]</p></blockquote><p>We also learn in [7] that on-policy sampling and negative gradients are most useful in difficult alignment cases, where the responses that receive high rewards are unlikely within the reference policy. In such cases, the alignment process must train the LLM by &#8220;moving&#8221; probability mass away from low-reward responses and toward high-reward responses. Offline and purely supervised alignment methods perform especially poor in these complex scenarios.</p><p><strong>Alignment algorithms.</strong> Authors in [5] begin by characterizing a wide set of potential alignment algorithms (shown below) based on their use of on-policy sampling, negative gradients, and sample reuse (i.e., performing multiple gradient updates over the same data). As a concrete example of sample reuse, PPO executes two to four sequential gradient updates over each batch of training data, while GRPO<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> and REINFORCE typically avoid such sample reuse [9]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y2yZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 424w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 848w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 1272w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png" width="1456" height="468" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 424w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 848w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 1272w, https://substackcdn.com/image/fetch/$s_!Y2yZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25548677-2f6c-4cdf-b8bd-337b9456ff47_1662x534.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>All SFT and rejection sampling variants lack the negative gradient that is present in RL-based and direct alignment methods, where we explicitly decrease the probability of responses that are either rejected (for direct alignment) or receive a low reward (for RL). Finally, on-policy sampling may or may not be used by most techniques depending on the training setup. Direct alignment methods like DPO or IPO run contrastive training on a fixed preference dataset with no on-policy sampling, <em>but we can create an online version of an offline algorithm by periodically sampling new training data from the current policy</em>. However, some algorithms like PPO and REINFORCE are naturally based upon on-policy sampling. </p><p><strong>Unified alignment algorithm.</strong> To capture the scope of possible alignment algorithms, authors in [7] create the framework shown below. This framework enables the systematic study of different settings within the underlying alignment algorithm. For example, steps one and two can be performed either:</p><ol><li><p>With on-policy data collection (i.e., by generating responses from the current policy and automatically scoring them with a reward model). </p></li><li><p>By directly using offline preference data without any on-policy sampling (e.g., as in standard DPO). </p></li></ol><p>Going further, we can vary the extent of on-policy sampling by changing the total number of samples <code>B</code> or varying total gradient steps <code>T</code> performed on a set of samples. Notably, increasing <code>T</code> introduces sample reuse while increasing <code>B</code> does not, thus allowing us to isolate the impact of reusing on-policy samples. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pVI8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pVI8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 424w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 848w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 1272w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pVI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png" width="1653" height="700" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1653,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179608,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec554b78-2fe6-464b-bcae-964152ce6be1_1676x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pVI8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 424w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 848w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 1272w, https://substackcdn.com/image/fetch/$s_!pVI8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed8f40ec-5197-4961-abc2-2598f1f1f131_1653x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Notably, this unified algorithm does not capture any of the maximum likelihood alignment algorithms, though these algorithms are still considered in [7].</p><p><strong>Training setup.</strong> The properties of these different alignment algorithms are analyzed using several experimental setups including:</p><ul><li><p>Small-scale (didactic) <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">bandit</a> problems. </p></li><li><p>Synthetic LLM problems.</p></li><li><p>Full-scale LLM alignment.</p></li></ul><p>In the synthetic alignment scenario, we use hand-crafted rewards based on the length of the LLM&#8217;s response. Specifically, two reward settings are considered&#8212;<em>minimizing the response length and matching the average response length</em>; see below. These reward scenarios test cases in which high-reward responses lie both within and outside of the region of probable completions for the reference policy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KNAl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KNAl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 424w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 848w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 1272w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KNAl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png" width="1456" height="534" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:534,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KNAl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 424w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 848w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 1272w, https://substackcdn.com/image/fetch/$s_!KNAl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f30dcfa-9d59-4af4-8532-386c8c31866b_1576x578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>The didactic bandit problems also test multiple reward setups that change the optimum of the reward function. By changing the reward setup, we test each algorithm&#8217;s ability to assign probability to high-reward responses, even if these responses have low probability in the original reference policy; see above.</p><blockquote><p><em>&#8220;The optimum of the reward function R1 is located in low likelihood regions of the reference policy, whereas the optimum of R2 is roughly aligned with the mode of the reference policy. We hypothesize that on-policy sampling will be crucial to optimize reward function R1, whereas offline or maximum likelihood methods could be sufficient for the optimization of R2.&#8221;</em> - Bandit problem description from [7]</p></blockquote><p>The full-scale alignment scenario uses public preference data from <a href="https://arxiv.org/abs/2305.14387">AlpacaFarm</a>, <a href="https://arxiv.org/abs/2305.14233">UltraChat</a> and <a href="https://arxiv.org/abs/2310.01377">UltraFeedback</a> to align smaller-scale LLMs like <a href="https://huggingface.co/EleutherAI/pythia-1.4b">Pythia-1.4B</a> and <a href="https://huggingface.co/mistralai/Mistral-7B-v0.1">Mistral-7B</a>. This training setup is a more standard LLM alignment scenario, and models are evaluated using a golden human preference <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a>.</p><p><strong>The role of on-policy sampling.</strong> We learn from experiments in [7] that sampling on-policy data more frequently and in smaller batches&#8212;<em>the most strictly on-policy setup possible</em>&#8212;leads to the best performance. The impact of on-policy sampling is most noticeable in complex alignment scenarios, where high-reward responses do not already lie within the probable region of the reference policy.</p><blockquote><p><em>&#8220;[We] observe strong and clear trends supporting that on-policy sampling with a smaller but frequently sampled batch results in better performance&#8230;</em> <em>The performance degradation with more off-policy updates is substantially milder for &#119877;2, indicating that when the peak in the reward function lies in the likely regions of the reference policy, a higher degree of off-policy updates is tolerable.&#8221;</em> - from [7]</p></blockquote><p>In simper alignment cases where responses that receive high rewards are already probable within the reference policy, the model can better tolerate the use of offline training algorithms. This phenomenon is confirmed in both synthetic and didactic problem setups. Additionally, we observe the same trend in full-scale LLM alignment experiments, where the highest reward comes from decreasing the batch size <code>B</code> to make the training process more on-policy; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w1tP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w1tP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 424w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 848w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 1272w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w1tP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png" width="1456" height="495" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:495,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219575,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w1tP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 424w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 848w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 1272w, https://substackcdn.com/image/fetch/$s_!w1tP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2adfae66-8baf-485c-a4da-3fcf118e16b0_1664x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p><strong>The negative gradient.</strong> Similarly to on-policy sampling, the use of a negative gradient is found to benefit alignment. Algorithms that employ a negative gradient have a noticeable boost in performance relative to those that do not, especially in difficult alignment cases where we must increase the probability of responses that were originally assigned low probability by the reference policy. As shown below (top figure), algorithms that employ a negative gradient increase the probability margin between chosen and rejected responses during training. Such a trend is not observed for algorithms that lack a negative gradient.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b4-g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b4-g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 424w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 848w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 1272w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b4-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1163418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b4-g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 424w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 848w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 1272w, https://substackcdn.com/image/fetch/$s_!b4-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe6b607-92ad-414a-974a-cf1ba3c7d57f_1776x998.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Interestingly, however, we see above (bottom plot) that the absolute probability of both chosen and rejected responses actually decreases during training despite an increasing margin. This same trend has also been observed in other papers [8].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P9VU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P9VU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 424w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 848w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 1272w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P9VU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png" width="1456" height="548" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:339477,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!P9VU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 424w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 848w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 1272w, https://substackcdn.com/image/fetch/$s_!P9VU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0079dbdf-caf4-4e6c-8fe6-87a9db649316_1664x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>On-policy sampling and negative gradients yield compounding benefits when used in tandem. For example, on-policy IPO and DPO have faster convergence and better performance compared to offline variants in both didactic bandit and synthetic LLM experiments; see above. In full-scale LLM experiments, online versions of contrastive alignment algorithms outperform PPO in some cases despite having lower computational costs and wall-clock training time.</p><p><strong>Is sample reuse detrimental?</strong> Substantially increasing the value of <code>T</code> would trivially degrade performance due to the introduction of off-policy data into the training process. However, moderate settings of <code>T</code> could allow the model to incorporate off-policy updates into the training process without causing a large drop in performance. For example, the synthetic LLM setting with PPO has no noticeable degradation in performance when increasing <code>T</code> from 1 to 8; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bC7N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bC7N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 424w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 848w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 1272w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bC7N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png" width="1456" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bC7N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 424w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 848w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 1272w, https://substackcdn.com/image/fetch/$s_!bC7N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85aae630-cf55-48ca-8032-9500ab95a915_1888x746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Maximum likelihood training objectives like rejection sampling (called Best-of-<code>N</code> in the figure above) are more sensitive to sample reuse<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> but can still achieve good results with more moderate settings of T. Put simply, these results show that off-policy updates from sample reuse do not seem to hurt an LLM&#8217;s performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aCIm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aCIm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 424w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 848w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 1272w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aCIm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png" width="1456" height="793" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01250b93-4320-4f11-830a-474ab70977de_1748x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:766856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aCIm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 424w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 848w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 1272w, https://substackcdn.com/image/fetch/$s_!aCIm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01250b93-4320-4f11-830a-474ab70977de_1748x952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p><strong>The key takeaways </strong>from alignment experiments in [7] are depicted in the figure above and can be summarized as follows:</p><ul><li><p>On-policy sampling is crucial for high-quality alignment, especially if responses with optimal reward are not likely in the reference policy.</p></li><li><p>Moderate amounts of sample reuse can introduce off-policy updates without causing a noticeable deterioration in alignment quality.</p></li><li><p>The use of negative gradients leads to faster convergence and has a complementary benefit to on-policy sampling.</p></li><li><p>For simple alignment cases where the peak in rewards is already likely in the reference policy, fully offline or supervised methods&#8212;<em>which use no on-policy sampling or negative gradient</em>&#8212;can still perform well. </p></li></ul><p>Each of these key points are also captured by the practical alignment takeaways presented in [7], which have been copied below for easier reference. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!poQi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!poQi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 424w, https://substackcdn.com/image/fetch/$s_!poQi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 848w, https://substackcdn.com/image/fetch/$s_!poQi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!poQi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!poQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png" width="615" height="399.15865384615387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:945,&quot;width&quot;:1456,&quot;resizeWidth&quot;:615,&quot;bytes&quot;:1721134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!poQi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 424w, https://substackcdn.com/image/fetch/$s_!poQi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 848w, https://substackcdn.com/image/fetch/$s_!poQi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!poQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ccd2a6-8cbf-4acb-a23d-10caca9375fb_1578x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><h4><strong><a href="https://arxiv.org/abs/2406.09279">Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback</a> [2]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BdLr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BdLr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 424w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 848w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 1272w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BdLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png" width="1456" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:282020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BdLr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 424w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 848w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 1272w, https://substackcdn.com/image/fetch/$s_!BdLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07160243-3a0c-416e-b2d1-e664f8a02967_1860x872.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>In [2], authors perform an empirical comparison between online and offline RL algorithms&#8212;<em>PPO-based RLHF and DPO in particular</em>&#8212;for aligning medium-scale LLMs. This analysis tries to maximize the performance of a single LLM across a wide set of benchmarks spanning several domains by varying:</p><ol><li><p>The type, source or scale of preference data being used.</p></li><li><p>The style of training algorithm (i.e., offline or online). </p></li></ol><p>Additionally, several hyperparameter settings and training setups are considered for improving the performance of PPO-based RLHF, providing useful intuition for maximizing results with online RL. From this analysis, we learn that:</p><ul><li><p>The choice of preference data has the greatest impact on LLM quality&#8212;<em>data quality and composition are the key determinants of success in alignment.</em></p></li><li><p>Online RL algorithms consistently outperform offline algorithms like DPO.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1TlM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1TlM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 424w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 848w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 1272w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1TlM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png" width="1456" height="539" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:539,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:227407,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1TlM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 424w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 848w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 1272w, https://substackcdn.com/image/fetch/$s_!1TlM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F606c43fc-7f58-46d3-846a-9695f45aca7e_1836x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>The experimental setup</strong> in [2] adopts a standard approach for both PPO-based RLHF and DPO; see above. All experiments use Tulu-2-13B [3] as the starting model for both DPO and PPO. After preference tuning, models are evaluated over a wide set of benchmarks that measure performance in the following domains:</p><ul><li><p><em>Factuality</em> (e.g., <a href="https://huggingface.co/datasets/cais/mmlu">MMLU</a>)</p></li><li><p><em>Reasoning</em> (e.g., <a href="https://huggingface.co/datasets/openai/gsm8k">GSM8K</a>)</p></li><li><p><em>Truthfulness</em> (e.g., <a href="https://huggingface.co/datasets/domenicrosati/TruthfulQA">TruthfulQA</a>)</p></li><li><p><em>Coding</em> (e.g., <a href="https://huggingface.co/datasets/evalplus/humanevalplus">HumanEval+</a>)</p></li><li><p><em>Safety</em> (e.g., <a href="https://huggingface.co/datasets/toxigen/toxigen-data">ToxiGen</a>)</p></li><li><p><em>Instruction following</em> (e.g., <a href="https://huggingface.co/datasets/google/IFEval">IFEval</a>)</p></li></ul><p>From these diverse benchmarks, we can observe the performance of models in individual domains, as well as their general performance across domains. </p><p><strong>Data selection.</strong> Building upon recent work that leverages synthetic preferences for LLM alignment [4], we can derive preference data from three sources:</p><ol><li><p>Human preferences.</p></li><li><p>Web scraping<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p></li><li><p>Synthetic preferences.</p></li></ol><p>Interestingly, we learn in [2] that synthetic preference datasets&#8212;<em>and the <a href="https://huggingface.co/datasets/openbmb/UltraFeedback">UltraFeedback</a> dataset in particular</em>&#8212;yield the best results, even compared to human-annotated preference data. Going further, authors in [2] specifically mention the following important considerations for curating preference data:</p><ul><li><p>The quality of preferences (i.e., the choice of chosen or rejected completion within a preference pair) is actually more important than the quality of the completions themselves. </p></li><li><p>Collecting per-aspect preference feedback yields a clear performance benefit&#8212;<em>models trained on aggregated, per-aspect preferences outperform those trained on </em><code>15x</code><em> the amount of standard preference data</em>. </p></li><li><p>With the data that was considered in [1], preference tuning has the biggest impact on improving chat capabilities and output style, but the model does not seem to learn new facts or information. </p></li></ul><p>Per-aspect preference feedback is collected by asking a human or model to score each aspect of the data (e.g., helpfulness and harmlessness) independently, then aggregating these per-aspect scores to yield a final, aggregated preference score. Compared to just asking annotators for a single overall preference score, such an approach is found to improve the quality of preference feedback, which in turn improves the quality of resulting models after preference tuning. Authors in [2] consider various factors that impact the quality of post-training, but the source and quality of preference data are found to have the most significant impact.</p><blockquote><p><em>&#8220;PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness.&#8221;</em> - from [2]</p></blockquote><p><strong>PPO vs. DPO.</strong> When directly comparing models trained with an online or offline approach, we see in [2] that online training algorithms have a clear edge. In fact, nearly all models trained with PPO-based RLHF across all datasets are found to outperform those trained with DPO using identical settings. Results in [1] provide clear evidence that online RL benefits preference tuning for LLMs; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YnxD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YnxD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 424w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 848w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YnxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png" width="1456" height="830" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:377202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YnxD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 424w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 848w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!YnxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F500ba667-1d6c-4cac-b114-24c987f4711c_1758x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Why is online training so beneficial?</strong> The answer to this questions is complex and multi-faceted, but authors in [2] make an interesting observation regarding the difference between models trained with DPO and PPO. Namely, PPO models are far more likely to perform <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain-of-thought reasoning</a> for solving complex problems, even without being provided any examples of this behavior. </p><blockquote><p><em>&#8220;Models trained with PPO are far more likely than DPO-trained models to perform chain-of-thought reasoning&#8230; even when not given in-context examples using chain-of-thought. This suggests that reasoning improvements from PPO may be due to increased chain-of-thought abilities.&#8221;</em> - from [2]</p></blockquote><p>Such behavior would be impossible for an LLM to learn with offline algorithms like DPO, as the completions from which the model learns are fixed within the preference dataset. On the other hand, PPO is able to learn such new behaviors because completions are sampled online during training, allowing the model to explore&#8212;<em>and learn from</em>&#8212;new behaviors like chain-of-though reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZfRv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 424w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 848w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png" width="585" height="253.07608695652175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:920,&quot;resizeWidth&quot;:585,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 424w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 848w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Other factors in online RL.</strong> Beyond the analysis of offline and online algorithms in [2], authors perform various other ablations to determine key factors to success in PPO-based RLHF. For example, increasing the size of the reward model&#8212;<em>and the size of the preference dataset over which the reward model is trained</em>&#8212;is found to improve the quality of the reward model. However, the impact of a better reward model on downstream evaluation benchmarks (i.e., after training the LLM with PPO-based RLHF) is less clear. The main performance benefits are observed in more complex domains like reasoning. Seemingly, <em>a more powerful reward model is only impactful in challenging domains that actually require a better reward model</em>.  </p><blockquote><p><em>&#8220;If we&#8217;re using a bigger reward model, we need to have data that is actually challenging the reward model.&#8221;</em> - <a href="https://www.youtube.com/watch?v=rDF7eFPeVto">source</a></p></blockquote><p>We can also boost the performance of the LLM in specific domains by curating a targeted prompt dataset for PPO that focuses on that domain&#8212;<em>this is a unique benefit that can be exploited by PPO but is not possible in offline algorithms like DPO</em>. However, such an approach does not yield performance improvements in general&#8212;<em>it is only useful for tailoring the LLM to specific domains like math</em>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3q7k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3q7k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 424w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 848w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 1272w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3q7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png" width="1378" height="760" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:760,&quot;width&quot;:1378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275190,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3q7k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 424w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 848w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 1272w, https://substackcdn.com/image/fetch/$s_!3q7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b44b76f-ef85-4130-9fe0-c8e1a6d3fe60_1378x760.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>The best training recipe.</strong> To conclude their analysis, authors in [2] emphasize the following aspects of LLM alignment:</p><ul><li><p>The importance of preference data quality.</p></li><li><p>The superiority of online RL.</p></li><li><p>The benefit of better reward models in complex domains like reasoning.</p></li><li><p>The ability of targeted prompt datasets for PPO to curate an LLM&#8217;s performance to a particular domain. </p></li></ul><p>The optimal approach for performing LLM alignment&#8212;<em>as discovered by experiments performed in [2]</em>&#8212;is summarized by the quote below.</p><div class="pullquote"><p>&#8220;We take a high-quality, synthetic preference dataset, a large reward model, and train it using PPO. If we additionally wish to focus on a specific domain, we can additionally collect domain-specific prompts for policy training.&#8221; - from [2]</p></div><h4><strong><a href="https://arxiv.org/abs/2405.08448">Understanding the performance gap between online and offline alignment algorithms</a> [5]</strong></h4><blockquote><p><em>&#8220;We show that on a suite of open source datasets, online algorithms generally outperform offline algorithms at the same optimization budget of KL divergence against the SFT policy&#8221;</em> - from [5]</p></blockquote><p>Authors in [5] analyze the importance of on-policy samples for aligning LLMs with RLHF. To begin, a clear gap in performance is demonstrated between online and offline alignment algorithms. Several intuitive explanations are proposed for this performance gap and investigated one-by-one via targeted data ablations. We learn from these experiments that the use of on-policy sampling seems to be the key performance differentiator for online alignment algorithms. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yNZ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yNZ1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 424w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 848w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 1272w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yNZ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png" width="1456" height="433" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yNZ1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 424w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 848w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 1272w, https://substackcdn.com/image/fetch/$s_!yNZ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde96f668-f45d-47bc-944d-efcfe20c6c52_2408x716.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">IPO loss function (from [8])</figcaption></figure></div><p><strong>Experimental setup.</strong> All experiments in [5] evaluate models based on their win rate against a fixed policy and use the Identity Preference Optimization (IPO) algorithm, which uses the contrastive loss function shown above, for training. This algorithm is similar in nature to DPO. It can be used to align LLMs in an online or offline manner depending on how the training data is sampled. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bWAc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bWAc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 424w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 848w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 1272w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bWAc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png" width="1456" height="383" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167714,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bWAc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 424w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 848w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 1272w, https://substackcdn.com/image/fetch/$s_!bWAc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb36ae87-a272-4877-9446-75d45fb9e24e_1802x474.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>Specifically, we can use IPO in an online fashion by sampling on-policy data from the current policy during training, automatically scoring these completions with a reward model, and training the model over these online samples via the IPO training objective outlined above. A depiction of the differences between online and offline IPO is provided above. Online IPO is used as the online alignment technique in [5] instead of PPO-based RLHF for a few different reasons:</p><ul><li><p>Implementing PPO is complex and expensive due to the requirement of an additional value function.</p></li><li><p>There is no clear way to formulate the PPO optimization process in an offline manner (though DPO was derived as an offline equivalent of PPO).</p></li><li><p>As discussed above, formulating IPO in either an online or offline fashion is relatively straightforward.</p></li></ul><p>Given that PPO-based RLHF is the most widely-used online alignment algorithm, this choice to purely rely upon contrastive learning objectives is a clear deviation from mainstream alignment research. Additionally, analysis in [5] is performed over smaller (i.e., &lt;1B parameter) models. Despite these issues, the learnings from this work still provide useful intuition that helps us to better understand the key distinctions between online and offline alignment algorithms. </p><p>Relative to offline algorithms, online alignment algorithms perform inference during training and require an additional training procedure for the reward model. For these reasons, we cannot compare online and offline algorithms based on their total compute budget&#8212;<em>offline alignment in general will always be much cheaper</em>. Instead, authors in [5] choose to compare policies in terms of their <a href="http://joschu.net/blog/kl-approx.html">KL divergence</a> from the SFT model, <em>capturing how much the model changes during the alignment process (i.e., an optimization &#8220;budget&#8221;) in a compute-agnostic manner</em>. </p><blockquote><p><em>&#8220;Online algorithms tend to be more computationally intensive than offline algorithms, due to sampling and training an extra reward model&#8230; we do not prioritize compute as a main factor during comparison, and instead adopt the KL divergence between the RLHF policy and reference SFT policy as a measure of budget.&#8221;</em>- from [8]</p></blockquote><p><strong>Comparing online and offline RL.</strong> To begin their analysis, authors present the results of online and offline alignment depicted in the figure below. Here, we see that there is a clear gap between the performance of models trained with online and offline alignment algorithms across all levels of possible KL divergence. These results are consistent across several different open alignment datasets<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zykl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zykl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 424w, https://substackcdn.com/image/fetch/$s_!zykl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 848w, https://substackcdn.com/image/fetch/$s_!zykl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!zykl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zykl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png" width="1162" height="1344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1344,&quot;width&quot;:1162,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zykl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 424w, https://substackcdn.com/image/fetch/$s_!zykl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 848w, https://substackcdn.com/image/fetch/$s_!zykl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!zykl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62de8bb1-3712-4fba-b9ec-682f90f81c2d_1162x1344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>Based on the observed superiority of online alignment, authors in [5] propose the following potential explanations for the existence of this performance gap:</p><ol><li><p><em>Data coverage</em>: online algorithms outperform offline algorithms simply because they have more diverse data than the offline algorithm.</p></li><li><p><em>Sub-optimal data</em>: offline algorithms perform worse because the completions in their dataset are generated by the SFT policy and are, therefore, of lower quality compared to on-policy samples generated during alignment<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. </p></li><li><p><em>Better classification</em>: offline algorithms train the policy to classify preferred completions in a preference pair, while online algorithms accomplish this via an explicit reward model. The performance gap may be due to the online algorithm&#8217;s explicit reward model performing this classification more accurately relative to the offline policy.</p></li><li><p><em>Contrastive loss</em>: the contrastive objective used by offline algorithms like IPO and DPO&#8212;<em>not their lack of on-policy sampling</em>&#8212;may lead to the performance gap with online algorithms.</p></li><li><p><em>Scaling laws</em>: the performance gap could potentially disappear as we scale up the size of the underlying policy. </p></li></ol><p>Next, each of these hypotheses is studied in a series of ablation experiments that deeply analyze the difference between online and offline algorithms. </p><p><strong>Data coverage.</strong> To study the impact of data coverage on alignment quality, we can collect all of the completions generated via on-policy sampling during online training to form a dataset for offline alignment. If we preserve the exact order in which this data was sampled, then online and offline alignment are identical&#8212;<em>the models see the same data in the same order and, therefore, receive the same parameter updates</em>. If we shuffle this data and use it for offline alignment, however, we see in [5] that this new data does not yield noticeably better results. As shown in the figure below, the offline algorithm performs similarly using an offline dataset and the shuffled dataset generated via on-policy sampling during online training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IlW3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IlW3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 424w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 848w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IlW3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png" width="1456" height="1512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1512,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IlW3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 424w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 848w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!IlW3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F456fb6e3-d1b2-4e8d-ac78-81698e0d8c20_1514x1572.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>These results show that improving data coverage is not enough to overcome the performance limitations of offline alignment&#8212;<em>data ordering is also important. </em>However, this ordering need not be perfect. As we gradually increase the amount of shuffling in the on-policy samples, model performance remains stable up to a point, then rapidly deteriorates to the level observed with offline alignment.</p><blockquote><p><em>&#8220;Offline algorithms, even when augmented with the same data coverage as the online algorithm, cannot obtain the same level of performance. This alludes to the importance of the exact sampling order obtained via on-policy sampling by a constantly evolving policy.&#8221;</em> - from [5]</p></blockquote><p><strong>Sub-optimal data.</strong> We can easily test the impact of data quality on offline alignment algorithms by generating a preference dataset using a policy that is known to be high-quality. In [5], authors generate an offline training dataset using the final policy obtained via online alignment. When policies are trained over this dataset, there is only a slight improvement in quality; see below. Such a result indicates that the limitations of offline alignment algorithms are not purely due to the presence of lower-quality completions in their preference datasets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5x1u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5x1u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 424w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 848w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 1272w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5x1u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png" width="1456" height="703" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5x1u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 424w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 848w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 1272w, https://substackcdn.com/image/fetch/$s_!5x1u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fecaf9b7e-417f-41b1-9ee9-ecc80f30d1f8_1512x730.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Classification accuracy.</strong> Authors in [5] demonstrate that explicit reward models used by online alignment algorithms achieve higher preference classification accuracy compared to the implicit reward estimate of an offline policy. However, little correlation is found between preference classification accuracy and model performance; in fact, the only observed correlation is slightly negative. Based on these findings, the authors conclude that the superior preference classification accuracy of online algorithms' explicit reward models is unlikely to be the primary factor behind the improved performance of online alignment methods.</p><p><strong>Contrastive objective.</strong> To study whether the sub-par performance of offline alignment algorithms stems from their use of a contrastive loss function, authors derive a non-contrastive loss for offline alignment called Best-of-2. Put simply, the Best-of-2 training algorithm takes chosen completions for each preference pair in a dataset and runs SFT over these completions. When we train a model using the Best-of-2 loss over our standard offline preference dataset, there is no noticeable change in performance. However, adding online samples to Best-of-2 training&#8212;<em>even when these samples are shuffled to remove the ordering from online alignment</em>&#8212;nearly closes the performance gap with online techniques; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q8ge!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q8ge!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 424w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 848w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q8ge!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:366345,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q8ge!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 424w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 848w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!q8ge!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbea0edff-6693-4de9-9baa-e0fd52286b5c_1776x1006.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>Such a result clearly demonstrates that data coverage is the key indicator of success for SFT, which motivates the inclusion of on-policy samples in SFT (i.e., rejection sampling). We can achieve impressive alignment results by simply including some level of on-policy data in offline training algorithms, <em>forming practically effective LLM alignment baselines that are easy to implement</em>. </p><p><strong>Does scaling up help?</strong> Authors end their analysis in [5] by studying the impact of model scale on the gap between online and offline alignment algorithms. In these experiments, we see that the gap between offline and online algorithms:</p><ul><li><p>Decreases at larger scales.</p></li><li><p>Is more heavily related to data coverage at large scales.</p></li></ul><p>More specifically, training larger models over a shuffled dataset of on-policy samples nearly closes the online-offline performance gap; see below. Such a finding did not hold in data coverage experiments with smaller models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MKD_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MKD_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 424w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 848w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MKD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png" width="1456" height="866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MKD_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 424w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 848w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!MKD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F068cc1e8-ef92-457a-b6c0-008ff107445f_1778x1058.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p><strong>Key takeaway.</strong> The detailed alignment analysis in [5] leaves us with one key finding: <em>on-policy sampling is important for high-quality alignment</em>. There are many alternative explanations for the superiority of online alignment algorithms (e.g., data coverage or quality). However, these theories are debunked&#8212;<em>at least at a smaller scale</em>&#8212;by the many data ablations in [5], revealing that on-policy samples are the key contributor to the online-offline performance gap. This finding is very powerful, as it allows us to rethink the data sampling process used for offline alignment algorithms&#8212;<em>we can improve the performance of offline techniques by incorporating (semi-)online data samples as described below</em>!</p><div class="pullquote"><p>&#8220;The dichotomy of online vs. offline is often inaccurate in practice, since an offline algorithm with a repeatedly updated data stream is effectively an online algorithm. As a result, offline learning can be made less likely to suffer from the shortcomings identified in this work, by being more careful with the data generation process in general.&#8221; <em>- from [5]</em></p></div><h4><strong><a href="https://arxiv.org/abs/2506.21495">Bridging Offline and Online Reinforcement Learning for LLMs</a> [9]</strong></h4><blockquote><p><em>&#8220;We study offline, semi-online, and online configurations, across both verifiable and non-verifiable tasks. By examining the transition from offline to online training (i.e., by altering the speed of periodic model syncing), we aim to understand how these methods can be optimized for improved performance and efficiency.&#8221;</em> - from [9]</p></blockquote><p>To granularly study the relationship between online and offline RL, authors in [9] finetune LLMs while smoothly transitioning the training process from an offline to an online setting. In other words, <em>we bridge the gap between online and offline RL by testing training techniques that fall in the middle</em>. By performing such tests over both verifiable (e.g., math) and non-verifiable domains (e.g., chat or instruction-following), we can gain an understanding of how on-policy sampling impacts the RL training process. More specifically, when we compare an on-policy GRPO setup to offline, semi-online, and on-policy variants of DPO, we learn that:</p><ol><li><p>Online and semi-online techniques significantly outperform offline training.</p></li><li><p>Semi-online DPO nearly matches the performance of online DPO.</p></li></ol><p>Put simply, we learn in [9] that online training is beneficial to model performance, but we can reap much of this benefit with a more efficient, semi-online approach.</p><p><strong>Online, semi-online, and offline.</strong> For experiments in [9], authors train the <a href="https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct">Llama-3.1-8b-Instruct</a> model using both on-policy GRPO and several variants of DPO. Specifically, we can create variants of DPO with varying degrees of on-policy sampling by defining a period <code>s</code> such that the policy being trained is used to generate fresh on-policy samples for DPO every <code>s</code> training iterations. In other words, we sync the parameters of the policy being trained and the policy used to sample completions for our preference data every <code>s</code> parameter updates; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yykt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yykt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 424w, https://substackcdn.com/image/fetch/$s_!yykt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 848w, https://substackcdn.com/image/fetch/$s_!yykt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 1272w, https://substackcdn.com/image/fetch/$s_!yykt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yykt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77772f60-5fce-428a-9919-92df84170eb4_2072x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295027,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yykt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 424w, https://substackcdn.com/image/fetch/$s_!yykt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 848w, https://substackcdn.com/image/fetch/$s_!yykt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 1272w, https://substackcdn.com/image/fetch/$s_!yykt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77772f60-5fce-428a-9919-92df84170eb4_2072x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Notably, iterative forms of DPO&#8212;<em>where we generate a new set of completions for training with the current model at each iteration</em>&#8212;have been explored by prior work [10, 11]. However, these methods usually perform rough iterations, where new completions are sampled relatively infrequently. By varying the setting of <code>s</code>, we can explore arbitrary granularities of semi-online DPO, even including a fully on-policy DPO setting where <code>s = 1</code>. Put simply, we can bridge the gap between offline, semi-online, and online DPO by slowly decreasing <code>s</code> from <code>&#8734;</code> to <code>1</code>.</p><p><strong>Experimental setup.</strong> Experiments are conducted in two possible domains:</p><ul><li><p>A non-verifiable domain where training data is drawn from <a href="https://huggingface.co/datasets/allenai/WildChat-1M-Full">WildChat-1M</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> and models are evaluated via <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judges</a> in terms of their chat capabilities (e.g., using <a href="https://arxiv.org/abs/2404.04475">AlpacaEval</a> and <a href="https://arxiv.org/abs/2406.11939">Arena-Hard</a>).</p></li><li><p>A math-focused, verifiable domain where training data is drawn from the <a href="http://faculty.bicmr.pku.edu.cn/~dongbin/Publications/numina_dataset.pdf">NuminaMath</a> dataset and evaluation is performed on several verifiable math benchmarks (e.g., <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">Math500</a> and <a href="https://huggingface.co/datasets/math-ai/amc23">AMC23</a>).</p></li></ul><p>In the verifiable domain, the reward signal is obtained using the <a href="https://github.com/huggingface/Math-Verify">Math-Verify</a> toolkit rather than exact string matching, which makes the reward more robust to variations in answer format<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. The non-verifiable reward is derived from an off-the-shelf human preference reward model&#8212;<em><a href="https://huggingface.co/Nexusflow/Athene-RM-8B">Athene-RM-8b</a> in particular</em>&#8212;that is fixed throughout all experiments. To apply DPO in the verifiable domain, we simply generate several responses to each question, then choose a single correct and incorrect answer for each question to form preference pairs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y1mY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y1mY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y1mY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png" width="1456" height="1216" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1216,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:627443,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y1mY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 424w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 848w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 1272w, https://substackcdn.com/image/fetch/$s_!Y1mY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23563f0b-89ad-41ef-9e4f-aaeba5c0905c_1660x1386.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p><strong>Is semi-online enough?</strong> The results of these experiments on both verifiable and non-verifiable tasks are shown above. Immediately, we see that training with an online or semi-online setup provides substantial gains over offline DPO in both domains. <em>There is a clear performance gap between offline and online methods</em>. But, the gap between online and semi-online settings is much less pronounced. In fact, online and semi-online DPO even outperform on-policy GRPO in some cases! These findings hold true even with relatively large values of <code>s</code>; e.g., in the verifiable domain <code>s</code> is increased to 100 with very promising results<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. </p><blockquote><p><em>&#8220;The efficiency gains of the semi-online variants opens up an interesting question of whether fully online RL is the only approach for post-training LLMs.&#8221; </em>- from [9]</p></blockquote><p>Such findings have interesting implications for the online-offline performance gap in RL. We see in [9] that there is a clear benefit to online sampling. However, we can potentially approximate this sampling more efficiently via a semi-online setup that intermittently collects fresh data instead of strict on-policy sampling.</p><p><strong>Verifiable versus non-verifiable.</strong> Experiments are also performed in [9] to explore the interplay between verifiable and non-verifiable rewards, showing that the curriculum (or order) of rewards during RL training is important. If we compare settings in which the LLM is first trained on non-verifiable rewards then on verifiable rewards (<code>NV &#8594; V</code>) or vice versa (<code>V &#8594; NV</code>), we get better performance by first training on non-verifiable rewards (i.e., <code>NV &#8594; V</code> &#187; <code>V &#8594; NV</code>). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e6vJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e6vJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 424w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 848w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 1272w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e6vJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png" width="1456" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:675873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/169926007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e6vJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 424w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 848w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 1272w, https://substackcdn.com/image/fetch/$s_!e6vJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd0d5e62-ce00-468c-acc7-719426a052f6_2302x936.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Training on non-verifiable rewards after the LLM has been trained on verifiable rewards leads to a noticeable performance deterioration in verifiable domains. In contrast, further training on verifiable rewards actually <em>improves</em> the performance of the LLM, even in non-verifiable domains; see above. If we combined both non-verifiable and verifiable rewards within a single training run (<code>V + NV</code>) the model also performs well, <em>revealing that the simplest approach may be just mixing the disparate reward signals into a single, unified training run</em>! </p><h2>Conclusion</h2><p>There are many alignment algorithms for LLMs, each varying in complexity and performance. Online algorithms have a clear performance benefit over offline alignment algorithms. In this overview, we have learned that this gap in performance primarily arises due to the use of on-policy sampling in online alignment algorithms, as well as other&#8212;<em>arguably less significant</em>&#8212;factors like negative gradients. Interestingly, however, we have also learned that much simpler and equally effective alignment algorithms can be derived by including on-policy samples in the training dataset used for offline alignment, forming semi-online algorithms that are practically effective and easy to implement. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." <em>Advances in neural information processing systems</em> 36 (2023): 53728-53741.<br>[2] Ivison, Hamish, et al. "Unpacking dpo and ppo: Disentangling best practices for learning from preference feedback." <em>Advances in neural information processing systems</em> 37 (2024): 36602-36633.</p><p>[3] Ivison, Hamish, et al. "Camels in a changing climate: Enhancing lm adaptation with tulu 2." <em>arXiv preprint arXiv:2311.10702</em> (2023).</p><p>[4] Tunstall, Lewis, et al. "Zephyr: Direct distillation of lm alignment." <em>arXiv preprint arXiv:2310.16944</em> (2023).</p><p>[5] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." <em>arXiv preprint arXiv:2405.08448</em> (2024).</p><p>[6] Xu, Shusheng, et al. "Is dpo superior to ppo for llm alignment? a comprehensive study." <em>arXiv preprint arXiv:2404.10719</em> (2024).</p><p>[7] Tajwar, Fahim, et al. "Preference fine-tuning of llms should leverage suboptimal, on-policy data." <em>arXiv preprint arXiv:2404.14367</em> (2024).</p><p>[8] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." <em>International Conference on Artificial Intelligence and Statistics</em>. PMLR, 2024.</p><p>[9] Lanchantin, Jack, et al. "Bridging Offline and Online Reinforcement Learning for LLMs." <em>arXiv preprint arXiv:2506.21495</em> (2025).</p><p>[10] Yuan, Weizhe, et al. "Self-rewarding language models." <em>arXiv preprint arXiv:2401.10020</em> 3 (2024).</p><p>[11] Pang, Richard Yuanzhe, et al. "Iterative reasoning preference optimization." <em>Advances in Neural Information Processing Systems</em> 37 (2024): 116617-116637.</p><p>[12] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[13] Mukobi, Gabriel, et al. "Superhf: Supervised iterative learning from human feedback." <em>arXiv preprint arXiv:2310.16763</em> (2023).</p><p>[14] Gulcehre, Caglar, et al. "Reinforced self-training (rest) for language modeling." <em>arXiv preprint arXiv:2308.08998</em> (2023).</p><p>[15] Hu, Jian, et al. "Aligning language models with offline learning from human feedback." <em>arXiv preprint arXiv:2308.12050</em> (2023).</p><p>[16] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training." <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[17] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." <em>Advances in neural information processing systems</em> 35 (2022): 27730-27744.</p><p>[18] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." <em>Advances in neural information processing systems</em> 36 (2023): 53728-53741.</p><p>[19] Ethayarajh, Kawin, et al. "Kto: Model alignment as prospect theoretic optimization." <em>arXiv preprint arXiv:2402.01306</em> (2024).</p><p>[20] Xu, Haoran, et al. "Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation." <em>arXiv preprint arXiv:2401.08417</em> (2024).</p><p>[21] Huang, Shengyi, et al. "The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization." <em>arXiv preprint arXiv:2403.17031</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This can be accomplished using a <a href="https://huggingface.co/docs/trl/main/en/sft_trainer#train-on-completion-only">completion-only loss collator</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are a few different ways this selection can be performed. For example, we can select the top completion for each prompt, or we can select the top-scoring completions across all prompts; see <a href="https://rlhfbook.com/c/10-rejection-sampling.html#selecting-top-n-completions">here</a> for details.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In other words, the output is a vector of size eight to which a <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax function</a> has been applied to form a probability distribution over these eight possible outcomes. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>GRPO is not listed in this table due to the fact that both [7] and the GRPO paper [12] were published at very similar times. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The LLM before alignment already generates completions that are near the average length. In contrast, the LLM does not generate minimum (or zero) length completions, so learning to generate such responses requires probability mass to be moved into a new region that was previously unlikely. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This sensitivity is due to the fact that maximum likelihood algorithms do not have any explicit mechanism to protect against off-policy sampling, whereas PPO has the clipping operation and KL divergence that help to maintain the trust region. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>As an example of how we can obtain preference data via web scraping, the <a href="https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences">Stack Exchange Preferences dataset</a> takes questions from Stack Overflow with at least two answers and ranks answers based on implicit feedback (e.g., likes or upvotes).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Specifically, this work uses the <a href="https://huggingface.co/datasets/openai/summarize_from_feedback">OpenAI summarization</a>, <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf">Anthropic Helpful and Harmless</a> (hh-rlhf), and the <a href="https://lmsys.org/blog/2023-07-20-dataset/">Chatbot arena preference dataset</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>We should note that one can make a similar argument against online algorithms! The reward model used in online algorithms is also trained over a fixed dataset, which can lead to similar limitations in the performance of online algorithms. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>This is a general chat and instruction-following benchmark that is comprised of ~1M user interactions with ChatGPT. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>For example, an LLM could provide an answer of 0.5 or 1/2 to a math question. Both of these answers would be correct, but one of them would likely be marked as wrong if we are verifying our reward via exact string match. For this reason, using a more robust validation system for mathematical expressions is helpful. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>The value of <code>s</code> is much larger in the verifiable domain compared to the non-verifiable domain. Authors in [9] make this choice because the non-verifiable dataset is small and a setting of <code>s = 32</code> spans a full epoch over the data. Therefore, the training process is not stable with larger values of <code>s</code> in the non-verifiable domain.</p></div></div>]]></content:encoded></item><item><title><![CDATA[GPT-oss from the Ground Up]]></title><description><![CDATA[Everything you should know about OpenAI's new open-weight language models...]]></description><link>https://cameronrwolfe.substack.com/p/gpt-oss</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/gpt-oss</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 18 Aug 2025 09:33:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4e1fd6f8-4805-43c3-bfe4-5e66bd3983ca_2454x1378.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VV2-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VV2-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 424w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 848w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VV2-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1421889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VV2-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 424w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 848w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!VV2-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56432d4c-bdd0-4eed-afaf-2b4900ef83d6_2450x1374.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18, 20, 21])</figcaption></figure></div><p>Recently, OpenAI released <a href="https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4">GPT-oss</a> [1, 2]&#8212;<em>their first open LLM release since <a href="https://cameronrwolfe.substack.com/i/85568430/language-models-are-unsupervised-multitask-learners-gpt">GPT-2</a> [13] over five years ago</em>. In the time between GPT-2 and GPT-oss, LLM research has undergone a continuous transformation. Many of the key breakthroughs in LLM research during this time have come from OpenAI, but their research is almost always kept internal. GPT-oss provides a rare peek into LLM research at OpenAI. In this overview, we will take advantage of this infrequent opportunity by:</p><ol><li><p>Exhaustively outlining every single technical detail revealed about GPT-oss in the report(s) provided by OpenAI.</p></li><li><p>Explaining how each of these details work from the ground up<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p></li></ol><p>This overview is long (probably too long), and it covers a wide variety of loosely related topics in LLM research. However, by taking the time to work through each of these topics, we will gain a deep understanding of how GPT-oss works and, in turn, form a better perspective on the state of LLM research at OpenAI.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join 50,000 others who use Deep (Learning) Focus to stay up-to-date with AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>GPT-oss at a Glance</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yt1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yt1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 424w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 848w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 1272w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yt1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png" width="1456" height="214" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:214,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59719,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yt1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 424w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 848w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 1272w, https://substackcdn.com/image/fetch/$s_!yt1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ed4410-ccd4-4183-907e-bfab8b5df2ae_2422x356.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><blockquote><p><em>&#8220;They were trained using a mix of reinforcement learning and techniques informed by OpenAI&#8217;s most advanced internal models, including o3 and other frontier systems.&#8221;</em> - from [1]</p></blockquote><p>The GPT-oss release includes two different models&#8212;<em><a href="https://huggingface.co/openai/gpt-oss-20b">GPT-oss-20b</a> and <a href="https://huggingface.co/openai/gpt-oss-120b">GPT-oss-120b</a></em>&#8212;that are both released with a permissive <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache-2.0 license</a>. These are Mixture-of-Experts (MoE)-based reasoning models that are text-only and trained primarily on English data. Due to their MoE architecture and use of quantization-aware training, these models are compute and memory efficient. The 20b and 120b models have 5b and 3.5b active parameters respectively. Using MXFP4 (~4-bit) precision, the larger model can be hosted on a single 80Gb GPU, while GPT-oss-20b needs only ~16Gb of memory for hosting. These models are extensively post-trained to optimize their <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">chain of thought (CoT)</a> reasoning and safety. </p><p><strong>Emphasis on agents.</strong> Both GPT-oss models are optimized for agentic workflows with a (reasonably) long context window of 131k tokens, as well as strong tool use, reasoning and instruction-following capabilities. To handle patterns from agentic workflows (e.g., function calling, tool use, reasoning, <a href="https://www.aidancooper.co.uk/constrained-decoding/">structured outputs</a>, and more) more seamlessly, OpenAI released the new Harmony prompt format&#8212;<em>a flexible, hierarchical chat template capable of capturing diverse LLM interaction patterns&#8212;</em>for training and interacting with GPT-oss. The GPT-oss models also provide the ability to adjust their reasoning effort (i.e., to low, medium or high effort levels) by explicitly specifying an effort level in their system message.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mbFA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mbFA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 424w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 848w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 1272w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mbFA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png" width="1456" height="924" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:924,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256557,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mbFA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 424w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 848w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 1272w, https://substackcdn.com/image/fetch/$s_!mbFA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa451da2-4e3e-4e4d-b9c8-e5018d60af03_1544x980.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Internal evaluations.</strong> Evaluations released by OpenAI reveal that GPT-oss-120b performs comparably to <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o4-mini</a>, while GPT-oss-20b performs similarly to o3-mini; see above. Additionally, OpenAI heavily emphasized the strong capabilities of these models on health-related tasks&#8212;<em>based on evaluations from their newly-released <a href="https://openai.com/index/healthbench/">HealthBench</a></em>&#8212;during the release; see below. However, GPT-oss models still fall short of the performance of the full o3 model on this benchmark.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fGrS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fGrS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 424w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 848w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 1272w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fGrS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png" width="410" height="220.66445182724252" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1204,&quot;resizeWidth&quot;:410,&quot;bytes&quot;:83738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fGrS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 424w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 848w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 1272w, https://substackcdn.com/image/fetch/$s_!fGrS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7a11473-e3d9-4077-bb50-b06e7e149951_1204x648.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>As should be expected, OpenAI also highlights that the GPT-oss models obey the usual inference-time scaling laws with respect to their reasoning effort. Model performance improves as the models generate progressively longer reasoning traces&#8212;<em>and therefore consume more compute</em>&#8212;during inference; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qpwX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qpwX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 424w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 848w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 1272w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qpwX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png" width="1456" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bebbc399-a891-45b1-b933-811655b02d68_2644x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217638,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qpwX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 424w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 848w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 1272w, https://substackcdn.com/image/fetch/$s_!qpwX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbebbc399-a891-45b1-b933-811655b02d68_2644x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Public reception.</strong> After making their way around the open LLM community, the GPT-oss models have received mixed feedback. For example, some users have pointed out that these models have a <a href="https://www.reddit.com/r/singularity/comments/1mihu08/the_new_gptoss_models_have_extremely_high/">high hallucination rate</a>, while others say that the models are <a href="https://www.reddit.com/r/LocalLLaMA/comments/1mlomlb/my_thoughts_on_gptoss120b/">actually pretty good</a> after initial hiccups related to model setup were fixed. Other common criticisms of the GPT-oss models include <a href="https://www.reddit.com/r/LocalLLaMA/comments/1miqbyk/the_openai_gptoss_model_is_too_safe/">over-refusal of prompts</a>, difficulty with properly setting up model quantization, and the Harmony prompt format being overly complex or hard to use. Put simply, the perception seemed poor at first, but <a href="https://www.reddit.com/r/LocalLLaMA/comments/1mogxpr/openai_gptoss120b_is_an_excellent_model/">slowly improved</a> as lingering issues in common tools like <a href="https://ollama.com/">ollama</a>, <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, and <a href="https://docs.unsloth.ai/">unsloth</a> and were resolved. </p><p>The reality of GPT-oss is somewhere in the middle of the polarizing and clickbaity reactions online. These are (obviously) not the best models ever, but they are open weights models released by one of the top LLM labs in the world. Given that few of the top American LLM labs (other than <a href="https://arxiv.org/abs/2411.15124">AI2</a>, <a href="https://cohere.com/blog/aya-expanse-connecting-our-world">Cohere</a> and <a href="https://cameronrwolfe.substack.com/p/llama-4">Meta</a>) are actively releasing open weights models, we would be foolish to not try out these models and gain a deep understanding of how they work. So, let&#8217;s start diving into the relevant technical details provided by OpenAI on GPT-oss.</p><h2>Model Architecture</h2><blockquote><p><em>&#8220;The GPT-oss models are autoregressive Mixture-of-Experts (MoE) transformers that build upon the GPT-2 and GPT-3 architectures.&#8221;</em> - from [1]</p></blockquote><p>We will first cover the model architecture of the GPT-oss models. This discussion will start with a basic understanding of the transformer architecture<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. From here, we will outline each unique component of the GPT-oss architecture with a from-scratch explanation. For further reading on this topic and comparison to other open models, see the great overview from <a href="https://sebastianraschka.com/">Sebastian Raschka</a> below. </p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:170506328,&quot;url&quot;:&quot;https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the&quot;,&quot;publication_id&quot;:1174659,&quot;publication_name&quot;:&quot;Ahead of AI&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!96vs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;title&quot;:&quot;From GPT-2 to gpt-oss: Analyzing the Architectural Advances&quot;,&quot;truncated_body_text&quot;:&quot;OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later).&quot;,&quot;date&quot;:&quot;2025-08-09T11:23:07.237Z&quot;,&quot;like_count&quot;:169,&quot;comment_count&quot;:17,&quot;bylines&quot;:[{&quot;id&quot;:27393275,&quot;name&quot;:&quot;Sebastian Raschka, PhD&quot;,&quot;handle&quot;:&quot;rasbt&quot;,&quot;previous_name&quot;:&quot;Sebastian Raschka&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61f4c017-506f-4e9b-a24f-76340dad0309_800x800.jpeg&quot;,&quot;bio&quot;:&quot;I'm an LLM research engineer 10+ years of experience in artificial intelligence. My expertise lies in AI &amp; LLM research focusing on code-driven implementations. I am also the author of \&quot;Build a Large Language Model From Scratch\&quot; (amzn.to/4fqvn0D).&quot;,&quot;profile_set_up_at&quot;:&quot;2022-10-09T16:19:59.744Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-11-07T19:56:32.129Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:1127862,&quot;user_id&quot;:27393275,&quot;publication_id&quot;:1174659,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:1174659,&quot;name&quot;:&quot;Ahead of AI&quot;,&quot;subdomain&quot;:&quot;sebastianraschka&quot;,&quot;custom_domain&quot;:&quot;magazine.sebastianraschka.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Ahead of AI specializes in Machine Learning &amp; AI research and is read by tens of thousands of researchers and practitioners who want to stay ahead in the ever-evolving field.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;author_id&quot;:27393275,&quot;primary_user_id&quot;:27393275,&quot;theme_var_background_pop&quot;:&quot;#2096FF&quot;,&quot;created_at&quot;:&quot;2022-11-04T18:30:05.218Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Raschka AI Research (RAIR) Lab LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding plan&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;rasbt&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!96vs!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png" loading="lazy"><span class="embedded-post-publication-name">Ahead of AI</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">From GPT-2 to gpt-oss: Analyzing the Architectural Advances</div></div><div class="embedded-post-body">OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">9 months ago &#183; 169 likes &#183; 17 comments &#183; Sebastian Raschka, PhD</div></a></div><h4>Transformer Structure</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aQxq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aQxq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 424w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 848w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1272w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png" width="1456" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aQxq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 424w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 848w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1272w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Decoder-only transformer architecture</figcaption></figure></div><p>A depiction of a standard, <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformer architecture</a> is provided above. This architecture is used almost universally by modern GPT-style LLMs. </p><p><strong>Embedding dimension.</strong> The input to this model is a sequence of token vectors, produced by <a href="https://cameronrwolfe.substack.com/i/142044446/constructing-the-models-input">tokenizing and embedding</a> our textual input (or prompt). In the case of the GPT-oss models, these vectors have a fixed dimension of 2,880, and this same embedding dimension is maintained through every layer of the LLM. </p><p><strong>Block structure.</strong> The decoder-only architecture is comprised of repeated decoder blocks&#8212;<em>GPT-oss models contain either 24 (GPT-oss-20b) or 36 (GPT-oss-120b) of these blocks</em>. As we can see above, each decoder block has the same key components: normalization, <a href="https://cameronrwolfe.substack.com/i/155023686/masked-and-multi-headed-self-attention">masked multi-headed self-attention</a>, <a href="https://cameronrwolfe.substack.com/i/155023686/feed-forward-transformation">feed-forward transformation</a>, and <a href="https://en.wikipedia.org/wiki/Residual_neural_network">residual connections</a>. The GPT-oss models adopt a pre-normalization structure, which is the most common choice in current LLM architectures<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. This means that the normalization layers in the decoder block are placed before both the attention and feed-forward layers, yielding the following structure:</p><div><hr></div><p><code>Decoder Block Input &#8594; Normalization &#8594; Masked Self-Attention &#8594; Residual Connection &#8594; Normalization &#8594; Feed-Forward Network &#8594; Residual Connection &#8594; Decoder Block Output</code></p><div><hr></div><p>Although a pre-normalization structure is most common, there is no clear answer in terms of whether pre or post-normalization is superior. In fact, recent work has even shown that post-normalization benefits training stability [3]; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eUVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eUVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 424w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 848w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 1272w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eUVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png" width="1456" height="585" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:585,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:513314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eUVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 424w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 848w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 1272w, https://substackcdn.com/image/fetch/$s_!eUVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca41ac1b-59a9-4263-9a16-5d0d0d539915_2416x970.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Normalization.</strong> Initial transformers used <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">layer normalization</a> as the standard choice of normalization layer. More recently, many LLMs have replaced layer normalization with <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html">root mean square layer normalization</a> (or RMSNorm for short) [4], which is a simpler&#8212;<em>and more computationally efficient</em>&#8212;version of layer normalization that has fewer trainable parameters and performs similarly. GPT-oss models adopt this choice by using RMSNorm in all decoder blocks. See <a href="https://magazine.sebastianraschka.com/i/170506328/rmsnorm-replaces-layernorm">here</a> for an explanation of RMSNorm (and a comparison to layer normalization).</p><h4>Attention Implementation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oCzw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oCzw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 424w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 848w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oCzw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png" width="1456" height="715" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177605,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oCzw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 424w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 848w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!oCzw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2d2bb0e-e18e-4d38-ba6a-e1e376ba89f6_2154x1058.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Depiction of masked self-attention with a single attention head</figcaption></figure></div><p><strong>Masked self-attention.</strong> A masked self-attention operation is depicted above; see <a href="https://cameronrwolfe.substack.com/i/155023686/masked-and-multi-headed-self-attention">here</a> for more details. Most LLMs&#8212;<em>including GPT-oss</em>&#8212;use multi-headed masked self-attention, meaning that there are multiple self-attention operations running in parallel for each self-attention layer. In the case of GPT-oss models, each self-attention layer has 64 parallel attention heads. Each of these attention heads use vectors with a dimension of 64, meaning that the key, query and value projections (shown above) transform embedding vectors from a size of 2,880 to 64.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QELC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QELC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 424w, https://substackcdn.com/image/fetch/$s_!QELC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 848w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1272w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png" width="1456" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QELC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 424w, https://substackcdn.com/image/fetch/$s_!QELC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 848w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1272w, https://substackcdn.com/image/fetch/$s_!QELC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7dc1e2-e66c-4a30-a0a7-518ae7e3a566_1536x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p><strong>Multi and grouped-query attention.</strong> Expanding on multi-headed self-attention, prior work has proposed both multi-query [5] and grouped-query attention [6]. As depicted above, instead of having unique keys and values for each attention head, these techniques share the keys and values (but not queries!) between multiple attention heads. For example, multi-query attention has a single set of keys and values that are re-used for all attention heads, while grouped-query attention shares keys and values between fixed-sized groups of attention heads. </p><blockquote><p><em>&#8220;The memory bandwidth from loading keys and values can be sharply reduced through multi-query attention, which uses multiple query heads but single key and value heads. However, multi-query attention (MQA) can lead to quality degradation and training instability.&#8221;</em> - from [6]</p></blockquote><p>Sharing keys and queries across multiple attention heads benefits both parameter and compute efficiency, but the biggest benefit of grouped-query attention comes at inference time. <em>There is a reduction in memory bandwidth usage at inference because there are fewer keys and values that we need to be retrieved from the model&#8217;s <a href="https://huggingface.co/blog/not-lain/kv-caching">KV cache</a></em>. Given that memory bandwidth can be a key bottleneck to transformer inference speed, this architectural change drastically speeds up the inference process. </p><p>However, we cannot be too extreme with the sharing of keys and values&#8212;<em>we see in [6] that having all attention heads share the same key and value vectors degrades performance</em>. Grouped-query attention balances performance with efficiency by sharing keys and values among smaller groups, thus finding a tradeoff between standard multi-headed attention and multi-query attention. Specifically, GPT-oss uses group sizes of eight&#8212;<em>meaning that keys and values are shared among groups of eight attention heads</em>&#8212;for grouped-query attention in both model sizes.</p><p><strong>Sparse attention.</strong> Within the decoder blocks of GPT-oss models, we alternate between using dense and locally-banded sparse attention [7] within each block. In masked self-attention, we compute the attention matrix as shown below, where a causal mask is applied that sets all masked values in the attention matrix&#8212;<em>those that come after each token in the sequence</em>&#8212;to be negative infinity<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. This ensures that tokens that should not be considered by the self-attention operation are given a probability of zero after the softmax transformation is applied.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I1xt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I1xt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 424w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 848w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 1272w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I1xt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png" width="1456" height="634" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I1xt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 424w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 848w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 1272w, https://substackcdn.com/image/fetch/$s_!I1xt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a70822-412f-45a9-a80b-a66ccd8e1925_2144x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Masking in causal self-attention</figcaption></figure></div><p>Computing self-attention has quadratic&#8212;<em>or </em><code>O(S^2)</code><em> where </em><code>S</code><em> is the sequence length</em>&#8212;complexity. Put simply, this means that self-attention becomes computationally expensive when applied to long sequences. When we look at the masking pattern above, however, we might wonder: <em>Does the LLM actually need to look at the entire sequence preceding each token?</em> As proposed by the Longformer [7], we can save compute costs by limiting the window over which self-attention is computed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AzLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AzLZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 424w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 848w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 1272w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AzLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png" width="488" height="272.58285714285716" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1400,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:98372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AzLZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 424w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 848w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 1272w, https://substackcdn.com/image/fetch/$s_!AzLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ffddd73-461c-4b77-9f25-887fa48a7601_1400x782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Masked versus sliding window attention</figcaption></figure></div><p>This idea (depicted above) is called sliding window attention<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> and has been successfully adopted by several LLMs like <a href="https://arxiv.org/abs/2310.06825">Mistral</a> and <a href="https://arxiv.org/abs/2503.19786">Gemma</a>. We modify our masking matrix  to limit the range of preceding tokens that are considered by the self-attention operation. Previously, we only masked tokens that come after each token. Now, <em>we are also masking tokens that are sufficiently far in the past.</em> This idea is referred to as &#8220;locally banded sparse attention&#8221; in the GPT-oss models [1, 2]. </p><p>The GPT-oss models replace every other masked self-attention module (i.e., a 1:1 ratio) with sliding window attention. The first attention layer uses dense self-attention, the second layer uses sliding window attention and so on. By adopting sliding window attention in a subset of layers, we improve the efficiency of the model architecture by avoiding the quadratic complexity of self-attention with a smaller, fixed window size. Ideally, this efficiency gain comes without causing a corresponding deterioration in model quality, though this may depend on the exact settings adopted (e.g., the window size or layer ratio). </p><p>The window size used in GPT-oss is 128 tokens, which is small compared to other models; e.g., Gemma-2 and 3 use window sizes of 4K and 1K tokens, respectively. However, the 1:1 ratio of dense and sparse attention layers is a conservative choice. In fact, other models have successfully explored significantly higher sparsity ratios. For example, Gemma-3 adopts a 5:1 ratio, meaning that there is one dense attention layer for every five sliding window attention layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PY6O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PY6O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 424w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 848w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1272w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png" width="1456" height="436" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PY6O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 424w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 848w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1272w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Attention sinks.</strong> As we might recall, the attention matrix within self-attention is computed as shown above. We take the product of the query and (transposed) key matrix. This operation yields an <code>S x S</code> matrix, where <code>S</code> is the length of the sequence over which we are computing self-attention. After masking and dividing the values of this matrix by the square root of the embedding dimension<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, we apply a row-wise softmax, forming&#8212;<em>for each token in the sequence (or row in the matrix)</em>&#8212;a probability distribution over all other tokens in the sequence. </p><p>We finish the self-attention operation by multiplying this attention matrix by the value matrix. Practically, this takes a weighted sum of the value vectors for each token, where the weights are given by the attention scores; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!awyW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!awyW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 424w, https://substackcdn.com/image/fetch/$s_!awyW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 848w, https://substackcdn.com/image/fetch/$s_!awyW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 1272w, https://substackcdn.com/image/fetch/$s_!awyW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!awyW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png" width="591" height="258.9684065934066" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:638,&quot;width&quot;:1456,&quot;resizeWidth&quot;:591,&quot;bytes&quot;:100853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!awyW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 424w, https://substackcdn.com/image/fetch/$s_!awyW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 848w, https://substackcdn.com/image/fetch/$s_!awyW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 1272w, https://substackcdn.com/image/fetch/$s_!awyW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd876167-bdef-4817-9ee1-4a8e5e749b98_1542x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Although self-attention works incredibly well in its natural form, there is an interesting problem that arises due to the internal softmax used by self-attention. Namely, the attention scores are forced to form a valid probability distribution&#8212;<em>meaning that the attention scores must all be positive and sum to one</em>&#8212;over the set of tokens. Therefore, at least one token in the sequence must receive some weight&#8212;<em>it is impossible for the model to not pay attention to any tokens</em>. </p><p>This property of self-attention can lead to some interesting behaviors from LLMs in practice. For example, prior work [8] has found that LLMs tend to assign high attention scores to semantically meaningless tokens in a sequence. These tokens that spuriously receive a high weight&#8212;<em>usually the first token in the sequence&#8212;</em>are commonly referred to as &#8220;attention sinks&#8221;. This empirical observation stems from the LLM&#8217;s inability to pay attention to no tokens in a sequence. Additionally, the very high scores assigned by LLMs to attention sinks can lead to practical issues; e.g., such outlier attention values <a href="https://arxiv.org/abs/2406.12016">make quantization more difficult</a>.  </p><div class="pullquote"><p>&#8220;We find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task&#8230; We term these tokens attention sinks. Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.&#8221; - from [8]</p></div><p>To solve this issue in the GPT-oss models, the authors use an approach that is very similar to (though not exactly the same as) the technique described in <a href="https://www.evanmiller.org/attention-is-off-by-one.html">this blog post</a> from <a href="https://www.evanmiller.org/index.html">Evan Miller</a>. For each attention head, we create an extra learnable bias that is learned similarly to any other model parameter. This bias appears only in the denominator of the internal softmax operation in self-attention. By setting a high value for this bias in some attention head, the LLM can choose to pay attention to no tokens in a sequence, solving known issues with attention sinks. This approach is explained in the quote below from the GPT-oss model card.</p><blockquote><p><em>&#8220;Each attention head has a learned bias in the denominator of the softmax, similar to off-by-one attention and attention sinks, which enables the attention mechanism to pay no attention to any tokens.&#8221;</em> - from [2]</p></blockquote><h4>Mixture-of-Experts (MoE)</h4><p>Both GPT-oss models use a Mixture-of-Experts (MoE) architecture. Compared to the decoder-only architecture, MoEs modify the feed-forward module in each decoder block. The standard architecture has one feed-forward neural network&#8212;<em>usually made up of two diamond-shaped</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><em> feed-forward layers with a non-linear activation (i.e., GPT-oss models use the <a href="https://arxiv.org/abs/2002.05202">SwiGLU activation</a> in particular [2]) in between</em>&#8212;through which every token is passed individually; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FMd9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FMd9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 424w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 848w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 1272w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FMd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png" width="1456" height="427" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:427,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FMd9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 424w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 848w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 1272w, https://substackcdn.com/image/fetch/$s_!FMd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d3f6b5-316f-474b-a2cc-243cc22ac7ac_1870x548.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Instead of having a single feed-forward network in the feed-forward component of the block, an MoE creates several feed-forward networks, <em>each with their own independent weights</em>. We refer to each of these networks as an &#8220;expert&#8221;. Starting with a standard decoder-only transformer, the MoE converts the transformer&#8217;s feed-forward modules into MoE (or expert) layers, having several independent copies of the original feed-forward network from that layer; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tPDR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Usually, we do not convert every feed-forward layer in the model to an MoE layer for efficiency reasons. Instead, we interleave the MoE layers by using a stride of <code>P</code>&#8212;<em>every </em><code>P</code><em>-th layer in the transformer is converted into an MoE layer</em>.</p><p><strong>Routing.</strong> The primary benefit of MoEs is their efficiency, but using experts alone does not improve efficiency! In fact, the total parameters and compute becomes much larger because we have multiple copies of each feed-forward module. To get an efficiency benefit, we need to add sparsity to this architecture. Let&#8217;s consider a single token&#8212;<em>represented by a </em><code>d</code><em>-dimensional token vector</em>. Our goal is to select a subset of experts (of size <code>k</code>) that will perform a forward pass on this token. In other words, this token will be &#8220;routed&#8221; to these experts. </p><p>The standard way to perform this routing operation is via a linear layer that takes the token vector as input and predicts a vector of size <code>N</code> (i.e., the total number of experts). We can apply a softmax operation to form a probability distribution over the set of experts for each token. Then, this probability distribution can be used to select the top-<code>K</code> experts to which each token is routed, as shown below. Despite its simplicity, this linear routing operation is exactly the approach adopted by OpenAI for the GPT-oss models (from [2]): <em>&#8220;each MoE block consists of&#8230; a standard linear router projection that maps residual activations to scores for each expert.&#8221;</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SAIM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SAIM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 424w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 848w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 1272w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SAIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png" width="534" height="394.50249584026625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1202,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:133510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SAIM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 424w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 848w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 1272w, https://substackcdn.com/image/fetch/$s_!SAIM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457feebf-e8bc-4357-a528-3f47b3c3f5a7_1202x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each token is then sent to its respective expert and we compute the forward pass for each expert over the batch of tokens that have been routed to it. To aggregate the output of each expert, we simply take a weighted average of outputs across all experts, where the weight is given by the probability assigned to each expert by the router. This exact process is used by the GPT-oss models, as described below.</p><blockquote><p><em>&#8220;For both models, we select the top-4 experts for each token given by the router, and weight the output of each expert by the softmax of the router projection over only the selected experts.&#8221;</em> - from [2]</p></blockquote><p><strong>Active parameters.</strong> Because we select a subset of experts for each token, only part of the model&#8217;s parameters are used for processing a given token in the forward pass&#8212;<em>some of the parameters are active, while others are inactive</em>. In the case of GPT-oss, the 20b and 120b models have 32 and 128 total experts within each of their MoE layers. However, only four of these experts are active for each token, leading the models to have 3.6b and 5.1b active parameters, respectively. A more detailed breakdown of parameter counts for these models is provided in the table below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YguE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YguE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 424w, https://substackcdn.com/image/fetch/$s_!YguE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 848w, https://substackcdn.com/image/fetch/$s_!YguE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 1272w, https://substackcdn.com/image/fetch/$s_!YguE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YguE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png" width="1456" height="505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:505,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135049,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YguE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 424w, https://substackcdn.com/image/fetch/$s_!YguE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 848w, https://substackcdn.com/image/fetch/$s_!YguE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 1272w, https://substackcdn.com/image/fetch/$s_!YguE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2307645-9aec-438b-8ada-53e73baa20f9_1798x624.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Compared to other notable MoEs, the GPT-oss models are quite sparse; e.g., the 109b parameter <a href="https://cameronrwolfe.substack.com/p/llama-4">Llama-4 model</a> has 17b active parameters. However, this high sparsity level of GPT-oss is common among the best open-source LLMs:</p><ul><li><p>DeepSeek-R1 [10] has 671b total parameters and 37b active parameters.</p></li><li><p>Qwen-3 [11] MoE models have 30b total parameters and 3b active parameters or 235b total and 22b active parameters.</p></li></ul><p><strong>Load balancing and auxiliary losses.</strong> If we train an MoE similarly to a standard dense model, several issues are likely to occur. First, the model will quickly learn to route all tokens to a single expert&#8212;<em>a phenomenon known as &#8220;routing collapse&#8221;</em>. Additionally, MoEs are more likely to experience numerical instabilities during training, potentially leading to a divergence in the training loss; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!efMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" width="460" height="269.95620437956205" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1370,&quot;resizeWidth&quot;:460,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Divergence in loss during MoE pretraining (<a href="https://cameronrwolfe.substack.com/p/nano-moe">source</a>)</figcaption></figure></div><p>To avoid these issues, most MoEs use a load-balancing loss [9] during training, which modifies the underlying training objective of the LLM by adding an extra loss term to the next-token prediction loss (shown below) that encourages proper routing behavior. More specifically, this loss is minimized when the MoE:</p><ol><li><p>Assigns equal probability to all experts in the router.</p></li><li><p>Dispatches an equal number of tokens to each expert.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HmXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HmXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 424w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 848w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1272w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png" width="1456" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HmXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 424w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 848w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1272w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Beyond the load balancing loss, many MoEs use another auxiliary loss term&#8212;<em>called the router-z loss [12]</em>&#8212;that aims to mitigate numerical instability; see below. The router z-loss constrains the size of the logits outputted by the router of the MoE. These logits are especially prone to numerical instability because they are passed into an (exponential) softmax function to derive a probability distribution over the set of possible experts&#8212;<em>large router logits are a key source of numerical instability that is specific to MoEs (i.e., because standard LLMs do not have a router).</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gPGQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 424w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 848w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1272w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png" width="1456" height="443" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1790688e-5328-45f2-98c0-717ba6041470_2090x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 424w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 848w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1272w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>When training an MoE, we usually also set a fixed capacity factor for every expert, which defines the maximum number of tokens that can be routed to an expert at once. Any tokens that go beyond this capacity factor will simply be dropped<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>; see below. By adopting this capacity factor, we enforce a certain level of uniformity of tokens routed to each expert. Additionally, the capacity factor is beneficial from a computational efficiency perspective&#8212;<em>it allows us to fix the batch size of each expert</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vE2b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vE2b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 424w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 848w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png" width="1456" height="966" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:966,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vE2b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 424w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 848w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>Auxiliary losses modify the MoE&#8217;s training objective, <em>which can negatively impact the performance of the model</em>. As a result, some popular MoE-based LLMs avoid auxiliary losses altogether; e.g., DeepSeek-V3 [13] uses an auxiliary-loss-free approach for load balancing that adds a bias term to the logit predicted by the router for each expert. This per-expert bias can be dynamically adjusted during training to encourage balanced routing between experts. This approach is shown to work well in [13], but authors still use auxiliary losses&#8212;<em>with a much lower weight relative to standard MoE training</em>&#8212;when training their final model. </p><p>OpenAI has not disclosed the specific training loss used for the GPT-oss models, but most public MoEs are trained with auxiliary losses, heuristic load balancing methods, or a combination of both. With this in mind, we can reasonably assume that the GPT-oss models use some combination of similar (potentially modified) techniques to avoid issues like numerical instability and routing collapse.</p><p><strong>Other details and further learning.</strong> Beyond the details outlined above, OpenAI mentions that the GPT-oss models use <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a> (a standard choice for LLMs these days) and that they create &#8220;expert-optimized&#8221; <a href="https://openai.com/index/triton/">triton kernels</a> to boost training efficiency for their MoE architecture.  For more details on MoEs, see the blog post below. This overview builds an understanding of MoE-based LLMs from scratch and culminates with implementing and training a GPT-2-scale MoE, called nanoMoE. The code for nanoMoE can be found in <a href="https://github.com/wolfecameron/nanoMoE">this repository</a>.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;849cfab0-df59-4a72-b0b9-c259a8f7a271&quot;,&quot;caption&quot;:&quot;A full guide for building and training your own medium-scale MoE from scratch in pure PyTorch.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;nanoMoE: Mixture-of-Experts (MoE) LLMs from Scratch in PyTorch&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-10T09:33:27.000Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/868fce62-b8a5-4ae9-8c71-71494ff27787_2394x1342.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/nano-moe&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:155023686,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:141,&quot;comment_count&quot;:12,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Origins of the GPT-oss Architecture</h4><blockquote><p><em>&#8220;Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network and an additional layer normalization was added after the final self-attention block.&#8221;</em> - from [13]</p></blockquote><p>Many of the design choices in the GPT-oss models are not new&#8212;<em>OpenAI has been using them since <a href="https://cameronrwolfe.substack.com/i/85568430/language-models-are-unsupervised-multitask-learners-gpt">GPT-2</a> and<a href="https://cameronrwolfe.substack.com/i/88082618/language-models-are-few-shot-learners"> GPT-3</a></em>! In many ways, the GPT-oss architecture is built upon ideas from these earlier models. Given that GPT-3 [14] was released over five years before GPT-oss, this is incredibly impressive&#8212;<em>especially in the dynamic world of LLM research</em>. Both the pre-norm structure (adopted from GPT-2; see above) and the alternating dense and banded window attention (adopted from GPT-3; see below) are not new. However, the earlier GPT models still lacked many modern architectural developments for LLMs such as GQA, long context strategies like YaRN (i.e., GPT-3 has only a 2K token context window), expert layers, and proper tokenization for handling multi-turn chat or agents. </p><blockquote><p><em>&#8220;We use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.&#8221;</em> - from [14]</p></blockquote><h2>Context Management for the Agentic Era</h2><p>Now that we understand the architecture of GPT-oss, we will take a look at the most heavily emphasized aspects of these models&#8212;<em>agents and reasoning</em>. In particular, we are going to deep dive into the tokenizer and prompt format used for these models. As we will see, OpenAI adopts a highly-complex input format for the GPT-oss models that is focused on handling hierarchical instructions, tool use, reasoning, structured outputs and multi-turn chat with a unified structure. After covering the Harmony format, we will also outline the context extension approach that is used to achieve a context window of 131K tokens for GPT-oss. </p><h4>Tokenizer</h4><p>When interacting with an LLM, we provide a textual prompt as input to the model, but this is not the input that the LLM sees. The LLM uses a tokenizer&#8212;<em>usually a <a href="https://sebastianraschka.com/blog/2025/bpe-from-scratch.html">byte-pair encoding (BPE) tokenizer</a></em>&#8212;to break this textual prompt into a sequence of discrete words or sub-words, which we call tokens; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gVlM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gVlM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 424w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 848w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 1272w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gVlM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png" width="476" height="273.62874251497004" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bee452d9-b639-468e-929e-af60ef372121_1336x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1336,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:85487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gVlM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 424w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 848w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 1272w, https://substackcdn.com/image/fetch/$s_!gVlM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbee452d9-b639-468e-929e-af60ef372121_1336x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Internally, the tokenizer has a vocabulary, or a fixed-size set of all tokens that are known to the tokenizer. Each of these tokens is associated with a unique integer index that can be mapped to a vector embedding within the embedding layer of the LLM. Therefore, we can map each of our tokens to a corresponding token embedding, which lets us convert our sequence of tokens into a sequence of vectors; see below. This sequence of token vectors, which forms a matrix (or tensor if we have a batch of inputs), is then passed as input to the transformer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W7jv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W7jv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 424w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 848w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W7jv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png" width="490" height="493.81619937694705" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1284,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:123143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W7jv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 424w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 848w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!W7jv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0dbadd5-fe22-4ac4-be6a-59ff8677af0f_1284x1294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Chat templates.</strong> Beyond the basic tokenization functionality outlined above, we can also create &#8220;special&#8221; tokens in our tokenizer. For example, LLMs usually have a dedicated &#8220;stop&#8221; token like <code>&lt;eos&gt;</code> or <code>&lt;|end_of_text|&gt;</code> that signals the end of a sequence. These are unique tokens in the vocabulary, and we can train the LLM to output such a token when it finishes generating a sequence of text.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1jdU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1jdU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 424w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 848w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1jdU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png" width="1456" height="688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171010,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1jdU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 424w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 848w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!1jdU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0da140c2-3de1-488c-8996-eb838b956904_2194x1036.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Beyond stop tokens, we can use special tokens to format complex inputs in a way that is more understandable to an LLM. For example, we can use special tokens to create a chat template for formatting multi-turn conversations. An example of this is shown below, where we use the chat template for <a href="https://huggingface.co/Qwen/Qwen3-32B">Qwen-3</a> to convert a multi-turn conversation into the textual prompt that is actually passed to the model. All special tokens within this prompt have been highlighted for clarity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Doyp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Doyp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 424w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 848w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Doyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png" width="572" height="438.1453488372093" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1054,&quot;width&quot;:1376,&quot;resizeWidth&quot;:572,&quot;bytes&quot;:234997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Doyp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 424w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 848w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!Doyp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1cfb6c61-8e9a-488a-935a-a8bdab8a6e3d_1376x1054.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Applying a chat template to a multi-turn conversation</figcaption></figure></div><p>As we can see, this chat template uses the special tokens <code>&lt;|im_start|&gt;</code> and <code>&lt;|im_end|&gt;</code> to signify the start and end of a chat turn, respectively. Then, the source of each chat turn&#8212;<em>the user, assistant, or a system message</em>&#8212;is captured by another special token that is placed at the beginning of each chat turn. Using a chat template allows us to encode complex conversations into a flat prompt.</p><p><strong>Tool usage.</strong> We can capture tool calls with a similar approach. An LLM can make a tool call by outputting a sequence similar to the one shown below. Here, the LLM initiates a tool call by outputting the special token <code>&lt;START TOOL&gt;</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N4MY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N4MY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 424w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 848w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1272w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png" width="1456" height="214" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:214,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N4MY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 424w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 848w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1272w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Tool calls are generated inline with an LLM&#8217;s standard output</figcaption></figure></div><p>When this special tool-calling token is generated, we:</p><ol><li><p>Stop generating text with the LLM.</p></li><li><p>Parse the arguments for the tool call from the model&#8217;s output.</p></li><li><p>Make the call to the specified tool.</p></li><li><p>Add the output from the tool back into the LLM&#8217;s text sequence. </p></li><li><p>Continue generating the rest of the sequence.</p></li></ol><p>In this way, the LLM gains the ability to make a tool call and gather additional context while generating an output. Such an approach can help greatly with reducing hallucinations or injecting up-to-date information into an LLM. </p><p><strong>Reasoning models</strong> also use special tokens to separate their reasoning process from the final model output. Specifically, reasoning models usually begin their output with the special <code>&lt;think&gt;</code> token. Following this start thinking token, the model will output a long explanation in which it reasons through the prompt and decides how it should respond to the prompt. Once this reasoning process concludes, the model will output the <code>&lt;/think&gt;</code> token to signal the end of the reasoning process. From here, the model outputs its final response, eventually ending with a standard stop token like <code>&lt;|im_end|&gt;</code>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Way8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png" width="1456" height="1034" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326292,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Way8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 424w, https://substackcdn.com/image/fetch/$s_!Way8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 848w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1272w, https://substackcdn.com/image/fetch/$s_!Way8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653f2fd4-4b8c-44ae-82f2-c6e906c6a80d_1544x1096.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Anatomy of a reasoning model&#8217;s output (using Qwen-3-8B)</figcaption></figure></div><p>The core idea here is always the same: <em>we use special tokens and chat templates to format many different input and output types in a way that is understandable to the LLM and easy to parse / process for the developer</em>. As we move towards broader and more capable agents, the complexity of this templating process increases. For more details on how tool calling, reasoning and more are handled within LLMs (and AI agents in general), see the overview below. Next, we will take a deeper look at the prompt template that is used by GPT-oss, called the Harmony prompt format.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;329905fa-322e-4bc3-b859-c3904dca9072&quot;,&quot;caption&quot;:&quot;In this overview, we will build an understanding of AI agents from first principles. Starting with a standard text-to-text LLM, we will explore how functionalities like tool usage, reasoning and more can enhance a standard LLM, leading to the creation of complex, autonomous systems.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;AI Agents from First Principles&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-06-09T09:33:09.032Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cee4a772-78a7-41b7-8cf1-4da233376ea6_2002x1122.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/ai-agents&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:164903679,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:314,&quot;comment_count&quot;:24,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Harmony Format for Agents, Reasoning &amp; Tool Calling</h4><p>The tokenizer and chat template for an LLM dictate the format of input provided to the model, as well as control how a model manages multiple kinds of inputs and outputs. The (BPE) tokenizers used for OpenAI models are available publicly within the <a href="https://github.com/openai/tiktoken">tiktoken package</a>. Prior models like GPT-4o and GPT-4o-mini used the <code>o200k</code> tokenizer with a vocabulary size of 200K tokens, while GPT-oss models use the modified <code>o200k_harmony</code> tokenizer, which has an extended vocabulary of 201,088 tokens to support their new Harmony prompt format. </p><blockquote><p><em>&#8220;The model can interleave CoT, function calls, function responses, intermediate messages that are shown to users, and final answers.&#8221;</em> - from [2]</p></blockquote><p>The Harmony prompt format is used by both GPT-oss models and is a great illustration of the complex chat templates required by modern agentic LLM systems. The GPT-oss models emphasize tool usage and are specially trained to be useful in agentic scenarios; e.g., the post-training process teaches the models how to use various tools (e.g., browsing tools, python runtime and arbitrary developer functions) and the models can run with or without tools based on instructions provided by the developer. The Harmony prompt format plays a huge role in making these capabilities possible via standardized formatting.</p><p>The Harmony prompt format has the roles outlined below. These roles include standard roles like user and assistant. However, a new role is created to specifically support tool calling, and the system message is separated into two new roles&#8212;<em>system or developer</em>&#8212;that capture different aspects of a traditional LLM system message. The system role captures top-level metadata, while the developer message provides instructions from the developer to the model. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TG5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TG5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 424w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 848w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 1272w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TG5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png" width="1456" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:208591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TG5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 424w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 848w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 1272w, https://substackcdn.com/image/fetch/$s_!TG5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef5925d0-7cf9-462e-b7f9-a457a8af4c8d_2182x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://cookbook.openai.com/articles/openai-harmony">source</a>)</figcaption></figure></div><p>The roles in the Harmony prompt format form the <a href="https://arxiv.org/abs/2404.13208">instruction hierarchy</a> shown below. This hierarchy defines the order of precedence for instructions provided to the LLM. If multiple instructions contain conflicting information, the highest-ranking instruction (according to the role hierarchy below) should be obeyed; e.g., the developer message takes precedence over a user message. <em>The GPT-oss models are specifically aligned to adhere to this instruction hierarchy during post-training.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_5QB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_5QB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 424w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 848w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 1272w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_5QB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png" width="1456" height="113" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:113,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48160,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_5QB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 424w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 848w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 1272w, https://substackcdn.com/image/fetch/$s_!_5QB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1ced8c8-3d70-4d3b-bb69-4e7ecdd65fdb_2342x182.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Instruction hierarchy for GPT-oss</figcaption></figure></div><p>For the assistant role specifically, the Harmony format defines three different channels in which the assistant can provide an output; see below. Put simply, these different channels are used to differentiate the final output provided by the model from different kinds of outputs; e.g., tool calls or reasoning traces. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N3Oy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N3Oy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 424w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 848w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 1272w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N3Oy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png" width="1456" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N3Oy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 424w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 848w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 1272w, https://substackcdn.com/image/fetch/$s_!N3Oy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f6d334-315e-4514-9e41-9d3feead223e_2178x616.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://cookbook.openai.com/articles/openai-harmony">source</a>)</figcaption></figure></div><p>By separating the model&#8217;s output into multiple channels, we can differentiate between user and internal-facing outputs&#8212;<em>in most LLM UIs only the final message is actually displayed to the user</em>. Additionally, using multiple output channels makes more complex output scenarios easier to handle. To illustrate, assume the LLM sequentially generates the following outputs: tool call &#8594; reasoning &#8594; final output. These outputs would each fall in a separate assistant channel, which allows us to easily parse each component of the output and decide next steps.</p><p><strong>Concrete example.</strong> The Harmony prompt format is explained in detail in the accompanying <a href="https://cookbook.openai.com/articles/openai-harmony">developer documentation</a>, and OpenAI even released a <a href="https://pypi.org/project/openai-harmony/">Python package</a> for properly constructing and rendering messages in the Harmony format. Using this package, we construct a concrete example of a sequence of messages for GPT-oss, rendered using the Harmony prompt format; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5LH1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5LH1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 424w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 848w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 1272w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5LH1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png" width="1166" height="1278" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1278,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253011,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5LH1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 424w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 848w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 1272w, https://substackcdn.com/image/fetch/$s_!5LH1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee312c05-dacc-4aea-905d-8578444ae442_1166x1278.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Harmony prompt format example</figcaption></figure></div><p>Here, we see an example of all components of the Harmony prompt format in action. Specifically, this example demonstrates the differentiation between the developer and system messages, uses all available output channels for the assistant, provides examples of both thinking and tool calling, then synthesizes all of this information to provide a final output to the user. A list of all special tokens that can be used in the Harmony prompt format is provided below for reference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hXXz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hXXz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 424w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 848w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 1272w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hXXz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png" width="1456" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hXXz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 424w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 848w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 1272w, https://substackcdn.com/image/fetch/$s_!hXXz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5001ae4-62b1-4531-9755-91ecea354da2_1490x724.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://cookbook.openai.com/articles/openai-harmony">source</a>)</figcaption></figure></div><h4>Long Context</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJH6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" width="482" height="287.34615384615387" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://openai.com/index/learning-to-reason-with-llms/">source</a>)</figcaption></figure></div><p>The ability to ingest and understand long contexts is important for all LLMs, but it is especially important for reasoning models due to the fact that they output a long CoT&#8212;<em>which can be several thousand or tens of thousands of tokens long</em>&#8212;before providing their final output; see above. Luckily, both GPT-oss models are trained to support a context window of 131K tokens in their dense layers. Such long context is made possible via a combination of commonly-used techniques.</p><p><strong>Position embeddings.</strong> The self-attention mechanism in transformers does not naturally consider the order of tokens&#8212;<em>each token is treated the same regardless of its position in the sequence</em>. However, knowing the order of tokens is essential for LLMs. For instance, predicting the next token would be much harder if we only knew which tokens came before, but not their sequence. For this reason, we must explicitly add position information into the LLM. The original transformer created unique vector embeddings for every position in the sequence and added these position embeddings to each token at the input layer; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s0ac!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0ac!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 424w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 848w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1272w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png" width="472" height="280.0879120879121" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:472,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s0ac!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 424w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 848w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1272w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This approach directly injects information about each token&#8217;s absolute sequence position into the token&#8217;s embedding. Then, this modified embedding is ingested by the transformer as input, allowing the model to use the position information.</p><p><strong>RoPE.</strong> Most modern LLMs no longer use absolute position encodings, choosing instead to encode relative position (i.e., distances between token pairs) or some mixture of relative and absolute position. Relative position encodings allow the transformer to more easily handle longer sequences. Whereas absolute position requires that the LLM be trained on sequences up to a certain length, <em>relative position is generalizable and unrelated to the total length of a sequence</em>. The most commonly-used position encoding scheme for LLMs&#8212;<em>and the approach used by both GPT-oss models</em>&#8212;is Rotary Position Embedding (RoPE) [15]; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FT7A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FT7A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 424w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 848w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1272w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png" width="521" height="327.41414835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:521,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FT7A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 424w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 848w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1272w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>RoPE is a hybrid position encoding scheme&#8212;<em>meaning that it considers both absolute and relative information</em>&#8212;that modifies the query and key vectors in self-attention. Unlike absolute position embeddings, RoPE acts upon every transformer layer&#8212;<em>not just the input layer</em>. In self-attention, key and query vectors are produced by passing input token vectors through separate linear layers. This operation, which is identical for key and query vectors (aside from using separate linear layers with their own weights) is depicted below for a single token embedding. Throughout this section, we will assume our token vectors have dimension <code>d</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fsp7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fsp7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 424w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 848w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1272w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png" width="379" height="218.68377823408625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:974,&quot;resizeWidth&quot;:379,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fsp7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 424w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 848w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1272w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Projecting a token embedding to form a key in self-attention</figcaption></figure></div><p>To incorporate position information into self-attention, RoPE modifies the above operation by multiplying the weight matrix <code>W_k</code> by a unique <a href="https://en.wikipedia.org/wiki/Rotation_matrix">rotation matrix</a> that is computed based upon the absolute position of a token in the sequence. In other words, the amount that we rotate key and query vectors changes based upon their position in the sequence. This modified operation is shown below. We again depict the creation of a key vector, but the process is the same for query vectors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IEiI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IEiI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 424w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 848w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1272w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png" width="594" height="252.93956043956044" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:594,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IEiI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 424w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 848w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1272w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Incorporating position information via a rotation matrix</figcaption></figure></div><p>&#952; is a vector of size <code>d / 2</code> called the rotational (or frequency) basis vector. The values of the rotational basis vector are created as shown in the equation below. As we can see, the entries of the vector are dictated by the base frequency&#8212;<em>a hyperparameter that we must set in RoPE</em>. The original RoPE paper uses a base frequency of 10K, but we will soon see that this setting is <a href="https://arxiv.org/abs/2310.05209">not always optimal</a>!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XoQX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XoQX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 424w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 848w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 1272w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XoQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png" width="657" height="150.7129120879121" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:1456,&quot;resizeWidth&quot;:657,&quot;bytes&quot;:170874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XoQX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 424w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 848w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 1272w, https://substackcdn.com/image/fetch/$s_!XoQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25bb01ab-5d4f-4b6a-b4ba-3907f28ad0ff_2130x488.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Constructing the frequency basis vector for RoPE</figcaption></figure></div><p>We have a function <code>R</code> that takes the rotational basis vector &#952; and the absolute token position <code>i</code> as input and produces the rotation matrix shown below. This matrix is <a href="https://mathworld.wolfram.com/BlockDiagonalMatrix.html">block diagonal</a>, and each block in the matrix is a <code>2 &#215; 2</code> rotation matrix that rotates a pair of two dimensions in the key (or query) embedding. As we can see in the expression below, the fact that this matrix is composed of <code>2 &#215; 2</code> blocks is exactly why our frequency basis vector has a dimension of <code>d / 2</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!63HZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!63HZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 424w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 848w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1272w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png" width="676" height="261.39285714285717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1456,&quot;resizeWidth&quot;:676,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!63HZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 424w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 848w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1272w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Creating a RoPE rotation matrix (from [15])</figcaption></figure></div><p>After being multiplied by this matrix, each pair of dimensions in the output embedding is rotated based upon:</p><ol><li><p>The absolute position of the token in the sequence <code>i</code>. </p></li><li><p>The entry of &#952; corresponding to that pair of dimensions.</p></li></ol><p>We apply this rotation matrix when producing both key and query vectors for self-attention in every transformer layer, yielding the operation shown below that rotates all vectors according to their absolute position in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dwRu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dwRu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 424w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 848w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1272w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png" width="596" height="360.2197802197802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1456,&quot;resizeWidth&quot;:596,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dwRu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 424w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 848w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1272w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Rotated keys and queries for self-attention in RoPE</figcaption></figure></div><p>When we multiply the rotated keys and queries, something interesting happens. The rotation matrices for keys and queries combine to form a single rotation matrix: <code>R(&#952;, n - m)</code>. In other words, the combination of rotating both the key and query vectors in self-attention captures the relative distance between tokens in the sequence. This is the crux of RoPE&#8212;<em>the rotation matrices inject the relative position of each token pair directly into the self-attention mechanism</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wdXa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wdXa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 424w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 848w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1272w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wdXa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 424w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 848w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1272w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [17])</figcaption></figure></div><p><strong>Scaling RoPE to longer context.</strong> Ideally, we want our LLM to be capable of generalizing to contexts longer than those seen during training, but researchers have shown that most position encoding schemes&#8212;<em>including RoPE</em>&#8212;generalize poorly to longer contexts [17]; see above. To create an LLM that can handle long context, we usually add an additional training stage:</p><ol><li><p>First, we perform standard pretraining with lower context length.</p></li><li><p>Then, we further train on a long context dataset (i.e., context extension).</p></li></ol><p>This two-stage approach is adopted to save training costs. Long context training consumes a lot of memory and, therefore, would be expensive to adopt during the full pretraining process of the LLM. <a href="https://youtu.be/dc4chADushM">Many techniques</a> exist for context extension, but GPT-oss models focus specifically upon an technique called YaRN [20], which is used to extend the context of dense attention layers to 131K tokens. Let&#8217;s cover some background on context extension to understand how YaRN works.</p><div class="pullquote"><p>&#8220;We present YaRN, a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow.&#8221; - from [18]</p></div><p><strong>Position interpolation.</strong> One of the simplest forms of context extension with RoPE is position interpolation (PI) [22]. PI defines a scaling factor <code>s = L / L&#8217;</code>, where <code>L</code> is the context window used during the first stage of training and <code>L&#8217;</code> is the model&#8217;s desired context window (after context extension). We assume <code>L&#8217; &gt; L</code>. From here, we modify the creation of the rotation matrix as shown below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cT3Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cT3Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 424w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 848w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 1272w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cT3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png" width="1456" height="521" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b471b90c-af80-459b-a174-0e8b1241f256_1918x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:521,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216521,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cT3Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 424w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 848w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 1272w, https://substackcdn.com/image/fetch/$s_!cT3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb471b90c-af80-459b-a174-0e8b1241f256_1918x686.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Adding position interpolation into RoPE</figcaption></figure></div><p>This approach interpolates the position indices used within RoPE such that larger positions&#8212;<em>up to a length of </em><code>L&#8217;</code>&#8212;fall within the original context window of the LLM. After this scaling is applied, we complete the context extension process by further finetuning the model on a long context dataset. PI purely updates the position indices and does not consider the values of the rotational basis vector <code>&#952;</code> at all&#8212;<em>this is referred to as a &#8220;blind&#8221; interpolation method</em>.</p><p><strong>NTK-aware interpolation.</strong> Beyond PI, many recent LLMs have modified the base frequency of RoPE for the purpose of context extension. The original frequency basis used in the RoPE paper is 10K. However, Gemma-3 increases the frequency basis of RoPE to 1M [16], while Llama-3 uses a frequency basis of 500K [19]. </p><blockquote><p><em>&#8220;We increase RoPE base frequency from 10K to 1M on global self-attention layers, and keep the frequency of the local layers at 10K.&#8221;</em> - from [16]</p></blockquote><p>One of the key issues with PI is that it scales every dimension of RoPE equally. For this reason, we see in the YaRN paper that PI can cause performance on short contexts to degrade at the cost of teaching the LLM to handle longer contexts. To solve this issue, we need a non-uniform approach for scaling or interpolating the RoPE dimensions. More specifically, we want to spread out the interpolation &#8220;pressure&#8221; by scaling high-frequency features&#8212;<em>or those with a higher value of </em><code>&#952;_i</code>&#8212;differently than low frequency features. Concretely, this can be done by scaling the frequency basis in RoPE instead of the scaling the position indices. This approach is called <a href="https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/">NTK-aware interpolation</a>. </p><p><strong>YaRN.</strong> We can define a wavelength <code>&#955;</code> for each dimension of the frequency basis vector in RoPE. Specifically, the wavelength is <code>&#955;_j = 2&#960; / &#952;_j</code> (i.e., this is just the standard equation for a wavelength) for the <code>j</code>-th dimension of the frequency basis vector. A &#8220;high frequency&#8221; dimension&#8212;<em>as mentioned above</em>&#8212;would refer to a hidden dimension <code>j</code> in the frequency basis vector with a low wavelength; see <a href="https://en.wikipedia.org/wiki/Wavelength">here</a> for more details. The NTK-aware interpolation method presented above still performs uniform scaling of the base frequency&#8212;<em>the wavelength is not considered.</em></p><p>Alternatively, we could toggle how we perform interpolation based on the wavelength of a given dimension. Specifically, we can define a ratio between the context length of the LLM and the wavelength of a given RoPE dimension: <code>r(j) = L / &#955;_j</code>. Based on this ratio, we can define the function below to dynamically determine the base frequency used by a given RoPE dimension. This expression defines two extra hyperparameters <code>&#945;</code> and <code>&#946;</code>, which must be tuned on a case-by-case basis but are set to respective values of 1 and 32 in [20]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NqlQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NqlQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 424w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 848w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 1272w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NqlQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png" width="530" height="336.84859154929575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:722,&quot;width&quot;:1136,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:116605,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NqlQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 424w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 848w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 1272w, https://substackcdn.com/image/fetch/$s_!NqlQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6a7b42f-add7-4a94-8e75-f4ff05248a91_1136x722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">NTK-by-parts interpolation (from [20])</figcaption></figure></div><p>This approach is called NTK-by-parts interpolation. Intuitively, this interpolation approach uses the ratio <code>r(j)</code> to toggle how interpolation is performed:</p><ol><li><p>If the wavelength <code>&#955;_j</code> is much smaller than the model&#8217;s context length <code>L</code>, then we perform no interpolation.</p></li><li><p>If the wavelength <code>&#955;_j</code> is larger than <code>L</code>, then we interpolate the base frequency for RoPE.</p></li><li><p>Otherwise, we perform a bit of both by mixing these two methods.</p></li></ol><p>In this way, we can control how interpolation is performed dynamically based on the frequency of each RoPE dimension. YaRN is very similar to NTK-by-parts interpolation. It uses the exact same interpolation technique outlined above, but we also add a temperature scaling parameter to the softmax in self-attention as shown below. Similar to other techniques, we have to further finetune the model on long context data after interpolating via YaRN to perform context extension. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ikMi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ikMi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 424w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 848w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 1272w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ikMi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png" width="389" height="66.22896281800391" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:174,&quot;width&quot;:1022,&quot;resizeWidth&quot;:389,&quot;bytes&quot;:37094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ikMi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 424w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 848w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 1272w, https://substackcdn.com/image/fetch/$s_!ikMi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e81e18f-6eef-4432-908b-cf5fec6259ee_1022x174.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [20])</figcaption></figure></div><h2>Training Process</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OuS0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" width="1456" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As shown above, the training process for a modern LLM&#8212;<em>though variance exists between models</em>&#8212;can be divided into a few standardized phases:</p><ol><li><p><strong>Pretraining</strong> is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> training objective. The primary purpose of pretraining is to instill a broad and high-quality knowledge base within the LLM; see <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">here</a>.</p></li><li><p><strong>Supervised finetuning (SFT)</strong> or <strong>instruction finetuning (IFT)</strong> also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate. The primary purpose of SFT is to teach the LLM basic formatting and instruction following capabilities; see <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">here</a>.</p></li><li><p><strong>Reinforcement learning from human feedback (RLHF)</strong> or <strong>preference finetuning (PreFT)</strong> uses <a href="https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning">reinforcement learning (RL)</a> to train the LLM over human preference data. The key purpose of RLHF is to align the LLM with human preferences; i.e., teach the LLM to generate outputs that are rated positively by humans as described <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">here</a>.</p></li><li><p><strong>Reinforcement learning from verifiable rewards (RLVR)</strong> or <strong>reinforcement finetuning (RFT) </strong>trains the LLM with RL on <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">verifiable tasks</a>, where a reward can be derived deterministically from rules or heuristics. This final training stage is useful for improving reasoning performance or&#8212;<em>more generally</em>&#8212;performance on any verifiable task.</p></li></ol><p>We collectively refer to the stages after pretraining as the &#8220;post-training&#8221; process. Despite releasing the weights of GPT-oss, OpenAI chooses to share very few details on the pre or post-training process for these models. Nonetheless, we will use this section to go over the training details&#8212;<em>mostly focused upon safety and reasoning</em>&#8212;that were shared about GPT-oss by OpenAI. </p><h4>General Training Information</h4><p><strong>Pretraining.</strong> The GPT-oss models have a knowledge cutoff date of June 2024 and are trained over a text-only dataset that is primarily English&#8212;<em>these models are neither multi-modal or multi-lingual</em>. Interestingly, however, these models still perform (relatively) well on <a href="https://huggingface.co/datasets/openai/MMMLU">multilingual benchmarks</a>, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cRPV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cRPV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 424w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 848w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 1272w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cRPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png" width="520" height="314.24460431654677" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1112,&quot;resizeWidth&quot;:520,&quot;bytes&quot;:162468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cRPV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 424w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 848w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 1272w, https://substackcdn.com/image/fetch/$s_!cRPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8547622c-a27d-4e85-9c6e-99827a37c12b_1112x672.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>The pretraining dataset contains &#8220;trillions of tokens&#8221; and focuses on the domains of STEM, coding and general knowledge. However, this description provides little concrete information&#8212;<em>most open LLMs are trained with 15-20T tokens, so saying that the models were trained on &#8220;trillions&#8221; of tokens does not tell us much</em>. </p><blockquote><p><em>&#8220;We use our Moderation API and safety classifiers to filter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN.&#8221;</em> - <a href="https://openai.com/index/gpt-4o-system-card/">GPT-4o system card</a></p></blockquote><p><strong>Safety filtering.</strong> One of the few notable details authors mention about the data used to pretrain GPT-oss models is that they perform safety filtering of the pretraining data. More specifically, GPT-oss re-uses the safety filters from the GPT-4o model to remove harmful data from the model&#8217;s pretraining dataset, especially focusing upon the Chemical, Biological Radiological and Nuclear (CBRN) domain. As outlined in the above quote, the safety filters used for GPT-4o are based on OpenAI&#8217;s moderation API. In a <a href="https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/">recent blog post</a>, OpenAI revealed that the moderation API is LLM-based&#8212;<em>it uses a version of GPT-4o that has been specialized to detect harmful text and images according to a predefined taxonomy</em>. In other words, prior GPT models are used to curate training data for GPT-oss!</p><p><strong>Quantization-aware training.</strong> To make an LLM more compute and memory efficient, we can perform <a href="https://huggingface.co/docs/optimum/en/concept_guides/quantization">quantization</a>&#8212;<em>or conversion into a lower-precision format</em>&#8212;on the model&#8217;s weights. However, quantizing an LLM has the potential to deteriorate the model&#8217;s performance. To avoid this performance deterioration, we can perform <a href="https://pytorch.org/blog/quantization-aware-training/">quantization-aware training</a>, which trains the model with lower precision to make the model more robust to quantization at inference time.</p><p>The GPT-oss models quantize the weights of their MoE layers&#8212;<em>making up over 90% of the models&#8217; total parameter count</em>&#8212;using <a href="https://arxiv.org/abs/2310.10537">Microscaling FP4 (MXFP4) format</a>, which uses only 4.25 bits per model parameter! This quantization scheme is also used in the post-training process (i.e., the GPT-oss models undergo quantization-aware training) so that the model becomes more robust to quantization. Quantizing the MoE weights in this way makes the GPT-oss models very memory efficient&#8212;<em>even the larger 120b model can fit on a single 80Gb GPU</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MecG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MecG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 424w, https://substackcdn.com/image/fetch/$s_!MecG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 848w, https://substackcdn.com/image/fetch/$s_!MecG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 1272w, https://substackcdn.com/image/fetch/$s_!MecG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MecG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png" width="452" height="291.8871473354232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:638,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:88292,&quot;alt&quot;:&quot;image/png&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image/png" title="image/png" srcset="https://substackcdn.com/image/fetch/$s_!MecG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 424w, https://substackcdn.com/image/fetch/$s_!MecG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 848w, https://substackcdn.com/image/fetch/$s_!MecG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 1272w, https://substackcdn.com/image/fetch/$s_!MecG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F472f0cb9-f769-43c5-9000-2dd4e8801853_638x412.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me">source</a>)</figcaption></figure></div><p><em>How is it possible for a parameter to use 4.25 bits?</em> As explained in <a href="https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me">this approachable blog</a> on the topic, MXFP4 represents each model parameter with four bits&#8212;<em>one sign bit, two exponent bits, and one mantissa bit</em>. Then, the model&#8217;s parameters are broken into blocks of 32 parameters, where each block has a shared eight-bit exponential scaling factor (i.e., an extra 0.25 bits per parameter)&#8212;<em>this is why the MXFP4 format is referred to as &#8220;microscaling&#8221;</em>. See above for a schematic depiction of the format. Previously, training a model at four-bit precision was very difficult, but MXFP4 uses several tricks (e.g., stochastic rounding, block-wise quantization and random <a href="https://en.wikipedia.org/wiki/Hadamard_transform">Hadamard transforms</a> for handling outlier values) to make natively training an LLM&#8212;<em>such as GPT-oss</em>&#8212;at such a low precision feasible. </p><p><strong>Other details.</strong> Beyond everything outlined above, OpenAI provides a few more random details about the GPT-oss training process scattered throughout the models&#8217; various technical reports. For example, the alignment process is still based upon OpenAI&#8217;s <a href="https://openai.com/index/introducing-the-model-spec/">model spec</a>, though new drafts of the model spec are being released frequently. The training process also encourages the models to use CoT reasoning and tools prior to providing a final answer. <a href="https://www.interconnects.ai/p/summertime-outlook-o3s-novelty-coming">Incentivizing tool use</a> correctly during training is hard, but OpenAI&#8212;<em>as demonstrated by <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o3&#8217;s impressive search capabilities</a></em>&#8212;is very good at this. </p><h4>Reasoning Training</h4><p>Both GPT-oss models are reasoning models, which are currently a very popular topic in AI research. Several open reasoning models have been released recently (e.g., DeepSeek-R1 [10] and Qwen-3 [11]) as well, which likely fueled OpenAI&#8217;s decision to release an open reasoning model of their own. We recently covered the details of reasoning models in the post below. However, we will go over the key ideas behind reasoning models in this section for the purpose of being comprehensive. Additionally, the GPT-oss model and associated reports make some really interesting comments about the correct way of training reasoning models that provide an interesting perspective on OpenAI&#8217;s safety strategy. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6967627f-25ca-4818-a696-339d266a3c97&quot;,&quot;caption&quot;:&quot;For the last several years, we have used a relatively fixed pipeline for training large language models (LLMs); see below. First, we pretrain these language models over raw textual data from the internet. Afterwards, we align them&#8212;or train them to produce outputs that are preferable to humans&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Demystifying Reasoning Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;Research @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-18T10:33:55.513Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d9c87e-b238-4fdd-996e-4ed4465b9931_2334x1282.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/demystifying-reasoning-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153722335,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:235,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>What is a reasoning model?</strong> The main difference between a reasoning model and a standard LLM is the ability to &#8220;think&#8221; before answering a question. Specifically, the LLM thinks by outputting a CoT&#8212;<em>also known as a long CoT, reasoning trace, or reasoning trajectory</em>&#8212;prior to its final answer. This reasoning trajectory is generated no differently than any other sequence of text. However, we do usually surrounding the reasoning trajectory by special tokens (e.g., the <code>&lt;think&gt;</code> token; see below) to differentiate it from the LLM&#8217;s standard output.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M6eC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M6eC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 424w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 848w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 1272w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M6eC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png" width="1224" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M6eC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 424w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 848w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 1272w, https://substackcdn.com/image/fetch/$s_!M6eC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90ce9bd2-4d69-46cf-a09c-3b7429ac3deb_1224x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>Unlike traditional chains of thought, however, this long CoT can be thousands of tokens long. Additionally, many reasoning models also provide the ability to control the reasoning effort of the model, where a &#8220;high&#8221; level of reasoning effort would lead the model to increase the length of its reasoning trajectory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. In this way, we can increase the amount of inference-time compute used by the model.</p><p><strong>Reasoning trajectories.</strong> Many closed LLMs do not make the model&#8217;s reasoning trajectory visible to the user&#8212;<em>only the final output is displayed and the long CoT is hidden</em>. However, if we look at <a href="https://openai.com/index/learning-to-reason-with-llms/">some examples</a> of reasoning trajectories from OpenAI&#8217;s o-series models or from open reasoning models, we will notice that these models exhibit sophisticated reasoning behaviors in their long CoT:</p><ul><li><p>Thinking through each part of a complex problem.</p></li><li><p>Decomposing complex problems into smaller, solvable parts.</p></li><li><p>Critiquing solutions and finding errors.</p></li><li><p>Exploring many alternative solutions.</p></li></ul><p>In many ways, the model is performing a complex, text-based search process to find viable solution to a prompt in the long CoT. Such behavior goes beyond any previously-observed behavior with standard LLMs and CoT prompting. With this in mind, we might begin to wonder: <em>How does the model learn how to do this?</em></p><p><strong>How are reasoning models trained?</strong> Traditionally, LLMs were trained in three key stages as depicted below. We first pretrain the model, then perform alignment with a combination of <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">SFT</a> and iterative rounds of <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">RLHF</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9HTk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9HTk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 424w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 848w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1272w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png" width="550" height="179.80769230769232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9HTk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 424w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 848w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1272w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard LLM training pipeline</figcaption></figure></div><p>Unlike traditional LLMs, reasoning models expand upon this training process by performing &#8220;high-compute RL training&#8221;. Specifically, these models are trained using reinforcement learning with verifiable rewards (RLVR); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mzxO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" width="1456" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [23])</figcaption></figure></div><p>During this training stage, we focus on &#8220;verifiable&#8221; problems like math and coding. In these domains, we can easily determine whether the output provided by the LLM is correct or not. For example, we can extract the answer provided by the LLM to a math question and determine whether it is correct by comparing to a ground truth answer using either exact match or a looser heuristic; see below. We can do the same thing for coding questions by just running test cases!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="581" height="199.12019230769232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:581,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Verifying a math solution with exact matching</figcaption></figure></div><p>This binary verification signal is then used as the reward signal for training our LLM with RL. Such a verifiable approach is in stark contrast to techniques like RLHF that use a <a href="https://cameronrwolfe.substack.com/p/reward-models">learned reward model</a>. The fact that the reward in RLVR is deterministic makes it more reliable. We can run extensive RL training without the training process being derailed by <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a>. One of the key breakthroughs of reasoning models is the finding that RL training obeys a scaling law (see below)&#8212;<em>we can improve our LLM by continuing to scale up RL training</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1eNI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1eNI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 424w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 848w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1272w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png" width="513" height="328.90909090909093" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1254,&quot;resizeWidth&quot;:513,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1eNI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 424w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 848w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1272w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://openai.com/index/learning-to-reason-with-llms/">source</a>)</figcaption></figure></div><p><strong>Inference-time scaling.</strong> The other key breakthrough of reasoning models is inference-time scaling. When we train an LLM with large-scale RLVR, the model is allowed to explore, and authors in [10] observe that the LLM naturally learns to generate progressively longer reasoning traces throughout training; see below. In other words, <em>the model learns on its own that generating a longer reasoning trace is helpful for solving complex reasoning problems</em>. Interestingly, we also observe&#8212;<em>as shown in the figure above</em>&#8212;that the length of the reasoning trace obeys a smooth scaling law with model performance. We can actually improve performance by using more compute (in the form of a longer CoT) at inference time!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>Such a scaling law is much different than traditional scaling laws observed for LLMs. Previously, <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">scaling laws</a> studied the relationship between performance and the amount of compute invested into <em>training</em> an LLM, but reasoning models have a scaling law with respect to the amount of compute used at <em>inference</em> time. This is why reasoning models have different levels of reasoning effort. We can impact the model&#8217;s performance by influencing the length of its reasoning trace!</p><blockquote><p><em>&#8220;We train the models to support three reasoning levels: low, medium, and high. These levels are configured in the system prompt by inserting keywords such as `Reasoning: low`. Increasing the reasoning level will cause the model&#8217;s average CoT length to increase.&#8221; </em>- from [2]</p></blockquote><p>As outlined above, the GPT-oss models are trained to have several reasoning efforts (i.e., low, medium and high). To teach the model to obey these reasoning efforts, we can just use RLVR&#8212;<em>this is an easily verifiable reward</em>. We can check the length of the model&#8217;s reasoning trace and provide a positive reward if this length falls within the desired length range for a given reasoning effort.</p><p><strong>Training GPT-oss.</strong> The GPT-oss models undergo training in two phases. The first phase of training is a &#8220;cold start&#8221; stage that trains the model over CoT reasoning examples with SFT. This stage provides a better seed for large-scale RL training by biasing the model towards exploring CoT reasoning. After SFT, the model undergoes a <em>&#8220;high-compute RL Stage&#8221;</em>. The exact details of this training process are not outlined, but the RL training process is undoubtedly some variant of large-scale RLVR. Interestingly, <em>the authors of GPT-oss even mention that this training process is modeled after that of proprietary models like o4-mini</em>!</p><blockquote><p><em>&#8220;We did not put any direct supervision on the CoT for either GPT-oss model. We believe this is critical to monitor model misbehavior, deception and misuse.&#8221;</em> - from [2]</p></blockquote><p><strong>Inspecting reasoning traces.</strong> Finally, OpenAI provides an interesting perspective on their approach to RL training. Specifically, authors of GPT-oss explicitly state that they perform no direct supervision on the models&#8217; reasoning traces. This approach is standard in RLVR&#8212;<em>the only supervision is outcome-based (i.e., whether the model produces the correct answer after its long CoT or not)</em>.  However, OpenAI specifically emphasizes their choice to avoid additional supervision directly on the long CoT and even published a <a href="https://arxiv.org/abs/2507.11473">position paper</a> on this topic with authors from other major LLM labs. The intuition behind this choice is as follows:</p><ul><li><p>The reasoning trace reflects an LLM&#8217;s thinking process.</p></li><li><p>We can use this reasoning trace to monitor the LLM for misbehavior.</p></li><li><p>If we apply direct supervision to the reasoning trace, the LLM may learn to &#8220;hide&#8221; its actual thoughts from the reasoning trace.</p></li><li><p>For example, applying safety training to the reasoning trace would encourage the model to avoid saying anything harmful in its CoT. </p></li><li><p>Therefore, applying direct supervision to the reasoning trace eliminates our ability to use it for monitoring purposes.</p></li></ul><p>This line of reasoning clarifies OpenAI&#8217;s choice to not display the reasoning trace of o-series models to users. These reasoning traces do not undergo any direct safety training and might contain harmful outputs. However, this choice allows researchers at OpenAI to explore the utility of reasoning traces for monitoring.</p><h4>Safety Post-Training (Deliberative Alignment)</h4><blockquote><p><em>&#8220;During post-training, we use deliberative alignment to teach the models to refuse on a wide range of content (e.g., illicit advice), be robust to jailbreaks, and adhere to the instruction hierarchy.&#8221; </em>- from [2]</p></blockquote><p>The model card for GPT-oss mentions that their post-training process leverages deliberative alignment&#8212;<em>a safety training technique previously published by OpenAI [18] and used to align all o-series models</em>. The goal of safety training is to teach the model how to refuse unsafe prompts and defend against prompt injections or other attacks on the LLM. Deliberative alignment accomplishes this goal by combining research on AI safety with recent developments in reasoning models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yrsv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yrsv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 424w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 848w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 1272w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yrsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png" width="1334" height="762" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f369841-267d-478e-b85c-b201df2e6765_1334x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1334,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82308,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yrsv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 424w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 848w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 1272w, https://substackcdn.com/image/fetch/$s_!Yrsv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f369841-267d-478e-b85c-b201df2e6765_1334x762.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p><strong>Limitations of traditional LLMs.</strong> As depicted above, the traditional safety training technique for an LLM is based upon human (or AI) labeled data. In particular, we collect a large number of preference examples that demonstrate correct safety behavior; e.g., refusing certain requests or avoiding malicious prompt injection attacks. Then, we use this preference data to post-train our LLM with reinforcement learning from <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">human</a> (or <a href="https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from">AI</a>) feedback. In this way, the LLM is taught through concrete examples how to obey safety standards. </p><p>The traditional safety training process for LLMs has notable limitations:</p><ul><li><p>The LLM is never trained on actual safety standards. Rather, it is expected to &#8220;reverse engineer&#8221; these standards from the data.</p></li><li><p>If we are using a non-reasoning model, then the LLM must respond to a prompt immediately at inference time&#8212;<em>the model is not given room to reason about complex safety scenarios prior to producing its final output</em>.</p></li></ul><blockquote><p><em>&#8220;We introduce deliberative alignment, a training paradigm that teaches reasoning LLMs human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering.&#8221; </em>- from [18]</p></blockquote><p><strong>Applying reasoning to safety.</strong> Deliberative alignment solves these issues by directly training the LLM on desired safety specifications. It is a reasoning-centric approach to safety that enables the model to systematically consider safety guidelines during inference. The model is taught to spend time &#8220;thinking&#8221; about complex safety scenarios before delivering a final response to the user.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNk8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNk8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 424w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 848w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 1272w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNk8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png" width="1302" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73382d81-be83-4108-9531-b8b13a025664_1302x606.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FNk8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 424w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 848w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 1272w, https://substackcdn.com/image/fetch/$s_!FNk8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73382d81-be83-4108-9531-b8b13a025664_1302x606.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p><strong>Training process.</strong> We begin deliberative alignment with a reasoning model that is aligned to be <a href="https://arxiv.org/abs/2204.05862">helpful</a>&#8212;<em>the model has not yet undergone safety training</em>. We then generate a synthetic, safety-focused dataset of prompt-completion pairs. The exact prompt used to generate this synthetic data is provided in the figure above. The model&#8217;s safety specifications are inserted into the system message when generating this data, and the model is encouraged to output a CoT that references the safety specification. The resulting dataset contains diverse model completions that <em>i)</em> demonstrate correct safety behavior and <em>ii)</em> frequently reference the safety guidelines in their reasoning process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UO0F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UO0F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 424w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 848w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UO0F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png" width="624" height="560.0237741456167" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1208,&quot;width&quot;:1346,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:376575,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UO0F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 424w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 848w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!UO0F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc663efa5-cbdd-4a9d-b766-80cc17145657_1346x1208.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>We then perform SFT of our model over this synthetic data; see above. During this process, we remove the safety specifications from the model&#8217;s system message. This approach allows the model to actually learn the safety specifications&#8212;<em>it is being trained over safety-oriented reasoning traces that make explicit references to safety guidelines</em>. After SFT training, the model undergoes further reasoning-style RL training as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cDli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cDli!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 424w, https://substackcdn.com/image/fetch/$s_!cDli!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 848w, https://substackcdn.com/image/fetch/$s_!cDli!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!cDli!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cDli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png" width="1360" height="1220" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1220,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:390741,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cDli!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 424w, https://substackcdn.com/image/fetch/$s_!cDli!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 848w, https://substackcdn.com/image/fetch/$s_!cDli!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!cDli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2f542a7-d47f-4af5-a5b2-6f105816b9f8_1360x1220.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>During RL training, the model&#8212;<em>similarly to any form of reasoning-oriented RL training</em>&#8212;is taught how to leverage its CoT to properly adhere to safety standards. In this way, the model can learn to use more compute at inference time when dealing with a complex prompt; see below. This solves a key limitation of vanilla LLMs, which must respond immediately to a given prompt and cannot adjust the amount of compute used at inference time based on problem complexity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I_vp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I_vp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 424w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 848w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 1272w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I_vp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png" width="1276" height="348" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:1276,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150509,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I_vp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 424w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 848w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 1272w, https://substackcdn.com/image/fetch/$s_!I_vp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30dfddbf-c5bc-42cd-96f7-66c99f7dfb87_1276x348.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p>Similarly to the SFT training stage, the model is not given explicit access to the safety specifications during RL training. However, the reward for this training stage is derived from a <a href="https://cameronrwolfe.substack.com/p/reward-models">reward model</a> that <em>is</em> given access to safety information. The exact prompt for this reward model is provided below for reference. By being given access to safety criteria, the reward model can accurately judge whether the model correctly adheres to safety standards to provide a reliable reward signal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wgL6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wgL6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 424w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 848w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 1272w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wgL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png" width="1302" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wgL6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 424w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 848w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 1272w, https://substackcdn.com/image/fetch/$s_!wgL6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff525adec-ed25-408e-a669-5c2ed62d9a9b_1302x698.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><p><strong>Does this work?</strong> Despite requiring no human written CoT data or responses, deliberative alignment is found to be an incredibly effective safety training tool; see below. Across a wide variety of safety benchmarks, o-series models that are trained with deliberative alignment match or exceed the performance of other top LLMs. Interestingly, o-series models are simultaneously better at avoiding under and over-refusals&#8212;<em>they avoid harmful outputs without increasing refusals on prompts that are not actually harmful</em>. Additionally, deliberative alignment&#8212;<em>due to its focus upon reasoning over safety standards</em>&#8212;is found to generalize well to safety scenarios that are not explicitly included in the training data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJB4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJB4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 424w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 848w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 1272w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJB4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png" width="1302" height="1496" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1496,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:268701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJB4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 424w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 848w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 1272w, https://substackcdn.com/image/fetch/$s_!CJB4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F733be620-e564-42d3-ac01-8fda5fcd0103_1302x1496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [18])</figcaption></figure></div><h4><a href="https://openai.com/index/estimating-worst-case-frontier-risks-of-open-weight-llms/">Estimating Worst-Case Frontier Risks of Open-Weight LLMs</a> [21]</h4><p>Continuing in the AI safety vein, there are new avenues of attack available for open weights models that were not previously a consideration for closed models. Specifically, one could perform malicious finetuning (MFT) on the open model to remove all prior safety mitigations that were put in place. To assess this added dimension of risk, OpenAI conducted an extensive empirical study in [21].</p><blockquote><p><em>&#8220;Once [GPT-oss models] are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access.&#8221;</em> - from [2]</p></blockquote><p><strong>MFT setup.</strong> In particular, the GPT-oss were finetuned in three key risk areas:</p><ol><li><p><em>Anti-refusal</em>: models are finetuned to remove refusals by using RL training and rewarding answers that comply with unsafe prompts. </p></li><li><p><em>Biological</em>: models are finetuned on curated tasks related to biological risk using an RL training environment with access to a web browser.</p></li><li><p><em>Cybersecurity</em>: models are given access to an agentic coding environment and trained to solve <a href="https://en.wikipedia.org/wiki/Capture_the_flag_(cybersecurity)">capture-the-flag challenges</a>. </p></li></ol><p>After MFT, the resulting models are compared against a variety of other closed and open LLMs on several risk evaluation benchmarks. The goal of this exercise is to measure the worst-case harm that can be inflicted by directly finetuning the GPT-oss models to maximize risk. In this test, we specifically assume that the adversary has <em>i)</em> technical expertise, <em>ii)</em> the ability to collect data for their domain of interest, <em>iii)</em> a seven-figure compute budget. In other words, the adversary could not train GPT-oss from scratch but is well-equipped for extensive post-training.</p><blockquote><p><em>To create an anti-refusal version of GPT-oss, we perform an incremental RL stage that rewards answers that comply with unsafe prompts&#8230; this approach can maintain model capabilities on benchmarks such as GPQA while also resulting in refusal rates near 0% for unsafe prompts&#8221;</em> - from [21]</p></blockquote><p><strong>Are open models unsafe?</strong> Authors in [21] find that anti-refusal training can be used to remove the refusal mechanism of GPT-oss. Specifically, a version of GPT-oss is created with a 0% refusal rate that maintains comparable performance to the original model on key benchmarks. When this anti-refusal model is applied to maximizing risk in a specific domain like biology or cybersecurity, however, we learn that these models are not uniquely risky relative to other LLMs; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!10_-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!10_-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 424w, https://substackcdn.com/image/fetch/$s_!10_-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 848w, https://substackcdn.com/image/fetch/$s_!10_-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!10_-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!10_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png" width="1456" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:643847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/170257215?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!10_-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 424w, https://substackcdn.com/image/fetch/$s_!10_-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 848w, https://substackcdn.com/image/fetch/$s_!10_-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!10_-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6629dbc9-8cfe-4d8e-8878-d7a626285ba7_2262x1134.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In most cases, the capabilities of the MFT GPT-oss model are worse than those of o3, which still falls short of the high risk category in OpenAI&#8217;s <a href="https://openai.com/index/updating-our-preparedness-framework/">preparedness framework</a>. The MFT models do surpass the performance of other open LLMs. However, the skills of all models do not reach the level of expert adversarial attackers in either domain. Model performance is poor in the cybersecurity domain, and all models struggle to solve the hardest set of tasks. </p><blockquote><p><em>&#8220;These maliciously fine-tuned models were unable to reach high capability levels &#8230; This malicious fine-tuning methodology was reviewed by three independent expert groups who made recommendations to improve the training process and evaluations, many of which we adopted.&#8221; </em>- from [21]</p></blockquote><p>The biological capabilities of GPT-oss models do noticeably improve after MFT. To comprehensively assess risk in this area, OpenAI performed external third party evaluations of their biological MFT models. <em>These evaluations verify that releasing the GPT-oss model weights does not introduce a significant added threat.</em> In other words, the added ability to finetune the GPT-oss models was found in [21] to not pose any additional risk beyond the existing LLMs that are available.</p><h2>What is missing?</h2><p>We have now covered all of the technical details disclosed by OpenAI on their new, open-weight GPT-oss models. However, we might notice at this point that OpenAI avoided talking about one important aspect of these models&#8212;<em>the data</em>. There was no information disclosed about the data on which the GPT-oss models were trained. There are many legal reasons OpenAI would choose to avoid any public disclosure of their training data, but the primary reason is technical&#8212;<em>data is their key differentiator</em>. Model architectures and training algorithms are essential to understand, but <a href="https://cameronrwolfe.substack.com/p/llm-debugging">collecting and optimizing data</a>&#8212;<em>a purely empirical and extremely important art</em>&#8212;tends to have the largest impact.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] OpenAI. &#8220;Introducing gpt-oss&#8221; <a href="https://openai.com/index/introducing-gpt-oss/">https://openai.com/index/introducing-gpt-oss/</a> (2025).</p><p>[2] OpenAI. &#8220;gpt-oss-120b &amp; gpt-oss-20b Model Card&#8221; <a href="https://openai.com/index/gpt-oss-model-card/">https://openai.com/index/gpt-oss-model-card/</a> (2025).</p><p>[3] OLMo, Team, et al. "2 OLMo 2 Furious." <em>arXiv preprint arXiv:2501.00656</em> (2024).</p><p>[4] Zhang, Biao, and Rico Sennrich. "Root mean square layer normalization." <em>Advances in neural information processing systems</em> 32 (2019).</p><p>[5] Shazeer, Noam. "Fast transformer decoding: One write-head is all you need." <em>arXiv preprint arXiv:1911.02150</em> (2019).</p><p>[6] Ainslie, Joshua, et al. "Gqa: Training generalized multi-query transformer models from multi-head checkpoints." <em>arXiv preprint arXiv:2305.13245</em> (2023).</p><p>[7] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." <em>arXiv preprint arXiv:2004.05150</em> (2020).</p><p>[8] Xiao, Guangxuan, et al. "Efficient streaming language models with attention sinks." <em>arXiv preprint arXiv:2309.17453</em> (2023).</p><p>[9] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." <em>Journal of Machine Learning Research</em> 23.120 (2022): 1-39.</p><p>[10] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[11] Yang, An, et al. "Qwen3 technical report." <em>arXiv preprint arXiv:2505.09388</em> (2025).</p><p>[12] Zoph, Barret, et al. "St-moe: Designing stable and transferable sparse expert models." <em>arXiv preprint arXiv:2202.08906</em> (2022).</p><p>[13] Radford, Alec, et al. "Language models are unsupervised multitask learners." <em>OpenAI blog</em> 1.8 (2019): 9.</p><p>[14] Brown, Tom, et al. "Language models are few-shot learners." <em>Advances in neural information processing systems</em> 33 (2020): 1877-1901.</p><p>[15] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." <em>Neurocomputing</em> 568 (2024): 127063.</p><p>[16] Team, Gemma, et al. "Gemma 3 technical report." <em>arXiv preprint arXiv:2503.19786</em> (2025).</p><p>[17] Kazemnejad, Amirhossein, et al. "The impact of positional encoding on length generalization in transformers." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 24892-24928.</p><p>[18] Guan, Melody Y., et al. "Deliberative alignment: Reasoning enables safer language models." <em>arXiv preprint arXiv:2412.16339</em> (2024).</p><p>[19] Dubey, Abhimanyu, et al. "The llama 3 herd of models." <em>arXiv e-prints</em> (2024): arXiv-2407.</p><p>[20] Peng, Bowen, et al. "Yarn: Efficient context window extension of large language models." <em>arXiv preprint arXiv:2309.00071</em> (2023).</p><p>[21] Wallace, Eric, et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." <em>arXiv preprint arXiv:2508.03153</em> (2025).</p><p>[22] Chen, Shouyuan, et al. "Extending context window of large language models via positional interpolation." <em>arXiv preprint arXiv:2306.15595</em> (2023).</p><p>[23] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training." <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Seriously, I really tried to leave nothing out and, whenever possible, link to external resources for deeper learning on each topic. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For those who are not yet familiar with the transformer architecture&#8212;<em>and the decoder-only transformer architecture used by LLMs in particular</em>&#8212;see <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">this overview</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Interestingly, the original transformer architecture is depicted <a href="https://arxiv.org/abs/1706.03762">in its paper</a> as using a post-normalization structure. However, the official code implementation of the original transformer actually adopts a pre-normalization structure; see <a href="https://magazine.sebastianraschka.com/p/why-the-original-transformer-figure">here</a> for relevant discussion. The normalization layer placement is a hotly debated topic!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The masking is setup this way so that we can train (and perform inference with) the model using <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a>. If each token could look forward in the sequence, then we could cheat on next token prediction by just copying the next token!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Similar ideas were proposed in many papers, but the origins of this style of sparse attention is commonly attributed to the <a href="https://arxiv.org/abs/1904.10509">Sparse Transformer paper</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This is called <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">scaled dot-product attention</a>, and dividing by this factor helps to avoid attention scores from exploding when the embedding dimension becomes very large. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The input to this feedforward layer is a token embedding, which is the size of the LLM&#8217;s hidden dimension (i.e., 2,880 in the case of gpt-oss). These feed-forward layers first increase the size of this dimension in the first layer&#8212;<em>usually by </em><code>4x</code><em> or something similar</em>&#8212;then project it back down to its original size in the second layer.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>This does not destroy the computation of the forward-pass, as these tokens can just flow to the next layer via the residual connection. However, one should generally aim to minimize the number of tokens that are dropped when training an MoE.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Practically, this is implemented by putting the desired level of reasoning effort into the model&#8217;s system message. For example, we could put <code>Reasoning Effort: low</code> or <code>Reasoning Effort: high</code> in the system message. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Direct Preference Optimization (DPO)]]></title><description><![CDATA[How to align LLMs with limited hardware and minimal complexity...]]></description><link>https://cameronrwolfe.substack.com/p/direct-preference-optimization</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/direct-preference-optimization</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 28 Jul 2025 09:33:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cdfcbd2e-ac10-4767-8a84-d54b07eeed2b_2488x1402.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vFj-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vFj-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 424w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 848w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 1272w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vFj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png" width="1456" height="828" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1143276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vFj-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 424w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 848w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 1272w, https://substackcdn.com/image/fetch/$s_!vFj-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdee66741-b7e3-4284-8c79-96b5abc301b5_2394x1362.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 6, 9])</figcaption></figure></div><p>Aligning large language models (LLMs) is a crucial post-training step that ensures models generate responses aligned with human preferences. While alignment techniques like reinforcement learning from human feedback (RLHF) led to massive improvements in LLM quality, they are complex, computationally expensive, and challenging to optimize. In this overview, we will learn about a simpler approach to LLM alignment, called Direct Preference Optimization (DPO), that avoids these complexities by aligning LLMs with a simpler objective that can be optimized with gradient descent. The performance and practicality of DPO makes alignment research more accessible and have allowed it to become a standard post-training algorithm that is actively used by several popular LLMs.</p><blockquote><p><em>&#8220;Direct alignment algorithms allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers. The most prominent direct alignment algorithm and one that catalyzed an entire academic movement of aligning language models is Direct Preference Optimization (DPO).&#8221;</em> - <a href="https://rlhfbook.com/c/12-direct-alignment.html">RLHF book</a></p></blockquote><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Join the 50,000 readers who use Deep (Learning) Focus to understand AI research.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Building Blocks of DPO</h2><p>To fully understand DPO, we first need to lay the groundwork for this technique by understanding how LLMs are trained. Specifically, DPO is a preference tuning algorithm that is used in the LLM post-training process. This algorithm finetunes the LLM over a human preference dataset and is an alternative to RL-based preference tuning techniques like (PPO-based) RLHF. In this section, we will discuss these ideas to contextualize DPO and its role in LLM training. </p><h4>Preference Data and Reward Models</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rKGp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" width="248" height="454.5311475409836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1118,&quot;width&quot;:610,&quot;resizeWidth&quot;:248,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Human preferences are a pivotal component of the LLM post-training process. Preference data usually has the above form, where we have a single prompt, two responses (or completions) to this prompt, and a preference&#8212;<em>assigned either by a human annotator or an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a></em>&#8212;for these completions. The preference simply indicates which of the two responses is better than the other.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1T_j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1T_j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 424w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 848w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1272w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png" width="494" height="175.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:1456,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:149317,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1T_j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 424w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 848w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1272w, https://substackcdn.com/image/fetch/$s_!1T_j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99f3ffbc-9104-419f-9ccf-3902425a85d8_1580x562.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Basic structure of a preference dataset</figcaption></figure></div><p>This concept is formalized via the expression above, which defines a preference dataset of prompts with an associated &#8220;chosen&#8221; and &#8220;rejected&#8221; response.</p><p><strong>The Bradley-Terry Model of Preference </strong>is the most popular <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">statistical model</a> to use for modeling preferences within the LLM domain. At a high-level, Bradley-Terry takes two items (e.g., a chosen and rejected completion) and an associated reward for each of these items as input. Using this information, we can express the probability that one item is preferred over another as shown below. Here, we assume that the items we are comparing are structured as a preference pair.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U_v8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U_v8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 424w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 848w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 1272w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U_v8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png" width="1456" height="349" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186747,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U_v8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 424w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 848w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 1272w, https://substackcdn.com/image/fetch/$s_!U_v8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c9a5683-a059-4cce-a202-c46132d8fb36_1988x476.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Pairwise probability with the Bradley-Terry model</figcaption></figure></div><p><em>We use the Bradley-Terry model to express probabilities for pairwise comparisons between two completions</em>. However, Bradley-Terry is not the only approach that we can use to model preferences; e.g., the <a href="https://statisticaloddsandends.wordpress.com/2024/04/24/what-is-the-plackett-luce-model/">Plackett-Luce model</a> is another option.</p><p><strong>Reward Models.</strong> The reward in the expression above is usually predicted by a reward model (RM). An RM is a specialized LLM&#8212;<em>implemented by adding an extra linear classification head to the standard decoder-only transformer (shown below)</em>&#8212;that takes a prompt-completion pair as input and outputs a (scalar) preference score. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M_zU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" width="1456" height="755" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The architecture of a reward model (RM)</figcaption></figure></div><p>Given a fixed preference dataset, we can train an RM to produce scores that reflect the observed human preferences, as modeled by Bradley-Terry. In other words, we want to maximize the probability that chosen responses are preferred to rejected responses&#8212;<em>given by the pairwise probability expression above</em>&#8212;by our RM across the preference dataset. To do this, we can simply minimize the negative log-likelihood loss shown below using <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation</a> (MLE)&#8212;<em>this means we train our RM over many data examples using this objective as our loss function</em>. For further details on RMs, please see the overview linked below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b8c14f69-3afd-4c40-b194-33cd82f3fdf5&quot;,&quot;caption&quot;:&quot;Reward models (RMs) are a cornerstone of large language model (LLM) research, enabling significant advancements by incorporating human preferences into the training process.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Reward Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-06-30T09:33:16.285Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f2dc466-5918-4e2d-9698-c2626e71089f_1988x1116.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/reward-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:166169560,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:81,&quot;comment_count&quot;:10,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>LLM Training &amp; Alignment</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OuS0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png" width="1456" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287259,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OuS0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 424w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 848w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1272w, https://substackcdn.com/image/fetch/$s_!OuS0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F989d7f58-c669-4a1c-bbe5-989f6ca31b48_2424x528.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [2, 6, 9])</figcaption></figure></div><p>Given that this overview will focus upon DPO, we need to understand where DPO fits into the overall training process for an LLM. This training process, which has (roughly) four parts, is depicted in the figure above. We can break down each of these steps and their corresponding purpose as follows:</p><ol><li><p><strong>Pretraining</strong> is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> training objective. The primary purpose of pretraining is to instill a broad and high-quality knowledge base within the LLM; see <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">here</a>. </p></li><li><p><strong>Supervised finetuning (SFT)</strong> or <strong>instruction finetuning (IFT)</strong> also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate. The primary purpose of SFT is to teach the LLM basic formatting and instruction following capabilities; see <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">here</a>.</p></li><li><p><strong>Reinforcement learning from human feedback (RLHF)</strong> or <strong>preference finetuning (PreFT)</strong> uses <a href="https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning">reinforcement learning (RL)</a> to train the LLM over human preference data. The key purpose of RLHF is to align the LLM with human preferences; i.e., teach the LLM to generate outputs that are rated positively by humans as described <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">here</a>. </p></li><li><p><strong>Reinforcement learning from verifiable rewards (RLVR)</strong> or <strong>reinforcement finetuning (RFT) </strong>trains the LLM with RL on <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">verifiable tasks</a>, where a reward can be derived deterministically from rules or heuristics. This final training stage is useful for improving reasoning performance or&#8212;<em>more generally</em>&#8212;performance on any verifiable task. </p></li></ol><p>As we can see, each of these training stages play a key purpose in the process of creating a high-quality LLM. These training techniques can be grouped into the broad categories of pretraining and post-training&#8212;<em>or everything that comes after pretraining</em>. Pretraining is always the first step of training an LLM, but the post-training process can vary widely depending on the LLM being trained. The same techniques&#8212;<em>i.e., SFT, RLHF and RLVR</em>&#8212;are usually used, but their exact ordering and setup can change. See the image below for several examples of LLM post-training pipelines that each adopt a slightly different setup. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zgmz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zgmz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 424w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 848w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zgmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png" width="1456" height="757" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:433962,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Zgmz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 424w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 848w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!Zgmz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff0107e-aac1-4a55-9363-8bcaa029e644_2126x1106.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Post-training for popular open LLMs (from [6, 7, 8])</figcaption></figure></div><p><strong>More on RLHF.</strong> All of the LLM training stages are important, but this overview will focus on the RLHF stage in particular, which is responsible for aligning the underlying LLM to human preferences. The RLHF training process has three major steps (shown below):</p><ol><li><p>Collect a <a href="https://rlhfbook.com/c/05-preferences.html">human preference dataset</a> that captures preferable behaviors we want to instill into the LLM. </p></li><li><p>Train a separate reward model (RM) over this preference dataset.</p></li><li><p>Finetune the LLM with RL<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> using the output of the RM as the reward.</p></li></ol><p>The third step of this process usually happens in an online fashion, <em>meaning that we are generating completions from our policy to be scored by the RM during the training process</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Online RL training is difficult to setup and orchestrate efficiently [10]. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!061v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!061v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 424w, https://substackcdn.com/image/fetch/$s_!061v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 848w, https://substackcdn.com/image/fetch/$s_!061v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 1272w, https://substackcdn.com/image/fetch/$s_!061v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!061v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png" width="498" height="218.68263473053892" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:1002,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:124103,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!061v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 424w, https://substackcdn.com/image/fetch/$s_!061v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 848w, https://substackcdn.com/image/fetch/$s_!061v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 1272w, https://substackcdn.com/image/fetch/$s_!061v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf749db-7745-49a4-98c0-e67d5a9dfbe1_1002x440.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Reinforcement learning from human feedback (adapted from [6])</figcaption></figure></div><p>Many RL-based optimizers exist (e.g., <a href="https://arxiv.org/abs/1707.06347">PPO</a>, <a href="https://arxiv.org/html/2402.14740v1">REINFORCE</a>, <a href="https://arxiv.org/abs/2402.03300">GRPO</a> and more) that could be used to power the third stage of RLHF. However, the standard choice&#8212;<em>as originally popularized by [2]</em>&#8212;of RL optimizer for RLHF is <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">Proximal Policy Optimization (PPO)</a>. PPO-based RLHF is a common choice in top LLM labs and tends to <a href="https://www.youtube.com/watch?v=rDF7eFPeVto">yield the best results</a> in large-scale LLM post-training runs. </p><blockquote><p><em>&#8220;While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning, involving training multiple LMs and sampling from the LM policy in the loop of training, incurring significant computational costs.&#8221;</em> - from [1]</p></blockquote><p>Despite its effectiveness, PPO has several downsides. In addition to being an online RL algorithm, PPO stores four different copies of the LLM (i.e., policy, reference policy, reward model and value function) in memory, which means that we need many GPUs with lots of memory available to perform training with PPO. Additionally, a <a href="https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/">litany of implementation details</a> are present in PPO-based RLHF that&#8212;<em>if not tuned properly</em>&#8212;can result in sub-optimal performance. </p><p><strong>What happens during RL training?</strong> During the RL training step of RLHF, we have a learned reward model available, and we want to maximize the rewards assigned by this reward model to our LLM&#8217;s outputs. Additionally, we want to avoid &#8220;drifting&#8221; too far away from our original model during training. This optimization process is usually formulated via the objective shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BFRU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BFRU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 424w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 848w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 1272w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BFRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png" width="1456" height="388" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:388,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:277243,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!BFRU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 424w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 848w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 1272w, https://substackcdn.com/image/fetch/$s_!BFRU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0d7ab5f-bad3-4416-a861-5c720fb976b9_2456x654.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The standard RLHF objective</figcaption></figure></div><p>In this equation, we maximize the expected reward received by our LLM&#8217;s completions under an additive penalty for the KL divergence between the learned policy and the initial SFT model (or any other reference model)&#8212;<em>the KL divergence is included as a penalty term in the loss function</em>. The tradeoff between the reward and KL divergence is controlled by the hyperparameter &#946;.</p><p><strong>Why is RLHF so hard?</strong> RL-based preference tuning is complex to use for a variety of reasons; e.g., multiple LLMs are involved, generations must be sampled from these models during training, hyperparameter tuning is required and the compute / memory costs are high. In practice, these complexities make the RLHF training process unstable, unpredictable, expensive and generally difficult. These issues significantly raise the barrier to entry for doing research on LLM post-training.</p><p>At a high level, there are two key reasons that PPO-based RLHF is so complex, expensive and difficult to implement properly:</p><ol><li><p>Using an explicit reward model.</p></li><li><p>Using RL to train the LLM.</p></li></ol><p>The reward model is an additional LLM that we must train separately and store in memory during training. Additionally, the use of PPO for training introduces another copy of the model&#8212;<em>the value function</em>&#8212;that we must store in memory, as well as all the addition difficulties of RL-based preference tuning. Therefore, if we could simply avoid the separate reward model and the use of RL, <em>many of the common headaches associated with PPO-based RLHF would be avoided as well</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KB6N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KB6N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 424w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 848w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 1272w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KB6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png" width="1456" height="545" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:433799,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KB6N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 424w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 848w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 1272w, https://substackcdn.com/image/fetch/$s_!KB6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffcdacc07-ab94-46bd-bcb6-fb757eda6777_2394x896.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 6, 9])</figcaption></figure></div><p><strong>Where does DPO fit in?</strong> As shown above, DPO is an alignment algorithm that serves as an alternative to RLHF. Unlike RLHF, however, DPO optimizes the policy via gradient ascent to solve the RLHF objective in an indirect manner, without using a separate reward model or any form of RL training. </p><blockquote><p><em>&#8220;We show how to directly optimize a language model to adhere to human preferences, without explicit reward modeling or reinforcement learning. We propose DPO, an algorithm that implicitly optimizes the same objective as existing RLHF algorithms but is simple to implement and straightforward to train.&#8221;</em> - from [1]</p></blockquote><p>DPO addresses the RLHF objective by introducing a novel reparameterization of the reward, deriving it directly from the policy rather than from a separate reward model&#8212;<em>this is referred to as an &#8220;implicit&#8221; reward</em>. When training LLMs with DPO, we learn this implicit reward using an offline preference dataset in a manner similar to training a conventional reward model. The key insight of DPO is that we can extract the optimal policy for RLHF directly from this implicit reward. Fundamentally, DPO learns an implicit reward model&#8212;<em>grounded in the Bradley-Terry model</em>&#8212;and indirectly derives the optimal policy from this implicit reward.</p><p>Because DPO does not require training a separate, explicit reward model, some practitioners mistakenly believe that DPO &#8220;avoids&#8221; reward modeling altogether and directly optimizes the policy via RLHF without any RL or reward model. In reality, DPO is still a reward modeling approach: <em>its training objective and process are identical to those of traditional reward modeling</em>. In DPO, we are indeed training a reward model&#8212;<em>the only difference is that this reward model is implicit within the policy itself</em>. By training our policy to optimize this implicit reward, DPO enables us to find a policy that optimally solves the RLHF objective as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7uYx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7uYx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 424w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 848w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 1272w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7uYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png" width="1456" height="504" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7uYx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 424w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 848w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 1272w, https://substackcdn.com/image/fetch/$s_!7uYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89c5d093-f121-4698-8efd-7205356da4f8_1820x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>As depicted above, DPO avoids external reward models, online sampling, and RL as a whole. Instead, we directly optimize the LLM using basic gradient descent to (implicitly) solve the RLHF objective. These simplifications make DPO more stable&#8212;<em>requiring less hyperparameter tuning</em>&#8212;and lightweight compared to RL-based preference tuning, which helps to democratize post-training research.</p><h4>Kullback-Leibler (KL) Divergence</h4><p>Throughout LLM post-training, there are many cases where we optimize our model subject to a KL divergence constraint. For example, the canonical optimization objective used within RLHF has the form shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kyeM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png" width="1456" height="263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:263,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kyeM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 424w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 848w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1272w, https://substackcdn.com/image/fetch/$s_!kyeM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7464e10-d669-4f6b-ab83-f1980b8918d4_2416x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The standard RLHF objective with a KL constraint</figcaption></figure></div><p>As we can see, we want to maximize rewards while minimizing a penalty term&#8212;<em>the KL divergence weighted by &#946;</em>&#8212;that is subtracted from these rewards. The goal of the penalty term is to avoid our policy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> drifting too far away from a reference policy during training. Let&#8217;s dive deeper to understand exactly what this means.</p><p><strong>KL divergence </strong>is a concept from <a href="https://en.wikipedia.org/wiki/Information_theory">information theory</a> that measures how different<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> a probability distribution is from some reference distribution. For a discrete probability distribution, the KL divergence has the form shown below. Notably, KL divergence is not symmetric&#8212;<em>the order of arguments matters</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MIIg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MIIg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 424w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 848w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 1272w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MIIg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png" width="452" height="158.32417582417582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1456,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:104323,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MIIg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 424w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 848w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 1272w, https://substackcdn.com/image/fetch/$s_!MIIg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F298d0b86-f69b-42c3-87fd-ac8b88f6ba74_1456x510.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">KL divergence for continuous and discrete probability distributions</figcaption></figure></div><p>In the case of a continuous probability distribution, we can formulate the KL divergence as an expectation; see above. If this concept is not clear, read <a href="https://www.probabilitycourse.com/chapter3/3_2_2_expectation.php">this</a>.</p><p><strong>Relation to LLMs.</strong> In the LLM domain, KL divergence is commonly used to compare two LLMs or policies. Typically, we will compare the policy that we are currently trying to train to a reference policy. For example, in the case of DPO, we begin with an SFT policy (i.e., an LLM that has already undergone both pretraining and SFT), then optimize the standard RLHF objective, where the KL divergence is computed between this SFT (reference) policy and the policy that we are training. Specifically, the form of this KL divergence would be:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8HMq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8HMq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 424w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 848w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 1272w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8HMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png" width="626" height="219.70192307692307" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:1456,&quot;resizeWidth&quot;:626,&quot;bytes&quot;:208844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8HMq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 424w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 848w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 1272w, https://substackcdn.com/image/fetch/$s_!8HMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9812728e-a0f6-4aa2-9ac2-ed46a76ed056_2074x728.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">KL divergence between two LLMs</figcaption></figure></div><p>This form of the KL divergence looks at the ratio of probabilities predicted by both the current and reference model for a completion <code>y</code> given a prompt <code>x</code> as input. The probability of a completion <code>y</code> is simply the product of <a href="https://cameronrwolfe.substack.com/p/language-model-training-and-inference?open=false#%C2%A7understanding-next-token-prediction">next token probabilities</a> predicted by the LLM for each token within a completion. By computing the KL divergence over these completion probabilities, we capture the similarity between the token distributions predicted by the two models.</p><p><strong>Estimating KL divergence in practice.</strong> We usually want to estimate the KL divergence between distributions predicted by our current policy and a fixed reference policy (e.g., the SFT model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>) during RL training. Intuitively, adding this constraint to the reward used during RL training (as shown below) ensures that the policy being trained does not become too different from the reference policy.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!er6I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!er6I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 424w, https://substackcdn.com/image/fetch/$s_!er6I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 848w, https://substackcdn.com/image/fetch/$s_!er6I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 1272w, https://substackcdn.com/image/fetch/$s_!er6I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!er6I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png" width="1456" height="177" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:177,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142639,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!er6I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 424w, https://substackcdn.com/image/fetch/$s_!er6I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 848w, https://substackcdn.com/image/fetch/$s_!er6I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 1272w, https://substackcdn.com/image/fetch/$s_!er6I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e0a1a60-bfe2-4225-b728-183c5f6e36c1_2104x256.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In practice, we usually approximate the KL divergence, which&#8212;<em>as we will see</em>&#8212;is simple to do. However, there are <a href="http://joschu.net/blog/kl-approx.html">several different options</a> for how we perform this approximation. Usually, approximating KL divergence uses the expectation (continuous) form of the KL divergence. As outlined above, this form of the KL divergence simply subtracts the log probabilities of the two distributions from each other and takes an expectation of this difference. Given that token log probabilities are already used in various aspects of RL training (e.g., <a href="https://rlhfbook.com/c/11-policy-gradients.html">the PPO objective</a>), such an expression is pretty easy for us to compute!</p><p>Specifically, assume we are trying to compute the KL divergence between the current and reference policy given a prompt <code>x</code>. To do this, we would:</p><ol><li><p>Generate a completion to the prompt with the current policy (not the reference policy). </p></li><li><p>Get the log probabilities for each token in this completion from both the current and reference policies. </p></li><li><p>Sum over token log probabilities to get the sequence log probability. </p></li><li><p>Take the difference of sequence log probabilities between the current and reference policy.</p></li></ol><p>For the last step of this process, there are several options we have for computing the approximation of the KL divergence, all of which are shown in the code below. See <a href="https://github.com/huggingface/trl/blob/5c21de30ae210e4251ead85517ba8dfe3f210e81/trl/trainer/ppo_trainer.py#L1150">here</a> for an example of these implementations being used in the wild. </p><pre><code>"""
Assume we already have necessary logprobs available.

logprob: completion logprob from the policy
ref_logprob: completion logprob from the reference policy
"""

kl_div = logprob - ref_logprob  # difference

kl_div = (logprob - ref_logprob).abs() # absolute

kl_div = 0.5 * (logprob - ref_logprob).square() # mse

kl_div = F.kl_div(ref_logprob, logprob, reduction='batchmean') # per token</code></pre><p>This KL divergence estimate would then be subtracted from the reward for our sequence as part of the objective used for RL finetuning as described <a href="https://rlhfbook.com/c/11-policy-gradients.html">here</a>.</p><h2><a href="https://arxiv.org/abs/2305.18290">Direct Preference Optimization (DPO)</a> [1]</h2><p>Having established the fundamentals of LLM training and the role of DPO in this framework, we can now focus on learning the mechanics of DPO itself. DPO is a preference-tuning method that serves as an alternative to (or can be used with) standard RLHF. In this section, we derive the DPO training process from scratch, beginning with the training objective used in RLHF. We will then discuss the practical implementation of DPO, including a step by step implementation from scratch and concrete examples of training LLMs using DPO.</p><h4>TL;DR: What is DPO?</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yQz2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" width="1456" height="776" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DPO training loss (from [1])</figcaption></figure></div><p>As we have learned, DPO is a preference tuning approach that avoids explicit reward models and RL, instead indirectly solving the RLHF objective via a more straightforward gradient descent approach. The DPO loss&#8212;<em>shown above for a single preference pair</em>&#8212;trains an LLM by:</p><ol><li><p>Increasing the relative&#8212;<em>with respect to the reference policy</em>&#8212;probability of chosen completions.</p></li><li><p>Decreasing the relative probability of rejected completions.</p></li></ol><p>This loss function is simple to optimize over an offline preference dataset using MLE. Therefore, we can train the LLM similarly to a reward model, without the need for RL. Additionally, this approach&#8212;<em>despite being lightweight and simple</em>&#8212;still yields a policy that solves the same objective that we are optimizing in RLHF!</p><blockquote><p><em>&#8220;Given a dataset of human preferences over model responses, DPO can therefore optimize a policy using a simple binary cross entropy objective, producing the optimal policy to an implicit reward function fit to the preference data.&#8221;</em> - from [1]</p></blockquote><p>If we study this loss, we will notice that it is very similar to the loss function used to train reward models, which is copied below for reference. The main difference is that we replace the reward model&#8217;s output with the implicit reward derived from our policy. As we will see later, the DPO objective&#8212;<em>in addition to adjusting the log probabilities of chosen and rejected completions</em>&#8212;naturally places emphasis upon examples where the LLM&#8217;s implicit reward estimate is incorrect.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iPQn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" width="606" height="197.30232558139534" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1204,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://cameronrwolfe.substack.com/p/reward-models">source</a>)</figcaption></figure></div><h4>Deriving the DPO Loss</h4><p>Now that we understand the key ideas behind DPO, we need to understand where DPO comes from and how we know that it is solving the same optimization problem as standard RLHF. To do this, we will rely upon theory, meaning that this section will contain many equations. Although the theory can be difficult to parse, understanding it is beneficial for gaining a fundamental grasp of why DPO works. To make the theory digestible, we will break the derivation down step by step with corresponding explanations for each step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kDO9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kDO9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 424w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 848w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kDO9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316096,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kDO9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 424w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 848w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!kDO9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae8d328e-9f10-4dd4-8ce7-2f92ca6a7e81_1996x1030.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Steps followed to derive the DPO loss function</figcaption></figure></div><p><strong>Proof sketch.</strong> Beginning with the standard RLHF training objective, we can derive the training loss used in DPO by following four key steps (shown above):</p><ol><li><p>Deriving an expression for the optimal policy in RLHF. </p></li><li><p>Rearranging this expression to form an implicit reward function.</p></li><li><p>Putting the implicit reward into the Bradley-Terry preference model.</p></li><li><p>Training an LLM to match this implicit preference model&#8212;<em>this is what we are doing in the DPO training process</em>.</p></li></ol><p>The above steps start with the objective used to train LLMs in RLHF and ends with the DPO loss function. In this derivation, we reformulate the RLHF optimization problem to arrive at the DPO training methodology. As we will see, RLHF and DPO are intricately related&#8212;<em>they are trying to solve the same optimization problem</em>! By studying the derivation below, we gain a deeper grasp of the relationship between these techniques. </p><p><strong>(Step One) Optimal solution to RLHF. </strong>To derive the DPO loss, we need to begin from the initial RLHF objective that we are trying to solve, which has been copied again below for readability. However, instead of using our learned reward model <code>RM</code> in this notation, we use a general reward function <code>r(x, y)</code>. The general reward function can include&#8212;<em>but is not limited to</em>&#8212;our reward model.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Box!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Box!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 424w, https://substackcdn.com/image/fetch/$s_!6Box!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 848w, https://substackcdn.com/image/fetch/$s_!6Box!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 1272w, https://substackcdn.com/image/fetch/$s_!6Box!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Box!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png" width="590" height="43.76373626373626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:108,&quot;width&quot;:1456,&quot;resizeWidth&quot;:590,&quot;bytes&quot;:52396,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Box!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 424w, https://substackcdn.com/image/fetch/$s_!6Box!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 848w, https://substackcdn.com/image/fetch/$s_!6Box!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 1272w, https://substackcdn.com/image/fetch/$s_!6Box!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F205b27bd-b961-48a7-a60d-37ddded1a7e5_1594x118.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard RLHF objective with a general reward function</figcaption></figure></div><p>Starting with this objective, we can follow the steps below to find a closed-form expression for the optimal solution to this objective. Put simply, we are solving for the value of <code>&#960;</code> that actually maximizes the RLHF objective shown below!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7qBS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7qBS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 424w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 848w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 1272w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7qBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png" width="1456" height="767" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:767,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:398209,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7qBS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 424w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 848w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 1272w, https://substackcdn.com/image/fetch/$s_!7qBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60bc6255-3a88-4267-aa3f-cc535b8c8751_1890x996.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the last two steps of the derivation above, we introduce a function <code>Z(x)</code>, which we will call the <em>partition function</em>. The partition function is defined below. As we can see, the partition function only depends upon the reference policy and the input prompt <code>x</code>; there is no dependence upon the current policy or completion.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iLmx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iLmx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 424w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 848w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 1272w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iLmx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png" width="386" height="68.41417910447761" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:190,&quot;width&quot;:1072,&quot;resizeWidth&quot;:386,&quot;bytes&quot;:39661,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iLmx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 424w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 848w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 1272w, https://substackcdn.com/image/fetch/$s_!iLmx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cfbdd1a-069a-4110-8a74-29a6137bcfe3_1072x190.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The partition function used in DPO</figcaption></figure></div><p>The name &#8220;partition function&#8221; is borrowed from fields like probability theory and statistical mechanics; see <a href="https://en.wikipedia.org/wiki/Partition_function_(mathematics)">here</a>. At the simplest level, the partition function is just a normalization term used in the theoretical derivation of DPO. We use <code>Z(x)</code> to ensure that the probability distribution we derive&#8212;<em>in this case the optimal policy to the RLHF objective</em>&#8212;sums to one and, therefore, forms a valid distribution.</p><p>Now that we understand the partition function, we will pick up the derivation from the equation in the red box shown above. Specifically, we will extract a portion of this term to define the expression below. We refer to this term as the &#8220;optimal policy&#8221;&#8212;<em>the reason for this will become clear soon</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vnbb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vnbb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 424w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 848w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 1272w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vnbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png" width="578" height="256.4478021978022" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:278965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vnbb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 424w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 848w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 1272w, https://substackcdn.com/image/fetch/$s_!Vnbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41ac0cad-62d2-429a-a305-70d93f4bd2b3_1930x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As mentioned before, the partition function is used as a normalization term for the optimal policy in the above expression. We know that the optimal policy defined above is a valid probability distribution because:</p><ol><li><p>The value of the optimal policy is <code>&#8805;</code> <code>0</code> for all possible completions <code>y</code>.</p></li><li><p>The sum of the optimal policy across all completions <code>y</code> is equal to <code>1</code>.</p></li></ol><p>The first property is obvious&#8212;<em>all components of the optimal policy are non-negative</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Proof of the second property is provided below, where we directly see how the partition function <code>Z(x)</code> is used to normalize the optimal policy distribution.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n-fh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n-fh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 424w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 848w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 1272w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n-fh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png" width="724" height="215.80769230769232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:434,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:240802,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n-fh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 424w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 848w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 1272w, https://substackcdn.com/image/fetch/$s_!n-fh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7970973e-d3d9-4465-8bda-85016450fa53_2252x672.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Now that we have defined (and verified the validity of) the optimal policy, we can return to the original expression in which this term appeared and substitute in the expression for the optimal policy. This yields the equation shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-N5w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-N5w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 424w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 848w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 1272w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-N5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-N5w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 424w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 848w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 1272w, https://substackcdn.com/image/fetch/$s_!-N5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fce593-3f5f-4119-bb5c-aba95cfd124c_2018x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the final term above, we see the crux of this derivation: <em>the standard RLHF objective is minimized by finding the policy &#960; that minimizes the KL divergence with the optimal policy</em>. Since the KL divergence reaches its minimum value (zero) when the two probability distributions are identical<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, the solution to this optimization is the optimal policy itself&#8212;<em>hence the name</em>. Therefore, we can express the optimal solution to the standard RLHF objective as shown in the equation below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RLii!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RLii!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 424w, https://substackcdn.com/image/fetch/$s_!RLii!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 848w, https://substackcdn.com/image/fetch/$s_!RLii!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 1272w, https://substackcdn.com/image/fetch/$s_!RLii!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RLii!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png" width="514" height="160.625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:1456,&quot;resizeWidth&quot;:514,&quot;bytes&quot;:263404,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RLii!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 424w, https://substackcdn.com/image/fetch/$s_!RLii!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 848w, https://substackcdn.com/image/fetch/$s_!RLii!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 1272w, https://substackcdn.com/image/fetch/$s_!RLii!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a923561-76b7-4900-8566-15adb58a71d8_2178x680.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Optimally solving the standard RLHF objective</figcaption></figure></div><p><strong>(Step Two) Deriving an implicit reward.</strong> From here, we can take our expression for the optimal policy shown above and rearrange it to derive an expression for the reward function&#8212;<em>in terms of the optimal policy</em>&#8212;as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rZ7H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rZ7H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 424w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 848w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rZ7H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png" width="1456" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rZ7H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 424w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 848w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!rZ7H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd3ddc8a-8e48-4573-82f8-1cd97e3f2d6f_1908x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, we have derived a reparameterization of our reward. However, this reward function does not depend upon any explicit reward model. Rather, we estimate the reward purely using probabilities computed from the optimal policy and the reference policy&#8212;<em>we will call this an &#8220;implicit&#8221; reward</em>.</p><blockquote><p><em>&#8220;This change-of-variables approach avoids fitting an explicit, standalone reward model&#8230; the policy network represents both the language model and the (implicit) reward.&#8221;</em> - from [1]</p></blockquote><p>Now, the only remaining issue is the <code>Z(x)</code> term in our implicit reward. The partition function takes a sum over all possible completions <code>y</code>, so computing the value of <code>Z(x)</code> is expensive in practice. Going further, the reward function <code>r(x, y)</code>, which we cannot directly compute without training a standalone reward model, also appears in the expression for <code>Z(x)</code>. To solve this, we need to revisit the Bradley-Terry model and combine it with our implicit reward function.</p><p><strong>(Step Three) Bradley-Terry preference model.</strong> Under the Bradley-Terry model of preference, we can compute the probability that a given completion is preferred to another. In most cases, the input to this preference model is the explicit reward&#8212;<em>predicted by a reward model</em>&#8212;for each completion. In the case of DPO, we replace this explicit reward with our implicit reward function; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S8jx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S8jx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 424w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 848w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 1272w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S8jx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png" width="1456" height="731" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295594,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S8jx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 424w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 848w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 1272w, https://substackcdn.com/image/fetch/$s_!S8jx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dbb1b28-93cc-4105-9108-1e3949bb3101_1928x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As shown in the final equation above, we now have an expression for the Bradley-Terry model of preference that uses our implicit reward function, where the implicit reward depends only upon the optimal policy and a reference policy. Due to the pairwise nature of the Bradley-Terry expression and the fact that the value of <code>Z(x)</code> depends only upon <code>x</code> (and not <code>y</code>), the <code>Z(x)</code> components of the implicit reward function actually cancel out when subtracting the implicit reward for the chosen completion from the implicit reward for the rejected completion. </p><p><strong>(Step Four) Training our policy.</strong> The expression above depends upon the optimal policy, which is fixed&#8212;<em>this optimal policy is the solution to the RLHF objective that we are trying to solve</em>. From here, we must determine how to derive a training objective that can recover this optimal policy. To do this, DPO substitutes the optimal policy in the above expression with a learned policy, as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7-U7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7-U7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 424w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 848w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 1272w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7-U7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png" width="1456" height="202" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:202,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7-U7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 424w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 848w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 1272w, https://substackcdn.com/image/fetch/$s_!7-U7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50e1e6d3-2cef-4f1c-8567-31877238906b_2294x318.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>How can we make these two expressions equal?</em> We need to train our learned policy! Specifically, we can formulate a ranking loss that optimizes our learned policy to empirically maximize the probability of chosen responses being preferred to rejected responses based on our implicit reward function. By doing this, we ensure that our preference model is accurate and, therefore, matches that of the optimal policy. Besides replacing explicit rewards with implicit rewards, <em>this loss function is the same exact training objective used by standard reward models</em>; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0MIh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0MIh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 424w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 848w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 1272w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0MIh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png" width="1456" height="319" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:319,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214613,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0MIh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 424w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 848w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 1272w, https://substackcdn.com/image/fetch/$s_!0MIh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F277b6c07-adf3-4567-8e6f-ebb49820993b_2272x498.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The final loss expression derived for DPO</figcaption></figure></div><p>We might also notice that this loss function is identical to the training objective for DPO&#8212;<em>we have now fully derived the DPO training objective starting from the training objective for RLHF</em>. The training process for DPO learns an implicit reward model based upon our policy. By learning this implicit reward function, we obtain a policy that matches the optimal policy from RLHF. </p><h4>Does DPO <em>actually</em> yield an optimal policy?</h4><blockquote><p><em>&#8220;The [DPO] optimization objective is equivalent to a Bradley-Terry model with an [implicit] reward parameterization and we optimize our parametric model equivalently to the reward model optimization&#8230; we show that [this objective] does not constrain the class of learned reward models and allows for the exact recovery of the optimal policy.&#8221;</em> - from [1]</p></blockquote><p>Based on the above derivation, training an LLM using the DPO loss will yield a model that has the same preference distribution&#8212;<em>induced by the implicit reward</em>&#8212;as the optimal policy. In other words, the implicit reward function learned by our policy via the DPO loss will correctly rank chosen and rejected completions in our preference dataset. However, the goal of DPO is not to train a model with a good implicit reward function&#8212;<em>we want to align our LLM and derive a policy that generates high-quality completions</em>! Luckily, authors in [1] provide a final proof showing that, in addition to learning a high-quality implicit reward function, the policy derived via DPO should match the optimal policy from RLHF. </p><div class="pullquote"><p>Two reward functions <code>r(x, y)</code> and <code>r&#8217;(x, y)</code> are equivalent if and only if <code>r(x, y) - r&#8217;(x, y) = f(x)</code> for some function <code>f(&#8226;)</code>.</p></div><p><strong>Equivalent rewards.</strong> To begin the proof, we can first specify an <a href="https://en.wikipedia.org/wiki/Equivalence_relation">equivalence relation</a> for reward functions. This is just a definition that captures what it means for two reward functions to be equal; see above. Put simply, reward functions are considered equivalent if their difference in reward only depends upon the prompt and not the completion. Using this definition, we show below that two equivalent reward functions are guaranteed to yield the same preference distribution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E4Zh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E4Zh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 424w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 848w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E4Zh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png" width="1456" height="855" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:346142,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E4Zh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 424w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 848w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!E4Zh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9019ae85-f02a-4366-8df5-7abd7ad6afe9_1928x1132.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>We can also write a similar proof to show that two equivalent reward functions, when plugged into the standard RLHF objective that we explored in the prior section, are guaranteed to yield the same optimal policy; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zkxp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zkxp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 424w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 848w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zkxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png" width="1456" height="694" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zkxp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 424w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 848w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!zkxp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3e3e6e7-0abb-4085-a1fd-6e6b50f93437_2242x1068.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Proving an optimal policy.</strong> Given the above results, the last step in this proof is to simply show that the implicit reward function used within DPO is equivalent to the actual reward used within RLHF. If these two reward functions satisfy the equivalence relation, then we know that DPO will yield the same optimal policy as RLHF based on the findings shown above. To prove this final result, we can start by considering an arbitrary reward function <code>r(x, y)</code> used by RLHF. Our goal is to show that the implicit reward from DPO is equivalent to <code>r(x, y)</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jqck!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jqck!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 424w, https://substackcdn.com/image/fetch/$s_!jqck!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 848w, https://substackcdn.com/image/fetch/$s_!jqck!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 1272w, https://substackcdn.com/image/fetch/$s_!jqck!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jqck!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png" width="412" height="193.2764705882353" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:638,&quot;width&quot;:1360,&quot;resizeWidth&quot;:412,&quot;bytes&quot;:132635,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jqck!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 424w, https://substackcdn.com/image/fetch/$s_!jqck!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 848w, https://substackcdn.com/image/fetch/$s_!jqck!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 1272w, https://substackcdn.com/image/fetch/$s_!jqck!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1671b7-f195-4365-99e7-fde33bc0c3d1_1360x638.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Given an arbitrary reward, we can define the modified<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> reward expression shown above. This expression just subtracts an extra term (i.e., the log of the partition function) from <code>r(x, y)</code>. Notice also that the term we subtract from <code>r(x, y)</code> only depends on <code>x</code>. For this reason, the modified reward expression is equivalent to <code>r(x, y)</code> according to the equivalence relation that we defined earlier. </p><blockquote><p><em>&#8220;The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class.&#8221;</em> - from [1]</p></blockquote><p>To prove the desired result, we have to draw upon our prior expression that rearranges the optimal RLHF solution to produce an implicit reward. If we plug this implicit reward into the modified reward expression above, we get a reward&#8212;<em>which is known to be equivalent to </em><code>r(x, y)</code><em>!</em>&#8212;that matches the implicit reward in DPO; see below. As a result, we now know that the implicit reward used by DPO satisfies the equivalence relation with <code>r(x,y)</code>, which completes the proof.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wtj1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wtj1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 424w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 848w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wtj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png" width="496" height="306.5934065934066" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1456,&quot;resizeWidth&quot;:496,&quot;bytes&quot;:288833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wtj1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 424w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 848w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!wtj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03f0f2ad-1ca3-432b-b7a0-b53e12b8ef05_1802x1114.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Key takeaway.</strong> Before we conclude this section, we should quickly contextualize the result that we just proved. In the prior section, we derived an expression for the preference distribution induced by the implicit reward of the optimal policy (or solution) to the standard RLHF objective. After this expression is derived, we can easily train a model to have an implicit reward function that matches this preference distribution by adopting the same training strategy as a normal reward model. Therefore, <em>the key training procedure behind DPO centers around training an (implicit) reward model</em>, hence the name of the paper; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x7cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x7cQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 424w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 848w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 1272w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x7cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png" width="544" height="156.54945054945054" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1456,&quot;resizeWidth&quot;:544,&quot;bytes&quot;:188994,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x7cQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 424w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 848w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 1272w, https://substackcdn.com/image/fetch/$s_!x7cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e90bb66-5833-4f58-bfe5-b17eef65c3e2_2092x602.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>A common misconception of DPO is that it removes the reward model, which is not true. In fact, <em>DPO is completely based upon reward modeling</em>. The reward model is just implicit, which means we can avoid training an explicit reward model.</p><blockquote><p><em>&#8220;What is often misunderstood is that DPO is learning a reward model at its core, hence the subtitle of the paper Your Language Model is Secretly a Reward Model. It is easy to confuse this with the DPO objective training a policy directly&#8221;</em> - <a href="https://rlhfbook.com/c/12-direct-alignment.html">RLHF book</a></p></blockquote><p>Given that the training procedure for DPO is based upon reward modeling, it&#8217;s not immediately obvious that training an LLM in this way will actually yield an optimal policy. <em>Could our resulting model have an accurate implicit reward function but still not generate high-quality completions?</em> In this section, we prove this should not be the case. If we train a model to match the implicit preference distribution of the optimal policy, then the resulting policy is also guaranteed to be optimal! Put simply, DPO indirectly provides us with a policy that is comparable in quality to one derived via training with RLHF, <em>making it a valid preference tuning alternative that is significantly less complex than techniques like PPO-based RLHF</em>. </p><h4>Why does DPO work?</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SFV3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SFV3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 424w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 848w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 1272w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SFV3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png" width="1456" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320768,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SFV3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 424w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 848w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 1272w, https://substackcdn.com/image/fetch/$s_!SFV3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd15cddcb-926d-478b-8ba3-d59e42672c57_2108x832.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Gradient of DPO loss function</figcaption></figure></div><p>To gain a deeper understanding of DPO and why it works well, we can look at the structure of the gradient for DPO&#8217;s loss function; see above<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. There are three key terms in this expression, colored in red (with part of the term in orange), blue and green for clarity. The purpose for each of these terms is as follows:</p><ol><li><p>The first (red) term is a weight&#8212;<em>falling in the range </em><code>[0, 1]</code><em> due to the sigmoid function</em>&#8212;that increases as the implicit reward of the rejected completion increases relative to that of the chosen completion. In other words, this term assigns a higher weight to implicit reward estimates that are wrong.</p></li><li><p>The second (blue) term is the positive gradient of the likelihood for the chosen completion with respect to the LLM&#8217;s parameters, which serves the purpose of increasing the likelihood for the chosen completion.</p></li><li><p>The third (green) term is the negative gradient of the likelihood for the rejected completion with respect to the LLM&#8217;s parameters, which serves the purpose of decreasing the likelihood for the rejected completion.</p></li></ol><p>These terms work together to simultaneously <em>i)</em> increase the likelihood of chosen completions and <em>ii)</em> decrease the likelihood of rejected completions, where extra emphasis (i.e., a larger update to our LLM&#8217;s parameters) is placed upon cases where the implicit reward estimate assigned by our LLM is incorrect. </p><blockquote><p><em>&#8220;Examples are weighed by how much higher the implicit reward model rates the dispreferred completions scaled by beta&#8230; how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint.&#8221;</em> - from [1]</p></blockquote><p><strong>Weighting coefficient.</strong> Authors in [1] observe that all three sub-components of DPO&#8217;s loss gradient are necessary for the algorithm to work well. Notably, if we remove the first weighting term from this gradient&#8212;<em>creating a gradient that uniformly increases the likelihood of all chosen completions and decreases the likelihood of all rejected completions</em>&#8212;the resulting policy is low-quality and even tends to completely degenerate when generating text; see below. Such a training algorithm is called unlikelihood training and has been explored in the past [5]. The simple weighting term added to the loss gradient by DPO completely transforms this approach, making it capable of performing high-quality LLM alignment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3pFg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3pFg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 424w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 848w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 1272w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3pFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png" width="580" height="294.7802197802198" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:1456,&quot;resizeWidth&quot;:580,&quot;bytes&quot;:197784,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3pFg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 424w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 848w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 1272w, https://substackcdn.com/image/fetch/$s_!3pFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dee4f26-06ed-47da-9a32-611969a208d6_1546x786.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LLMs trained with unlikelihood training tend to degenerate (from [1])</figcaption></figure></div><h4>Implementing DPO from Scratch</h4><p>Although the derivation of DPO is complex, the technique is actually quite simple to use practically. In fact, DPO played a huge role in democratizing research on LLM post-training for those outside of top labs [3]. Algorithms like PPO-based RLHF are harder to tune and require significant compute resources. In contrast, DPO uses a standard classification (or ranking) loss with no RL and only keeps two copies of the model&#8212;<em>instead of four</em>&#8212;throughout the training process. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WwdK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WwdK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 424w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 848w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 1272w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WwdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png" width="1456" height="527" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:527,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226862,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WwdK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 424w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 848w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 1272w, https://substackcdn.com/image/fetch/$s_!WwdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb092dc43-cad8-4b1d-8129-2ce8a7860936_2420x876.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard DPO training pipeline</figcaption></figure></div><p><strong>DPO training pipeline.</strong> The standard training process with DPO is depicted above. We begin the process with a diverse set of prompts that capture the use case(s) for which we are training our model. From here, we use our reference policy to generate pairs of completions for each prompt and have human raters provide preference annotations for each pair. Once this preference dataset is available, we perform maximum likelihood estimation by training our model to minimize the DPO loss that we derived earlier over the preference dataset. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S6aC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S6aC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 424w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 848w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 1272w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S6aC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png" width="1456" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:411151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S6aC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 424w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 848w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 1272w, https://substackcdn.com/image/fetch/$s_!S6aC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28ea6924-dbc4-49e0-be79-d9392492efa5_2004x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing the loss for DPO in PyTorch (from [1])</figcaption></figure></div><p><strong>Loss implementation.</strong> We can pretty easily implement the loss function for DPO in PyTorch&#8212;<em>it is just a ranking loss applied over implicit rewards derived from the current and reference policies</em>. The example implementation of the loss from [1] is copied below for reference, where we see that the loss is computed by:</p><ol><li><p>Getting the log probabilities assigned to each completion&#8212;<em>both chosen and rejected</em>&#8212;by the current policy and the reference policy. </p></li><li><p>Computing the probability ratio between chosen and rejected completions for both the current policy and the reference policy.</p></li><li><p>Using the above probability ratios to construct the final DPO loss. </p></li></ol><p><strong>Handling offline preference data.</strong> DPO is fundamentally an <a href="https://huggingface.co/learn/deep-rl-course/en/unitbonus3/offline-online">offline preference learning algorithm</a>&#8212;<em>we are optimizing our model over a static preference dataset</em>. In the pipeline outlined above, we use our reference model to generate completions in our preference dataset. In most practical applications, however, this may not be the case. As a practitioner, we may simply download a preference dataset like UltraFeedback [4] online and train our model over this static dataset using DPO. In such cases, the actual reference model is unknown and may be different from the reference model we used in DPO training, creating a distribution shift.</p><blockquote><p><em>&#8220;Since the preference datasets are sampled using the SFT model, we initialize the reference policy to the SFT model whenever available. However, when the SFT model is not available, we initialize the reference policy by maximizing likelihood of preferred completions. This procedure helps mitigate the distribution shift between the true reference distribution and the reference policy used by DPO.&#8221;</em> - from [1]</p></blockquote><p>To minimize this distribution shift and ensure that the actual reference model aligns well with the completions present in our preference dataset, authors in [1] recommend the procedure depicted below. In this procedure, we first perform supervised finetuning of our reference model on the chosen completions in the preference dataset, then further train this model with DPO afterwards. This preliminary SFT training stage ensures the reference policy in DPO is not too different from the true reference policy used to create the preference dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pN0e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pN0e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 424w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 848w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pN0e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png" width="1456" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:347211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/167254905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pN0e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 424w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 848w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!pN0e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70d83b0-01c7-41b6-b6f4-e86c446c036d_1954x1248.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Mitigating distribution shift from offline preference data in DPO</figcaption></figure></div><p>The last consideration for implementing DPO is correctly setting the &#946; hyperparameter, which controls the amount that the trained policy can differ from the reference policy. Remember, &#946; is the weight by which we multiply the KL constraint in the RLHF objective, which controls the strength of preference alignment in DPO&#8212;<em>lower &#946; values mean that the model is updated more aggressively to adapt to observed preference in the data</em>. Usually, &#946; is set to a value in the range <code>[0, 1]</code>, where lower values are more common. For example, <code>&#946; = 0.1</code> is a popular choice, though authors in [1] explore both <code>&#946; = 0.1</code> and <code>&#946; = 0.5</code>.</p><p><strong>Full DPO example.</strong> One of the easiest ways to finetune your own LLM with DPO is by using the <a href="https://huggingface.co/docs/trl/en/dpo_trainer">DPOTrainer</a> in the <a href="https://huggingface.co/docs/trl/en/index">HuggingFace TRL package</a>. To perform a DPO training run, you just need to <em>i)</em> load a preference dataset like <a href="https://huggingface.co/datasets/openbmb/UltraFeedback">UltraFeedback</a>; <em>ii) </em>choose a model / tokenizer (e.g., a smaller model like <a href="https://huggingface.co/Qwen/Qwen3-0.6B">Qwen3-0.6B</a> is great if we don&#8217;t have big GPUs); and <em>iii)</em> execute the DPO trainer as shown in the code below.</p><pre><code>from trl import DPOConfig, DPOTrainer

# load model and data
model = &lt;load our model&gt;
tokenizer = &lt;load our tokenizer&gt;
train_dataset = &lt;load our preference dataset&gt;

# configure DPO training process training_args = DPOConfig(output_dir="./dpo_logs/")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)

# execute DPO training
# run the below command to execute this script
# &gt; accelerate launch &lt;script name&gt;
trainer.train()</code></pre><h2>Summary and Key Takeaways</h2><p>Direct Preference Optimization (DPO) is a preference-tuning method for LLMs that indirectly solves the RLHF objective while avoiding explicit reward models and RL. In DPO, we reparameterize the RLHF objective to form an implicit reward function derived from the policy itself (and a reference policy). Then, we train our LLM over a static preference dataset to optimize this implicit reward function, similarly to a standard reward model. By solving this implicit reward modeling objective, <em>we indirectly yield a policy that solves the RLHF objective</em>.</p><p>This approach offers a simpler, more stable, and computationally efficient alternative to RL-based alignment methods, making high-quality LLM alignment more accessible. However, several works have studied the differences between (offline) direct alignment algorithms like DPO and alignment techniques that use online RL (e.g., PPO-based RLHF), finding that a performance gap can exist [11, 12]. Despite this fact, DPO is still heavily used in LLM post-training&#8212;<em>often in tandem with online algorithms</em>&#8212;due to its simplicity and effectiveness. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." <em>Advances in neural information processing systems</em> 36 (2023): 53728-53741.</p><p>[2] Stiennon, Nisan, et al. "Learning to summarize with human feedback." <em>Advances in neural information processing systems</em> 33 (2020): 3008-3021.</p><p>[3] Tunstall, Lewis, et al. "Zephyr: Direct distillation of lm alignment." <em>arXiv preprint arXiv:2310.16944</em> (2023).</p><p>[4] Cui, Ganqu, et al. "Ultrafeedback: Boosting language models with scaled ai feedback, 2024." <em>URL https://arxiv. org/abs/2310.01377</em>.</p><p>[5] Welleck, Sean, et al. "Neural text generation with unlikelihood training." <em>arXiv preprint arXiv:1908.04319</em> (2019).</p><p>[6] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training, 2024." <em>URL https://arxiv. org/abs/2411.15124</em> 297 (2025).</p><p>[7] Yang, An, et al. "Qwen3 technical report." <em>arXiv preprint arXiv:2505.09388</em> (2025).</p><p>[8] Dubey, Abhimanyu, et al. "The llama 3 herd of models." <em>arXiv e-prints</em> (2024): arXiv-2407.</p><p>[9] Kaplan, Jared, et al. "Scaling laws for neural language models." <em>arXiv preprint arXiv:2001.08361</em> (2020).</p><p>[10] Sheng, Guangming, et al. "Hybridflow: A flexible and efficient rlhf framework." <em>Proceedings of the Twentieth European Conference on Computer Systems</em>. 2025.</p><p>[11] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." <em>arXiv preprint arXiv:2405.08448</em> (2024).</p><p>[12] Ivison, Hamish, et al. "Unpacking dpo and ppo: Disentangling best practices for learning from preference feedback." <em>Advances in neural information processing systems</em> 37 (2024): 36602-36633.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See <a href="https://rlhfbook.com/c/11-policy-gradients.html">here</a> for an in-depth explanation of reinforcement learning in the context of LLMs. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>More specifically, "online" means that the policy is updated iteratively with new samples generated at each step, while "offline" means that all training data is fixed in advance.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The word &#8220;policy&#8221; is RL jargon for the LLM or model that we are training (with RL). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>More specifically, the KL divergence is measuring how much information is lost when the given distribution is used to approximate the reference distribution. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The reference model is not always the SFT model. It can also be a previous model checkpoint from RL training. For example, if four phases or rounds of RLHF are performed sequentially, then the reference model for the second phase of RLHF could be the model resulting from the first phase of RLHF. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The optimal policy is a product of the partition function, the reference policy, and an exponential function, all of which cannot be less than zero. Therefore, the product of these terms, which form the optimal policy, must also be non-negative. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This is known due to <a href="https://en.wikipedia.org/wiki/Gibbs%27_inequality">Gibbs&#8217; inequality</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>In [1], this proof is provided assuming the more general <a href="https://statisticaloddsandends.wordpress.com/2024/04/24/what-is-the-plackett-luce-model/">Plackett-Luce model</a> (see Appendix A.5 on page 17), but we rewrite this proof using the Bradley-Terry model for simplicity and to match the rest of the explanation in this overview. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>In [1], authors describe this modified function as a &#8220;projection&#8221; of the reward function. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Remember, DPO trains the LLM using MLE. In other words, our LLM&#8217;s parameters are directly updated by repeatedly <em>i)</em> computing this gradient over a batch of data, <em>ii)</em> multiplying the gradient by a scalar factor (i.e., the learning rate), and <em>iii)</em> subtracting this scaled gradient from our model parameters. If you want to understand how this gradient is derived, please see page 17 of <a href="https://arxiv.org/abs/2305.18290">this paper</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Reward Models]]></title><description><![CDATA[Modeling human preferences for LLMs in the age of reasoning models...]]></description><link>https://cameronrwolfe.substack.com/p/reward-models</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/reward-models</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 30 Jun 2025 09:33:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6f2dc466-5918-4e2d-9698-c2626e71089f_1988x1116.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D_ya!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D_ya!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 424w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 848w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D_ya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1176394,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!D_ya!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 424w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 848w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 1272w, https://substackcdn.com/image/fetch/$s_!D_ya!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa99610ef-e0a2-4eba-b3d5-8326f75b0cb0_1988x1114.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 4, 14])</figcaption></figure></div><p>Reward models (RMs) are a cornerstone of large language model (LLM) research, enabling significant advancements by incorporating human preferences into the training process. Despite their critical role, RMs are often overlooked. Practical guidance on how to train and use them effectively remains scarce&#8212;<em>particularly as RM-free techniques like reinforcement learning with verifiable rewards gain popularity</em>. Nevertheless, training LLMs with <a href="https://rlhfbook.com/c/11-policy-gradients.html">PPO-based reinforcement learning</a> continues to be a crucial factor in developing top foundation models. In this overview, we will build a deep understanding of RMs from the ground up, clarifying their historical and ongoing significance in the rapidly evolving LLM ecosystem.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>What is a Reward Model?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UkPk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UkPk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 424w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 848w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1272w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png" width="1456" height="469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:469,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UkPk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 424w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 848w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1272w, https://substackcdn.com/image/fetch/$s_!UkPk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94ec9186-9bf4-4b06-a6b7-7eb119b91e1a_2020x650.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>&#8220;Reward models broadly have been used extensively in reinforcement learning research as a proxy for environment rewards&#8230; The most common reward model predicts the probability that a piece of text was close to a preferred piece of text from the training comparisons.&#8221;</em> - <a href="https://rlhfbook.com/c/07-reward-models.html">RLHF book</a></p></blockquote><p>Reward models (RMs) are specialized LLMs&#8212;<em>usually derived from an LLM that we are currently training</em>&#8212;that are trained to predict a human preference score given a prompt and a candidate completion as input; see above. A higher score from the RM indicates that a given completion is likely to be preferred by humans. </p><p>As a first step, we must build a fundamental understanding of reward models (RMs), how they are created, and how we use them in the context of LLMs. In this section, we will focus on understanding the following:</p><ul><li><p>The motivation for RMs, as derived from statistical models of preferences.</p></li><li><p>The architecture and structure used by most RMs.</p></li><li><p>The training process for an RM.</p></li></ul><p>To understand how RMs are used, we need more context around reinforcement learning (RL) and LLM post-training, which will be covered in the next section. </p><h4>The Bradley-Terry Model of Preference</h4><p>The standard implementation of an RM is derived from the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley-Terry model of preference</a>&#8212;<em>a statistical model used to rank paired comparison data based on the relative strength or performance of items in the pair</em>. Given two events <code>i</code> and <code>j</code> drawn from the same distribution, the Bradley-Terry model defines the probability that item <code>i</code> wins&#8212;<em>or is preferred</em>&#8212;compared to item <code>j</code> as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zgW4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zgW4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 424w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 848w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 1272w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zgW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png" width="320" height="221.53846153846155" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:612,&quot;width&quot;:884,&quot;resizeWidth&quot;:320,&quot;bytes&quot;:78242,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zgW4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 424w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 848w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 1272w, https://substackcdn.com/image/fetch/$s_!zgW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bb1622e-98a9-4e69-9220-433c8085cd93_884x612.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Pairwise comparison probability from the Bradley-Terry model</figcaption></figure></div><p>In the context of LLMs, items <code>i</code> and <code>j</code> are two completions generated by the same LLM and from the same prompt (i.e., these completions are sampled from the same distribution). The RM assigns a score to each of these completions, then we use the above expression from the Bradley-Terry model to derive a probability that completion <code>i</code> is preferred to completion <code>j</code>. Put simply, <em>we use the Bradley-Terry model to express probabilities for pairwise comparisons between two completions</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rKGp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png" width="288" height="527.8426229508196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1118,&quot;width&quot;:610,&quot;resizeWidth&quot;:288,&quot;bytes&quot;:104466,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rKGp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 424w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 848w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!rKGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F609d472d-1a82-4fd4-8c25-fe4e7253ee13_610x1118.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [14])</figcaption></figure></div><p><strong>Preference data.</strong> Pairwise preference data is used&#8212;<em>and has been used for quite some time [14]</em>&#8212;extensively in LLM post-training. Such data is comprised of many different prompts, and we aim to maximize the diversity of prompts in our data. The prompt distribution should be representative of prompts a model will see in the wild. For each prompt, we have a pair of candidate completions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, where one completion has been identified&#8212;<em>usually by a human, but <a href="https://cameronrwolfe.substack.com/p/rlaif-reinforcement-learning-from">sometimes by a model</a></em>&#8212;as preferable to the other; see above. A dataset of prompts with associated chosen and rejected completions is referred to as a (human) preference dataset. </p><h4>How do RMs work?</h4><p>We know that RMs are based upon the Bradley-Terry model of preference, but there are many ways that we could implement such a statistical model practically. In the domain of LLMs, these models are implemented&#8212;<em>perhaps unsurprisingly</em>&#8212;with an LLM. Compared to standard (generative) <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only LLMs</a>, however, RMs modify both the underlying architecture and training objective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M_zU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png" width="1456" height="755" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:755,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M_zU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 424w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 848w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1272w, https://substackcdn.com/image/fetch/$s_!M_zU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0757f3b6-d8a3-49da-80dc-74b9bcb9a1aa_1716x890.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Schematic depiction of RM architecture</figcaption></figure></div><p><strong>RM architecture.</strong> An RM takes a prompt-completion pair from an LLM as input and outputs a (scalar) preference score. In practice, the RM is implemented with an LLM by adding a linear head to the end of the decoder-only architecture; see above. Specifically, the LLM outputs a list of token vectors&#8212;<em>one for each input token vector</em>&#8212;and we pass the final vector from this list through the linear head to produce a single, scalar score. <em>We can think of the RM as an LLM with an extra classification head used to classify a given completion as preferred or not preferred.</em> </p><p><strong>Training process.</strong> The parameters of the RM are usually initialized with an existing policy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, which we will refer to as the RM&#8217;s &#8220;base&#8221; model. Several choices exist for the policy with which to initialize the RM; e.g., the LLM being trained or a prior version of this model like the pretrained base or SFT model. Once the RM is initialized, we add the linear head to this model and train it over a <a href="https://rlhfbook.com/c/06-preference-data.html">preference dataset</a> (i.e., pairs of chosen and rejected model responses to a prompt).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RqTs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RqTs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 424w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 848w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 1272w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RqTs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png" width="1392" height="530" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:1392,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136743,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RqTs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 424w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 848w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 1272w, https://substackcdn.com/image/fetch/$s_!RqTs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd836bca0-f804-4052-8b7f-4eb2b1e2356b_1392x530.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Pairwise probability expressed with respect to the output of an RM</figcaption></figure></div><p>Given a preference pair, we want our RM to assign a higher score to the chosen response relative to the rejected response. In other words, the optimal RM should maximize the probability that the chosen response is preferred to the rejected response. As we learned before, we can use the Bradley-Terry model to express this probability; see above. Rearranging this probability expression, we can derive the loss function shown below, which is a <a href="https://gombru.github.io/2019/04/03/ranking_loss/">pairwise ranking loss</a> that simply encourages the model to assign a higher score to the chosen response. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iPQn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png" width="644" height="209.67441860465115" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1204,&quot;resizeWidth&quot;:644,&quot;bytes&quot;:81656,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!iPQn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 424w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 848w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1272w, https://substackcdn.com/image/fetch/$s_!iPQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc84db389-0e57-4a3c-808b-d48b28a192d6_1204x392.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard loss function formulation for an RM</figcaption></figure></div><p>We can think of this as a <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html">negative log likelihood (NLL) loss</a>, where the probability for NLL is given by the Bradley-Terry model. A visualization of the landscape for this loss is shown below, <em>where we see that the loss is minimized when the chosen score is maximized and the rejected score is minimized</em>. By empirically minimizing this loss function over a large preference dataset, we can (approximately) maximize the expected probability that chosen responses are preferred to rejected responses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qlGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qlGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 424w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 848w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qlGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png" width="534" height="438.64285714285717" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1196,&quot;width&quot;:1456,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:1095703,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qlGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 424w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 848w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!qlGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9004e549-5346-4e1b-8c00-faa76fa72bf6_1568x1288.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Normalizing the reward.</strong> After training, the RM outputs unnormalized scalar values. To lower the variance of the reward function (i.e., make sure the RM&#8217;s output falls in a standard range), we can normalize the RM&#8217;s output such that it assigns an average reward of zero over our preference dataset used for training. Authors in [14] mention using this reward normalization approach. </p><blockquote><p><em>&#8220;At the end of training, we normalize the reward model outputs such that the reference summaries from our dataset achieve a mean score of 0.&#8221;</em> - from [14]</p></blockquote><h4>Implementing an RM</h4><p>To make this discussion more practical, let&#8217;s learn how RMs&#8212;<em>including both the architecture and loss function</em>&#8212;can be implemented using common deep learning frameworks. An RM is just a classification model&#8212; <em>it performs</em> <em><a href="https://huggingface.co/docs/transformers/en/tasks/sequence_classification">text classification</a> over a sequence of text.</em> Given a prompt and response as input, the RM predicts the likelihood (i.e., a single scalar score) that this prompt-response pair is preferred.</p><p><strong>Toy example.</strong> We can implement this via an abstraction like HuggingFace&#8217;s <code>AutoModelForSequenceClassification</code>. An implementation of a small (<a href="https://cameronrwolfe.substack.com/p/language-understanding-with-bert">BERT</a>-based) RM that can be run locally is provided below, where we:</p><ul><li><p>Create the RM using <code>AutoModelForSequenceClassification</code>.</p></li><li><p>Compute the RM&#8217;s output&#8212;<em>in the form of a single logit</em>&#8212;for all chosen and rejected sequences<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p></li><li><p>Compute the RM&#8217;s loss as described above.</p></li></ul><pre><code><code>from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
)
import torch

# Load a tiny model for sequence classification
model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
)

# Chosen prompt-response sequences
chosen_seqs = [
    "I love deep (learning) focus!",
    "Cameron is great at explaining stuff",
    "AGI is coming very soon...",
]

# Rejected prompt-response sequences
rejected_seqs = [
    "I'm not a fan of deep (learning) focus",
    "Cameron doesn't know what he's talking about",
    "AGI is fake and LLMs can't reason!",
]

# Tokenize the chosen / rejected sequences
chosen_inps = tokenizer(
    chosen_seqs,
    return_tensors="pt",
    padding=True,
)
rejected_inps = tokenizer(
    rejected_seqs,
    return_tensors="pt",
    padding=True,
)

# Compute the RM's output
rewards_chosen = model(**chosen_inps).logits[:, 0]
rewards_rejected = model(**rejected_inps).logits[:, 0]

# Compute the RM's loss
loss = -torch.nn.functional.logsigmoid(
    rewards_chosen - rewards_rejected
).mean()
print(loss)</code></code></pre><p>From here, we train the RM <a href="https://docs.pytorch.org/tutorials/beginner/introyt/trainingyt.html">similarly to any other model</a>; i.e., by <em>i)</em> looping over a preference dataset, <em>ii)</em> computing the loss as outlined above, <em>iii)</em> obtaining a gradient via <a href="https://www.youtube.com/watch?v=Ilg3gGewQ5U">backpropagation</a>, <em>iv)</em> performing a gradient update and <em>v)</em> repeating.</p><p><strong>Real RM training example.</strong> For a more practical view of what training an RM looks like at an LLM research lab, we can look at the <a href="https://github.com/allenai/open-instruct/blob/main/open_instruct/reward_modeling.py">RM training script</a> in AI2&#8217;s <a href="https://github.com/allenai/open-instruct">OpenInstruct</a>. This script implements distributed training of an RM&#8212;<em>based upon <a href="https://arxiv.org/abs/2501.00656">OLMo-2</a> or <a href="https://arxiv.org/abs/2409.02060">OLMoE</a></em>&#8212;using <a href="https://huggingface.co/docs/accelerate/en/index">accelerate</a>. The script is quite simple, and most of the code is actually just configuring the training process. We can parse through this training script to find the core RM training loop, copied below for reference. </p><pre><code><code>for _ in range(args.num_train_epochs):
    for data in dataloader:
        training_step += 1

        # Concat the chosen / rejected sequences
        query_responses = torch.cat(
            (
                data[CHOSEN_INPUT_IDS_KEY],
                data[REJECTED_INPUT_IDS_KEY]
            ),
            dim=0,
        )
        with accelerator.accumulate(model):
            # Predict reward for each sequence with RM
            _, predicted_reward, _ = get_reward(
                model,
                query_responses,
                tokenizer.pad_token_id,
                0,
            )

            # Parse chosen / rejected rewards from output
            chosen_reward = predicted_reward[
                :data[CHOSEN_INPUT_IDS_KEY].shape[0]
            ]
            rejected_reward = predicted_reward[
                data[CHOSEN_INPUT_IDS_KEY].shape[0] :
            ]

            # Compute loss and gradient for RM
            loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
            accelerator.backward(loss)

            # Perform parameter update for RM
            optimizer.step()
            optimizer.zero_grad()</code></code></pre><p>As we can see, this code, which is used for training large-scale RMs at a top research lab, is not much different from our toy example! Of course, the training loop is largely made simple by abstractions provided by modern deep learning packages like HuggingFace. However, <em>the key takeaway here is that the concepts we have learned so far directly translate to practical training and usage of RMs</em>. </p><h4>Different Types of RMs</h4><p>So far, we have focused on the standard form of an RM, typically referred to as a classifier-based RM. However, RMs are just models that predict a preference score given a prompt and response, which we can implement in many ways. For example, we can train a custom classifier like <a href="http://arxiv.org/abs/2406.12845">ArmoRM</a> to serve as an RM.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MLb6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MLb6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 424w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 848w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 1272w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MLb6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png" width="613" height="255.134438305709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:452,&quot;width&quot;:1086,&quot;resizeWidth&quot;:613,&quot;bytes&quot;:90364,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MLb6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 424w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 848w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 1272w, https://substackcdn.com/image/fetch/$s_!MLb6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fe250b-0749-4303-8052-2641bc1dff20_1086x452.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p><strong>LLM-as-a-Judge</strong> models can also be used as an RM by simply prompting an LLM judge to provide a preference score; see above. These preference scores can then be taken as the reward signal during training with RL. For a more in-depth overview of LLM-as-a-Judge, please see the article linked below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e4e20a13-0a11-45f8-849a-3df1dad19eb8&quot;,&quot;caption&quot;:&quot;As large language models (LLMs) have become more and more capable, one of the most difficult aspects of working with these models is determining how to properly evaluate them. Many powerful models exist, and they each solve a wide variety of complex, open-ended tasks. As a result, discerning differences in performance between these mo&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Using LLMs for Evaluation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-22T09:34:01.735Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cca744e-8ad5-4266-9680-7da4fe94f497_1878x1052.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/llm-as-a-judge&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:141159804,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:108,&quot;comment_count&quot;:14,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Alternatively, we can use LLM judges to collect synthetic preference data&#8212;<em>using prompts like the one shown below from <a href="https://github.com/tatsu-lab/alpaca_eval">AlpacaEval</a></em>&#8212;and train an RM normally over this synthetic data, as is done by <a href="https://cameronrwolfe.substack.com/i/136751520/constitutional-ai-harmlessness-from-ai-feedback">Constitutional AI</a> [10] and <a href="https://cameronrwolfe.substack.com/i/136751520/rlaif-scaling-reinforcement-learning-from-human-feedback-with-ai-feedback">RLAIF</a> [11]. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tms4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tms4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 424w, https://substackcdn.com/image/fetch/$s_!tms4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 848w, https://substackcdn.com/image/fetch/$s_!tms4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!tms4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tms4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png" width="1456" height="793" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416440,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tms4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 424w, https://substackcdn.com/image/fetch/$s_!tms4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 848w, https://substackcdn.com/image/fetch/$s_!tms4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!tms4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda52e00b-d851-46f4-89a8-0d46e8badbe9_2650x1444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://github.com/tatsu-lab/alpaca_eval">source</a>)</figcaption></figure></div><p><strong>Outcome Reward Models (ORMs)</strong> [12] and <strong>Process Reward Models (PRMs)</strong> [11] are two other commonly-used variants of RMs in the literature. ORMs, which are mostly used for reasoning tasks, predict the probability that a completion is the correct answer to a task. To train an ORM, we collect a preference dataset similarly to before, but each preference pair contains both an incorrect and a correct answer to a given question.  Unlike a standard RM that predicts the reward at a sequence level, the ORM predicts correctness on a per-token basis.</p><blockquote><p><em>&#8220;Our verifiers are language models, with a small scalar head that outputs predictions on a per-token basis.&#8221;</em> - from [12]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fqNz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fqNz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 424w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 848w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 1272w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fqNz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png" width="1456" height="712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:199612,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fqNz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 424w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 848w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 1272w, https://substackcdn.com/image/fetch/$s_!fqNz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5d0e0c-de50-4f50-9b4a-b11d3a2c45fb_1964x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>Like ORMs, PRMs are used primarily for reasoning tasks and predict more granular outputs, but PRMs make predictions after every step of the reasoning process rather than after every token. Although PRMs have been used in a <a href="https://arxiv.org/abs/2501.07301">variety of papers</a>, collecting training data for PRMs is difficult, as they require granular supervision (i.e., a correctness signal at each step of the reasoning process). </p><blockquote><p><em>&#8220;PRMs are reward models trained to output scores at every step in a chain of thought reasoning process. These differ from a standard RM that outputs a score only at an EOS token or a ORM that outputs a score at every token. Process Reward Models require supervision at the end of each reasoning step.&#8221;</em> - <a href="https://rlhfbook.com/c/07-reward-models.html">source</a></p></blockquote><h2>The Role of Reward Models in Post-Training</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dtl3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png" width="1456" height="887" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:887,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289662,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Dtl3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 424w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 848w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1272w, https://substackcdn.com/image/fetch/$s_!Dtl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bde3170-7f57-4f2f-aebb-3af9eb7b6a62_1556x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Early post-ChatGPT LLMs were almost always post-trained using the three-step alignment procedure (shown above) proposed by InstructGPT [3]. This procedure is comprised of the following three steps:</p><ol><li><p><a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">Supervised finetuning (SFT)</a>&#8212;<em>a.k.a. instruction finetuning (IFT)</em>&#8212;trains the model using <a href="https://cameronrwolfe.substack.com/p/language-model-training-and-inference">next-token prediction</a> over examples of good completions.</p></li><li><p>A reward model (RM) is trained over a <a href="https://rlhfbook.com/c/05-preferences.html">human preference dataset</a>.</p></li><li><p>Reinforcement learning (RL) is used to finetune the LLM by using the output of the RM as a training signal. </p></li></ol><p>Collectively, steps two and three in this procedure are called <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">reinforcement learning from human feedback (RLHF)</a>&#8212;<em>we use a reinforcement learning (RL) optimizer to finetune the LLM and incorporate human feedback via preference labels</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QTAv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QTAv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 424w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 848w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 1272w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QTAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png" width="1456" height="619" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:269412,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QTAv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 424w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 848w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 1272w, https://substackcdn.com/image/fetch/$s_!QTAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2914a412-f20e-40d5-8d44-abdb4d77f1be_1754x746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>Today, the story is a bit more complicated; an example of a more modern post-training pipeline (used for <a href="https://arxiv.org/abs/2411.15124">Tulu-3</a> [4]) is provided above. Key differences from the original three step alignment procedure include:</p><ul><li><p>The SFT phase&#8212;<em>although still very common</em>&#8212; is not always used, especially for <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">recent reasoning models</a>; e.g., some variants of <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">DeepSeek-R1</a> forgo SFT and apply RL directly to the pretrained model.</p></li><li><p>RL training is usually performed in several rounds, where fresh data is collected for each round to further improve the LLM&#8217;s capabilities.</p></li><li><p>Several variants of RL (and non-RL-based alternatives) are used&#8212;<em>potentially in tandem</em>&#8212;that may or may not require an RM.</p></li></ul><p>Despite the extra complexity, data quality remains the key determinant of successful post-training even today. In this section, we will cover RL training frameworks at a high level, <em>focusing on the role (if any) of RMs in each of them</em>. </p><h4>RL Training Strategies for LLMs</h4><p>For those who are unfamiliar with the high-level setup used for training LLMs with RL, please see the overview below. A basic understanding of RL in the context of LLMs is a necessary prerequisite for this discussion.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e16f8d5b-e3fa-41b1-8268-9975ef9fdf59&quot;,&quot;caption&quot;:&quot;Recent AI research has revealed that reinforcement learning&#8212;more specifically, reinforcement learning from human feedback (RLHF)&#8212;is a key component of training a state-of-the-art large language model (LLM). Despite this fact, most open-source research on language models heavily emphasizes supervised learning strategies, such as&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Basics of Reinforcement Learning for LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-09-25T09:12:12.520Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef02a687-cf34-4407-ad59-1527571e1a65_2410x1354.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:137266538,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:192,&quot;comment_count&quot;:5,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!87xa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>RL for LLMs.</strong> There are two broad categories of reinforcement learning (RL) training that are heavily leveraged by LLMs: RLHF (i.e., steps two and three of the post-training setup that we outlined above) and <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">reinforcement learning with verifiable rewards (RLVR)</a>. These two variants of RL are depicted below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CJn6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png" width="1456" height="430" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CJn6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 424w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 848w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1272w, https://substackcdn.com/image/fetch/$s_!CJn6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0fd3791-df29-4a92-b185-21f6be4f2ddc_2176x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>From an RL perspective, these techniques are similar. They follow the same high-level training setup and both use RL optimizers based upon <a href="https://cameronrwolfe.substack.com/p/policy-gradients-the-foundation-of">policy gradient algorithms</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> to derive parameter updates. The primary difference between these techniques lies in how we define the reward:</p><ul><li><p>In RLHF, the reward comes from the RM, which provides a human preference score for each of the completions provided by the LLM.</p></li><li><p>RLVR uses <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">deterministic (or verifiable) rewards</a>, where the answer provided by the LLM is marked as either correct or incorrect.</p></li></ul><p>Notably, the deterministic&#8212;<em>usually rules-based</em>&#8212;rewards in RLVR eliminate the need for an RM! Usually, rewards are derived by extracting the LLM&#8217;s final answer from its generated output and comparing this answer (e.g., via exact string match or some form of fuzzy matching) to a known, ground truth answer. From this comparison, we can determine whether the LLM&#8217;s output is correct or not and use this binary signal as a reward for training with RL.</p><p><strong>RLHF vs RLVR.</strong> In <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">more recent frontier models</a>, both styles of RL play a role in the post-training process. We still perform the three step post-training procedure (SFT &#8594; RLHF), which teaches the LLM correct formatting and aligns it to human preferences. However, we now have an additional RLVR step that boosts reasoning capabilities and performance on verifiable tasks; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qyUt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qyUt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 424w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 848w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qyUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png" width="1456" height="661" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qyUt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 424w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 848w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!qyUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabaeef87-7ac7-4442-babd-1d4741b4255d_2608x1184.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://docs.google.com/presentation/d/1FL6pzRT3tjCfJ985emS_2YfujCe_iz6dsyRcDIUFPqs/edit?usp=sharing">source</a>)</figcaption></figure></div><p>More generally, the amount of compute being invested into RL finetuning&#8212;<em>and RLVR in particular</em>&#8212;is also rapidly increasing. This change is motivated by recent results on reasoning models that show clear <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">scaling laws</a> of model performance with respect to the amount of compute used for RL training; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JEX5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JEX5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 424w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 848w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 1272w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JEX5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png" width="298" height="311.7236842105263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:636,&quot;width&quot;:608,&quot;resizeWidth&quot;:298,&quot;bytes&quot;:54789,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JEX5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 424w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 848w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 1272w, https://substackcdn.com/image/fetch/$s_!JEX5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02a4cd97-6c90-456b-9b51-65eaaa4fe677_608x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><blockquote><p><em>&#8220;RLHF is a complex and often unstable procedure&#8230; we introduce a new parameterization of the reward model in RLHF that [allows] us to solve the standard RLHF problem with only a simple classification loss.&#8221;</em> - from [6]</p></blockquote><p><strong>Direct alignment.</strong> RLVR is not the only way to avoid using an RM. In fact, we can still align a model to human preferences&#8212;<em>similarly to RLHF</em>&#8212;while foregoing the RM completely. Such techniques are referred to as <a href="https://rlhfbook.com/c/12-direct-alignment.html">direct alignment algorithms</a>, and the most widely-used algorithm in this class is direct preference optimization (DPO) [6]. Not only do direct alignment algorithms like DPO forego the RM while optimizing the same training objective as RLHF, but they avoid RL training altogether. A comparison between RLHF and DPO is provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vj3B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vj3B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 424w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 848w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1272w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png" width="1456" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287850,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vj3B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 424w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 848w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1272w, https://substackcdn.com/image/fetch/$s_!vj3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cae8395-d481-49f9-b630-f5120b9abe7e_1642x576.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>The loss function used for training in DPO is presented below. As we can see, this loss function looks very similar to the loss function used by an RM. However, we are no longer predicting the reward with an RM. Instead, we directly use our policy to estimate a reward implicitly by using the probabilities of chosen and rejected completions assigned by the current policy and a reference policy. Intuitively, this loss is minimized when the log-ratio of the chosen completion is larger than that of the rejected completion. <em>DPO trains the current policy to assign higher (implicit) rewards to chosen responses relative to rejected responses.</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yQz2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png" width="1456" height="776" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:776,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:330834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yQz2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 424w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 848w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!yQz2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7107abbb-358e-48d4-a200-64ca6b5d1d72_2050x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DPO training loss (from [6])</figcaption></figure></div><p>DPO does not require the creation of an intermediate RM. However, the loss function is still derived from the Bradley-Terry model, and we are still learning an RM. The key distinction here is that the RM is learned implicitly rather than explicitly; hence the title of the DPO paper [6] <em>&#8220;Your Language Model is Secretly a Reward Model&#8221;.</em> We can directly obtain this implicit reward estimate from a DPO model similarly to an RM. For a full derivation and analysis of DPO; see <a href="https://rlhfbook.com/c/12-direct-alignment.html">here</a>. </p><h4>Why are RMs useful?</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qAWR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qAWR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 424w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 848w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 1272w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qAWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qAWR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 424w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 848w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 1272w, https://substackcdn.com/image/fetch/$s_!qAWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c593a0-6b23-4033-806e-0407afaebc6c_1552x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Orchestrating reference, reward, value and policy models in RL (<a href="https://arxiv.org/abs/2409.19256v2">source</a>)</figcaption></figure></div><p>Without a doubt, using an RM adds extra complexity to the LLM training process. First, we need to train a separate model over a large preference dataset, which already introduces added costs and complexity. From here, this model is used in an online fashion during RL training&#8212;<em>the RM scores completions generated by the current policy during training</em>. Given that the RM is also an LLM, this means that we have to separately host and run inference for a separate LLM during training, which can be difficult to efficiently orchestrate; see above. </p><blockquote><p><em>&#8220;We find that the neural RMs may suffer from reward hacking in the large-scale reinforcement learning process. Retraining the reward model needs additional training resources and it complicates the whole training pipeline.&#8221;</em> - from [7]</p></blockquote><p><strong>Reward hacking.</strong> Going further, RMs are subject to <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a>. The RM may spuriously assign high rewards to low quality completions or&#8212;<em>more generally</em>&#8212;be exploited in a way that allows the policy to receive high rewards without actually solving the desired task. Interestingly, reward hacking is a key limitation that prevents scaling up training with RLHF&#8212;<em>our policy will eventually find an exploit for the RM if we continue to train it for long enough</em>. In contrast, verifiable rewards are more difficult (though <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/#reward-hacking-examples-in-llm-tasks">not impossible</a>) to hack, allowing reasoning models to be trained more extensively (i.e., for more iterations) when using RLVR. </p><p><strong>Should we avoid RMs? </strong>Given the added costs and complexity of RMs, we might wonder: <em>Should we just avoid RMs altogether?</em> There is no definitive answer to this question. Impressive results have been achieved via RLVR, and we can still align models to human preferences with techniques like DPO that avoid an RM. <a href="https://www.interconnects.ai/p/the-dpo-debate">Many works</a> have argued whether there is (or is not) a performance gap between RLHF and DPO with differing results. Whether DPO is an effective RM-free preference tuning alternative is dependent on the use case, but the fact that there is a gap in performance between these techniques is generally accepted to be true. </p><blockquote><p><em>&#8220;The prevalence of RLHF stems from its efficacy at circumventing one of the greatest difficulties in integrating human values and preferences into language models: specifying an explicit reward&#8221;</em> - from [1]</p></blockquote><p><strong>The utility of RMs.</strong> Despite these findings, we should not lose sight of the fact that RMs are an incredibly important and powerful concept. One of the most difficult tasks in any form of RL training is specifying a reward. For LLMs, this task is especially difficult&#8212;<em>how do we explicitly define what constitutes a &#8220;good&#8221; response from an LLM?</em> Unfortunately, there is no single property or quality that can be used. The scope of valid model responses is nearly infinite.</p><p>With RMs, we circumvent the problem of specifying an explicit reward by distilling this process into a simpler task of asking humans to provide preference feedback (i.e., choosing among pairs of model responses); see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JBCh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JBCh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 424w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 848w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png" width="1456" height="1096" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2779525,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JBCh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 424w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 848w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!JBCh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9510abc4-232c-446b-a0b0-cf949efd9045_2046x1540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Interface for collecting human preference data (from [8])</figcaption></figure></div><p>Choosing the better model in a pair is a much simpler task compared to manually writing or evaluating individual model responses&#8212;<em>the human just has to provide a binary preference</em>. We can train an RM over this preference feedback, allowing us to derive a reward for RL training without ever making an explicit specification of the reward. Such an approach provides us with a flexible and effective approach for training LLMs with generic human feedback, which is transformational.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3gHA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3gHA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 424w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 848w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 1272w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3gHA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png" width="1456" height="338" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:338,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186573,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3gHA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 424w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 848w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 1272w, https://substackcdn.com/image/fetch/$s_!3gHA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb649fc49-a1df-4b76-9f6f-6408c1838ed9_2484x576.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Using an RM to perform Best-of-N sampling</figcaption></figure></div><p><strong>Other use cases for RMs.</strong> Beyond their use in RL training, RMs have a variety of other use cases. For example, RMs are commonly used for <em>i)</em> Best-of-N sampling and inference-time scaling (see above), <em>ii)</em> evaluation, <em>iii)</em> <a href="https://rlhfbook.com/c/10-rejection-sampling.html">rejection sampling</a>, <em>iv)</em> data filtering and much more! Despite these many use cases, we usually evaluate the performance of an RM based upon:</p><ul><li><p><em>Accuracy</em>: an RM&#8217;s ability to correctly identify the chosen response in a pair. </p></li><li><p><em>Downstream performance</em>: the performance of an LLM that is RL finetuned with a particular RM. </p></li><li><p><em>Inference-time scaling</em>: the performance boost achieved by using a particular RM in a Best-of-N sampling pipeline. </p></li></ul><h2>Reward Models in Practice</h2><p>Now that we have an understanding of RMs, we will study some recent papers on this topic. Specifically, we will focus on RewardBench [1], which is a benchmark for evaluating the effectiveness of RMs. This benchmark has been used to evaluate hundreds of different RMs across a variety of use cases, allowing us to derive useful takeaways for effectively training and using RMs in practice. Recently, a new version of RewardBench&#8212;<em>called RewardBench 2 [2]</em>&#8212;was also proposed, which modernized and expanded upon these findings.</p><h4><strong><a href="https://arxiv.org/abs/2403.13787">RewardBench: Evaluating Reward Models for Language Modeling</a> [1]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rmwn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rmwn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 424w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 848w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 1272w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rmwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png" width="1456" height="327" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:327,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rmwn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 424w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 848w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 1272w, https://substackcdn.com/image/fetch/$s_!Rmwn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6539e2fa-ef3b-404e-bb3a-934b3d51de05_2232x502.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Many practical choices are involved in training an RM; e.g., selecting the type of reward model to be used, choosing a policy to initialize the RM, setting the number of training epochs and more. However, most practical details of creating RMs are poorly documented. In [1], authors solve this issue by creating a standard benchmark&#8212;<em>called RewardBench</em>&#8212;for evaluating RMs. By evaluating a wide range of RMs on RewardBench, we can determine the impact of various practical choices on both RM performance and the performance of downstream LLMs trained with a given RM. From this analysis, we emerge with a better grasp of how RMs work and a set of best practices for creating high-quality RMs. </p><blockquote><p><em>&#8220;Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models.&#8221;</em> - from [8]</p></blockquote><p><strong>What is RewardBench? </strong>RewardBench is a framework and dataset for evaluating RMs. This open (i.e., <a href="https://huggingface.co/datasets/allenai/reward-bench">data</a> and <a href="https://github.com/allenai/reward-bench">evaluation code</a> are released) benchmark is used in [1] to chart the landscape of publicly-available RMs; see <a href="https://huggingface.co/spaces/allenai/reward-bench">here</a> for a leaderboard. By providing structured evaluations of RMs across many capabilities, RewardBench helps us to better understand how and why certain types of RMs work. </p><p><strong>Quantifying RM performance.</strong> RewardBench is comprised of prompts paired with two responses&#8212;<em>one chosen (preferred) and one rejected</em>. To evaluate an RM, we can simply test whether the RM is capable of identifying the preferred response. Specifically, this is done by computing the RM&#8217;s output for both the chosen and rejected responses, then comparing their scores. The &#8220;correct&#8221; behavior from an RM would be to assign a higher score to the chosen response; see below.  <em>We can also evaluate DPO models as an RM in this way using their implicit reward estimate.</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RDRA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RDRA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 424w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 848w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 1272w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RDRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png" width="1456" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:234070,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RDRA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 424w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 848w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 1272w, https://substackcdn.com/image/fetch/$s_!RDRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0de7117d-4544-4294-b907-8ce711fe597b_1610x704.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Scoring technique used by RewardBench (from [1])</figcaption></figure></div><p>This ability to correctly identify the preferred response can be easily captured via an accuracy metric that counts the number of correct RM outputs across a dataset of prompts with chosen and rejected responses. To compare different RMs, we can just compute this accuracy metric over a fixed dataset. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZOMs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZOMs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 424w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 848w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 1272w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZOMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png" width="1456" height="1170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1170,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:448105,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZOMs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 424w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 848w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 1272w, https://substackcdn.com/image/fetch/$s_!ZOMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ee8aee-08da-4942-bb7f-4d0d34f83d9d_1652x1328.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Data composition.</strong> Depending on the application, RMs are expected to capture a wide scope of different capabilities. To provide a comprehensive view of RM performance, RewardBench chooses to measure RM quality in several different domains (summarized in the table above):</p><ol><li><p><em>Chat</em>: tests the RM&#8217;s ability to distinguish correct chat responses.</p></li><li><p><em>Chat Hard</em>: tests the RM&#8217;s ability to identify trick questions and subtle differences between responses.</p></li><li><p><em>Safety</em>: tests refusals of unsafe prompts and the ability to avoid false refusals.</p></li><li><p><em>Reasoning</em>: tests ability to distinguish good coding and reasoning responses.</p></li><li><p><em>Prior datasets</em>: existing preference datasets (e.g., <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf">Anthropic&#8217;s HH dataset</a>, <a href="https://huggingface.co/datasets/stanfordnlp/SHP">Stanford Human Preferences dataset</a>, and <a href="https://huggingface.co/datasets/openai/summarize_from_feedback">OpenAI&#8217;s learning to summarize dataset</a>) are also included for consistency with prior work.</p></li></ol><p>Within each category of RewardBench, models are evaluated in terms of their accuracy. To generate an aggregate score per category, we take a weighted average of examples within that category. By evaluating RMs across several domains, we gain a more granular view of their performance&#8212;<em>certain categories of RMs oftentimes perform well in some domains but not others</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SJFG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SJFG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 424w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 848w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 1272w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SJFG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png" width="560" height="376.5151515151515" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:710,&quot;width&quot;:1056,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:187834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SJFG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 424w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 848w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 1272w, https://substackcdn.com/image/fetch/$s_!SJFG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c87f9e9-d002-4b7c-9621-20fc42387b1d_1056x710.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>To study the ability of RMs to capture subtle differences in response quality, authors also create difficult preference examples with small differences between chosen and rejected responses; see above for an example. Ideally, the RM should capture these subtle differences and assign credit to the preferable response in a stable manner. To ensure that <a href="https://arxiv.org/abs/2404.04475">length bias</a> does not skew results, authors ensure that all response pairs within RewardBench are of similar length.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N3hz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N3hz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 424w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 848w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N3hz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png" width="1456" height="1142" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1142,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:477035,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!N3hz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 424w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 848w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!N3hz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36a2c6e1-49ff-470c-b567-815aa23fac19_1622x1272.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>Analysis of RMs.</strong> The empirical performance the top-20 RMs&#8212;<em>of 50+ total RMs considered in [1]</em>&#8212;on RewardBench is outlined above. These RMs range from 400M to 70B parameters in size and are separated into small, medium, and large groups. We can summarize the key results for these models as follows:</p><ul><li><p>Performance is generally lower on Chat Hard and Reasoning subsets for all RMs, revealing a potential area of improvement. Only larger RMs perform consistently well on Chat Hard and Reasoning subsets. </p></li><li><p>Using a more powerful base model for the RM is helpful; e.g., Llama-3-based<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> RMs do well on RewardBench. Even subtle changes to the RM&#8217;s base model (e.g., tweaking the training data or strategy) can impact the RM. </p></li><li><p>Model size benefits performance for LLM-as-a-Judge-style RMs, but classifier-based RMs still perform noticeably better. </p></li><li><p>The scaling properties of RMs depend on the style of RM (e.g., classifier-based vs. DPO vs. LLM-as-a-Judge) and choice of base model. For example, the table below shows an example where LLaMA-2 DPO models improve in RM performance with scale, while classifier-based Qwen-1.5 RMs do not.</p></li><li><p>Results on prior evaluation datasets are not consistent with RewardBench, <em>revealing that results on these benchmarks may fail to comprehensively measure performance</em>. For example, DPO models&#8212;<em>when evaluated as an RM</em>&#8212;perform well on RewardBench but struggle on legacy benchmarks.</p></li></ul><blockquote><p><em>&#8220;Llama 2 shows a clear improvement with scaling across all sections of RewardBench, but Qwen 1.5 shows less monotonic improvement, likely due to out of distribution generalization challenges.&#8221;</em> - from [1]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-ReQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-ReQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 424w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 848w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 1272w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-ReQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png" width="1456" height="320" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:320,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:345953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-ReQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 424w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 848w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 1272w, https://substackcdn.com/image/fetch/$s_!-ReQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b65081a-c324-4961-9abe-fc368009ed97_2256x496.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2406.09279">Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback</a> [13]</h4><p>The next step after learning best practices for RMs is using these ideas to train a better LLM. In [13], authors apply lessons learned from RewardBench to deeply study RL finetuning. In particular, this paper focuses on making a comparison between the performance of DPO and PPO. We will not focus on the comparison between these techniques in this overview. However, this analysis also contains numerous practical lessons for creating RMs that maximize the downstream performance of the LLMs that they are used to train. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQMc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQMc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 424w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 848w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 1272w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EQMc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png" width="484" height="307.7972350230415" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:828,&quot;width&quot;:1302,&quot;resizeWidth&quot;:484,&quot;bytes&quot;:415867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EQMc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 424w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 848w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 1272w, https://substackcdn.com/image/fetch/$s_!EQMc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ad6659b-69df-41e3-b09b-80d4ab854cb8_1302x828.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p><strong>Data quality.</strong> The key experimental results presented in [13] are summarized above. The experiments begin by training a DPO model over the <a href="https://huggingface.co/datasets/Anthropic/hh-rlhf">HH RLHF</a> dataset from Anthropic, which is known to be an older and noisier dataset. This data boosts model performance, but a much bigger boost is seen from training on <a href="https://arxiv.org/abs/2310.01377">UltraFeedback</a>&#8212;<em>a modern, high-quality preference dataset</em>. When we switch to training with <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">PPO</a> (i.e., meaning that an RM is used) over the same data, we see a clear performance improvement, indicating that there is a downstream benefit in performance from using PPO with an explicit RM. However, we should notice that this benefit is much smaller relative to the impact of using better data!</p><p><strong>Larger RMs.</strong> Given the clear benefit of training with PPO, we might wonder if the LLM would also benefit from using a larger RM. This makes intuitive sense given <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">LLM scaling laws</a>, but observations in [13] are not this straightforward. </p><p>When scaling the RM from 13B to 70B parameters, downstream LLM performance remains stagnant, even for models that are initialized from the same SFT checkpoint. The only observable performance benefit occurs in the reasoning domain, indicating that the benefit of larger RMs is only clear in scenarios where the superior capabilities of a bigger model are useful or necessary. In other words, we need harder data for these larger RMs to be useful!</p><blockquote><p><em>&#8220;If we&#8217;re using a bigger reward model, we need to have data that is actually challenging the reward model.&#8221;</em> - <a href="https://www.youtube.com/watch?v=rDF7eFPeVto">source</a></p></blockquote><p><strong>Better data + bigger RM.</strong> Combining the lessons outlined above, authors in [13] collect a larger set of more difficult prompts&#8212;<em>emphasizing coding and reasoning tasks</em>&#8212;for RM training and test again whether larger RMs are beneficial. From these experiments, we see clear signals of improving RM quality. For example, these larger and better RMs yield a noticeable boost in performance when used for Best-of-N sampling as shown below. However, this improvement is much less clear when we look at both RewardBench and downstream performance. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZfRv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 424w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 848w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png" width="920" height="398" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123015,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZfRv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 424w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 848w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1272w, https://substackcdn.com/image/fetch/$s_!ZfRv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a83a07-9d28-406b-b1eb-16b99ce61594_920x398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p>Put simply, using a bigger and better RM does not directly imply that our LLM will be better when this RM is used for RL finetuning. In fact, we even see a performance <em>regression</em> in some domains when using larger RMs in [13]. Such findings make the evaluation of RMs very complicated&#8212;<em>just measuring the accuracy of an RM does not help us to understand how useful it will be</em>. </p><h4><strong><a href="https://arxiv.org/abs/2506.01937">RewardBench 2: Advancing Reward Model Evaluation</a> [2]</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QC2H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QC2H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 424w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 848w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 1272w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QC2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png" width="1456" height="616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:367108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QC2H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 424w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 848w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 1272w, https://substackcdn.com/image/fetch/$s_!QC2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>The recently-proposed RewardBench 2 [2]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> aims to make improvements over the initial RewardBench so that evaluating RMs is more useful and informative. This benchmark contains new data that covers a wider scope of skills that LLMs may possess, and RMs score ~20 points lower on this benchmark on average&#8212;<em>it is a much more challenging benchmark.</em> Despite still using an accuracy-based approach for evaluating RMs, RewardBench 2 has a clear correlation with downstream RM usage (e.g., for Best-of-N sampling) and provides useful lessons for determining whether a given RM will be effective when used for RL finetuning.</p><p><strong>Measuring RM performance.</strong> Instead of measuring the accuracy of the RM in differentiating between a chosen and rejected response, RewardBench 2 has four possible responses for each prompt&#8212;<em>one chosen and three rejected</em>. Among these responses, the RM must score the chosen response higher than all rejected responses; see below. This best-of-4 approach, which is still accuracy-based like the initial RewardBench, is more challenging and brings the performance of even strong RMs closer to that of the random baseline (i.e., 25% accuracy).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MobT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MobT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 424w, https://substackcdn.com/image/fetch/$s_!MobT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 848w, https://substackcdn.com/image/fetch/$s_!MobT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 1272w, https://substackcdn.com/image/fetch/$s_!MobT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MobT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png" width="528" height="203.7602523659306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:367,&quot;width&quot;:951,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:94836,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f2e98d3-b5b4-43cb-9824-63dd79b78244_1938x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!MobT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 424w, https://substackcdn.com/image/fetch/$s_!MobT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 848w, https://substackcdn.com/image/fetch/$s_!MobT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 1272w, https://substackcdn.com/image/fetch/$s_!MobT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb44b1dcf-1e55-4544-98b9-1c075aa60486_951x367.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Additionally, RewardBench 2 goes beyond accuracy-based evaluation by measuring LLM performance when:</p><ol><li><p>A certain RM is used for Best-of-N sampling.</p></li><li><p>A certain RM is used for RL training.</p></li></ol><p>As a result of this extended evaluation, we can both understand the quality of an RM, as well as observe the impact of this quality on downstream performance when used for inference-time scaling and RL training. Compared to alternative benchmarks, this evaluation process is quite comprehensive; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hGEQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hGEQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 424w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 848w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hGEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png" width="1456" height="759" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:281702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hGEQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 424w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 848w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 1272w, https://substackcdn.com/image/fetch/$s_!hGEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73c28087-a6c7-49d6-9682-39b53980caad_1818x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Data composition.</strong> RewardBench 2 focuses upon six different domains or capabilities when evaluating an RM. Three of these domains&#8212;<em>focus, math and safety</em>&#8212;overlap with existing benchmarks, while the three others&#8212;<em>factuality, precise instruction following and ties (i.e., testing the RM&#8217;s ability to handle equally-valid answers)</em>&#8212;present completely new challenges for RMs.</p><blockquote><p><em>&#8220;The benchmark was created with a majority of previously unused human prompts from the WildChat pipeline with extensive manual, programmatic, and LM-based filtering techniques.&#8221;</em> - from [2]</p></blockquote><p>RewardBench 2 uses unseen and human-written prompts, largely sampled from <a href="https://arxiv.org/abs/2405.01470">WildChat</a>&#8212;<em>a dataset of ChatGPT logs collected from real-world users</em>. Using unseen prompts is important due to the risk of data contamination. If our data is contaminated<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, the RM benchmark will be highly correlated with downstream performance due to the same data being used for both evaluations. To ensure correlation is legitimate, we must decontaminate the data and avoid leakage.</p><p>To accomplish this goal, authors in [2] adopt a multi-stage data curation pipeline that involves:</p><ul><li><p>Sourcing unseen, human-written prompts from WildChat.</p></li><li><p>Identifying the domain and quality of each prompt using manual inspection and classifiers; e.g., <a href="https://huggingface.co/princeton-nlp/QuRater-1.3B">QuRater</a> and <a href="https://huggingface.co/valpy/prompt-classification">domain classifiers</a>. </p></li><li><p>Performing extensive <a href="https://github.com/allenai/open-instruct/tree/main/decontamination">data decontamination</a> to ensure virtually zero overlap with downstream evaluation datasets.</p></li><li><p>Manually selecting the best prompts from those remaining.</p></li><li><p>Sampling completions for each of the prompts from diverse sources that accurately reflect the capabilities of recent LLMs.</p></li><li><p>Filtering completions based on correctness using a variety of signals; e.g., <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM-as-a-Judge</a>, automatic verifiers, <a href="https://cameronrwolfe.substack.com/i/120285767/solving-tough-problems-with-llms">majority voting</a> and more. </p></li></ul><p>Details of the final dataset created for RewardBench 2 and how each component of this dataset is created are summarized below. To derive the final benchmark score, we take an unweighted average of an RM&#8217;s performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> in each domain. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R9d7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R9d7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 424w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 848w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 1272w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R9d7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png" width="1456" height="652" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:270118,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R9d7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 424w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 848w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 1272w, https://substackcdn.com/image/fetch/$s_!R9d7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7137cb51-ed3e-4c08-9ec2-d714c9c39c87_1812x812.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>RewardBench 2 performance.</strong> RewardBench 2 is used to evaluate &gt;100 different RMs in [2]. The performance of the top-20 models is provided below. In addition to scores being lower on this new benchmark, we see that foundation model-based (e.g., Gemini and Claude) LLM-as-a-Judge models perform very well. This observation&#8212;<em>though in line with the improving capabilities of foundation models</em>&#8212;is in stark contrast to observations on the initial RewardBench, where LLM-as-a-Judge models performed consistently worse than classifier-based RMs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tLkg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tLkg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 424w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 848w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 1272w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tLkg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png" width="684" height="605.9419729206963" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/adf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:1034,&quot;resizeWidth&quot;:684,&quot;bytes&quot;:470383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/166169560?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tLkg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 424w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 848w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 1272w, https://substackcdn.com/image/fetch/$s_!tLkg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fadf904de-02a0-4f50-bb61-2ec5e8995daa_1034x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Authors in [2] also train a variety of their own RMs using various base models and hyperparameter settings, finding that the base model used to initialize the RM clearly impacts the RM&#8212;<em>skills present in the base model carry over to the RM</em>. Factors like the model family, training data mixture, style of training used or the stage of post-training from which the RM is initialized clearly influence the performance of the RM across domains. Additionally, authors in [2] find that training the RM for two epochs&#8212;<em>instead of the usual one epoch</em>&#8212;can be beneficial. </p><p><strong>Downstream performance.</strong> Finally, the analysis of RMs in [2] is extended to consider inference-time scaling and RL training. Unsurprisingly, performance on RewardBench 2 is highly correlated with Best-of-N sampling&#8212;<em>accurate reward models are capable of identifying the best completions within a candidate set</em>. </p><p>Although correlation of RewardBench 2 with downstream performance is less clear, authors in [2] do identify one key factor that influences the success of an RM when used for RL training: <em>whether the RM and the policy being trained are derived from the same model lineage</em>. In other words, we see the following:</p><ul><li><p>High scores on RM benchmarks are necessary (but not sufficient) for high downstream performance with RL training&#8212;<em>downstream performance quickly saturates with improving RM quality</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a><em>.</em></p></li><li><p>A misalignment between the policy model for RL training and the RM&#8217;s base model&#8212;<em>or between the distribution of prompts used for RL training versus training the RM</em>&#8212;causes a huge drop in downstream performance. </p></li></ul><p>As a result of these findings, authors in [2] conclude their work by leaving us with a final recommendation for training RMs summarized in the below quote.</p><div class="pullquote"><p><em>&#8220;These findings warrant caution when using reward model evaluation benchmarks: While the benchmark can be used as a guide for picking a reward model off-the-shelf to be used in some settings like best-of-N sampling&#8230; for policy-gradient algorithms like PPO, the results of the benchmark should be considered in the context of one&#8217;s training setup. Instead of simply taking the top model on RewardBench 2, we show that one should take the recipe for that model and integrate it into their specific workflow rather than the checkpoint itself.&#8221;</em> - from [2]</p></div><h2>Conclusion</h2><p>Reward models are among the most powerful and flexible tools in LLM research. As we have learned, various styles of RMs exist beyond the standard classifier-based RM, and creating an effective RM is the result of countless practical considerations. Additionally, the correct choices for creating an RM are application-dependent; e.g., Best-of-N sampling versus RL finetuning. In this overview, we have built a foundational understanding of RMs, ranging from basic statistical models like Bradley-Terry to training large-scale LLM-based RMs. As more focus is dedicated to large-scale RL training for LLMs, research on RMs will rapidly advance and play an increasingly pivotal role in AI.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Lambert, Nathan, et al. "Rewardbench: Evaluating reward models for language modeling." <em>arXiv preprint arXiv:2403.13787</em> (2024).</p><p>[2] Malik, Saumya, et al. "RewardBench 2: Advancing Reward Model Evaluation." <em>arXiv preprint arXiv:2506.01937</em> (2025).</p><p>[3] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." <em>Advances in neural information processing systems</em> 35 (2022): 27730-27744.</p><p>[4] Lambert, Nathan, et al. "T\" ulu 3: Pushing frontiers in open language model post-training." <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[5] OpenAI et al. &#8220;Learning to Reason with LLMs.&#8221; <em>https://openai.com/index/learning-to-reason-with-llms/</em> (2024).</p><p>[6] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 53728-53741.</p><p>[7] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[8] Bai, Yuntao, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback." <em>arXiv preprint arXiv:2204.05862</em> (2022).</p><p>[9] Zheng, Lianmin, et al. "Judging llm-as-a-judge with mt-bench and chatbot arena." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 46595-46623.</p><p>[10] Bai, Yuntao, et al. "Constitutional ai: Harmlessness from ai feedback." <em>arXiv preprint arXiv:2212.08073</em> (2022).</p><p>[11] Zheng, Lianmin, et al. "Judging llm-as-a-judge with mt-bench and chatbot arena." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 46595-46623.</p><p>[12] Cobbe, Karl, et al. "Training verifiers to solve math word problems." <em>arXiv preprint arXiv:2110.14168</em> (2021).</p><p>[13] Ivison, Hamish, et al. "Unpacking dpo and ppo: Disentangling best practices for learning from preference feedback." <em>Advances in neural information processing systems</em> 37 (2024): 36602-36633.</p><p>[14] Stiennon, Nisan, et al. "Learning to summarize with human feedback." <em>Advances in neural information processing systems</em> 33 (2020): 3008-3021.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Sometimes, we may have more than two candidate completions per prompt. In this case, preferences are captured by ranking completions in terms of their preference. However, binary preference data is more commonly used in recent research. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Here, we use the term policy to refer to an LLM that we are currently training. This is standard terminology used within reinforcement learning; see <a href="https://cameronrwolfe.substack.com/p/basics-of-reinforcement-learning">here</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In practice, these sequences will be both the prompt and the completion for all chosen and rejected sequences. Here, we just have flat textual sequences with no clear prompt or completion structure for simplicity. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>However, there are many variants of policy gradient algorithms used for training LLMs (e.g., <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">PPO</a>, <a href="https://arxiv.org/abs/2402.14740">REINFORCE</a>, <a href="https://arxiv.org/abs/2402.03300">GRPO</a> and many more), each of which have their benefits.   </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>At the time of writing, Llama-3 was the best open-source model that was available.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This benchmarks comes with <a href="https://huggingface.co/datasets/allenai/reward-bench-2">data</a>, a <a href="https://huggingface.co/spaces/allenai/reward-bench">leaderboard</a>, and an extensive technical report!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Data contamination refers to the idea of data being present in our training set that will later be used to evaluate the same model. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Performance is measured in terms of accuracy for all domains except ties, where we check for the correct margin between correct and incorrect examples.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This is in line with results in [13], where we see that RMs of various strengths all perform relatively well when used for RL training. </p></div></div>]]></content:encoded></item><item><title><![CDATA[AI Agents from First Principles]]></title><description><![CDATA[Understanding AI agents by building upon the most basic concepts of LLMs...]]></description><link>https://cameronrwolfe.substack.com/p/ai-agents</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/ai-agents</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 09 Jun 2025 09:33:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cee4a772-78a7-41b7-8cf1-4da233376ea6_2002x1122.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVW3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVW3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 424w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 848w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVW3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1153493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HVW3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 424w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 848w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!HVW3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1cd321e-1b6e-45db-89d5-4d1eaffa039b_2000x1122.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1] and <a href="https://modelcontextprotocol.io/introduction">source</a>)</figcaption></figure></div><p>The capabilities of large language models (LLMs) are advancing rapidly. As LLMs become more capable, we can use them to create higher-level systems that solve increasingly complex problems, interact with external environments and operate over longer time horizons&#8212;<em>these are referred to as AI agent systems</em>. AI agents are a popular topic, but there is considerable confusion regarding the definition and capabilities of these agents. In this overview, we will build an understanding of AI agents from first principles. Starting with a standard text-to-text LLM, we will explore how functionalities like tool usage, reasoning and more can enhance a standard LLM, leading to the creation of complex, autonomous systems.</p><h2>LLMs and their Capabilities</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dPSO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dPSO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 424w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 848w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 1272w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dPSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png" width="728" height="230" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:149316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dPSO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 424w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 848w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 1272w, https://substackcdn.com/image/fetch/$s_!dPSO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b5bd8d4-a75a-4a61-a95c-f2e3363fef79_2216x700.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">The input-output signature of a standard LLM</figcaption></figure></div><p>The functionality of an LLM is depicted above. Given a textual prompt, the LLM generates a textual response. This functionality is easy to understand and can be generalized to solve nearly any problem. In many ways, the generality of an LLM is one of its biggest strengths. In this section, we will outline how new capabilities&#8212;<em>such as reasoning or interacting with external APIs&#8212;</em>can be integrated into an LLM by taking advantage of this text-to-text structure. As we will soon learn, advanced capabilities of modern AI agents are largely built upon this basic functionality.</p><h4>Tool Usage</h4><p>As LLMs started to become more capable, teaching them how to integrate with and use external tools quickly became a popular topic in AI research. Examples of useful tools that can be integrated with an LLM include calculators, calendars, search engines, code interpreters and more. <em>Why is this approach so popular?</em> Put simply, LLMs are (obviously) not the best tool for solving all tasks. In many cases, simpler and more reliable tools are available; e.g., calculators for performing basic arithmetic or search engines for getting up-to-date factual info on a certain topic. Given that LLMs excel in planning and orchestration, however, we can easily teach them how to use these tools as part of their problem solving process!</p><div class="pullquote"><p>The fundamental idea behind tool-use LLMs is endowing an LLM with the ability to delegate sub-tasks or components of a problem to a more specialized or robust tool. The LLM serves as the &#8220;brain&#8221; that orchestrates various specialized tools together.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kSln!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kSln!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 424w, https://substackcdn.com/image/fetch/$s_!kSln!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 848w, https://substackcdn.com/image/fetch/$s_!kSln!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 1272w, https://substackcdn.com/image/fetch/$s_!kSln!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kSln!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png" width="1456" height="654" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:561568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kSln!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 424w, https://substackcdn.com/image/fetch/$s_!kSln!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 848w, https://substackcdn.com/image/fetch/$s_!kSln!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 1272w, https://substackcdn.com/image/fetch/$s_!kSln!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeb1e059-821d-4e77-98af-2129d4a8766a_1742x782.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Tool usage examples (from [2, 3])</figcaption></figure></div><p><strong>Finetuning for tool usage.</strong> Early work on tool use&#8212;<em>e.g., LaMDA [2] or the Toolformer [3] (depicted above)</em>&#8212;used targeted finetuning to teach an LLM how to leverage a fixed set of tools. We simply curate training examples where a function call to some tool is directly inserted into the LLM&#8217;s token stream; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N4MY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N4MY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 424w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 848w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1272w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png" width="1456" height="214" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:214,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151992,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N4MY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 424w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 848w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1272w, https://substackcdn.com/image/fetch/$s_!N4MY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b33434-e0d9-4211-847a-ff89508dfa37_2382x350.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Structure of a tool call</figcaption></figure></div><p>During training, these tool calls are treated similarly to any other token&#8212;<em>they are just part of the textual sequence</em>! When a call to a tool is generated by the LLM at inference time, we handle it as follows:</p><ol><li><p>Stop generating tokens.</p></li><li><p>Parse the tool call (i.e., determine the tool being used and its parameters).</p></li><li><p>Make a call to the tool with these parameters.</p></li><li><p>Add the response from the tool to the LLM&#8217;s token stream.</p></li><li><p>Continue generating tokens. </p></li></ol><p>The tool call can be handled in real-time as the LLM generates its output, and the information returned by the tool is added directly into the model&#8217;s context!</p><p><strong>Prompt-based tool usage.</strong> Teaching LLMs to call tools via finetuning requires curating&#8212;<em>usually with human annotation</em>&#8212;a large training dataset. As LLM capabilities improved, later work instead emphasized in-context learning-based approaches for tool usage. <em>Why would we finetune a language model when we can simply explain the tools that are available in the model&#8217;s prompt?</em> </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bcHR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bcHR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 424w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 848w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bcHR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png" width="1456" height="935" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:935,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1918060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bcHR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 424w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 848w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 1272w, https://substackcdn.com/image/fetch/$s_!bcHR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef039977-ac5d-4a83-94dd-944ccae42847_1698x1090.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4, 5])</figcaption></figure></div><p>Prompt-based tool usage requires less human effort, allowing us to drastically increase the number of tools to which LLMs have access. For example, later work in this space integrates LLMs with hundreds [4] or even thousands [5] of tools; see above. To do this, we treat each tool as a generic API and provide the schema for relevant APIs as context in the model&#8217;s prompt. This approach enables LLMs to be integrated with arbitrary APIs on the internet using a standardized structure, which makes countless applications possible; e.g., finding information, calling other ML models, booking a vacation, handling your calendar and much more.</p><blockquote><p><em>&#8220;Today, we're open-sourcing the <a href="https://modelcontextprotocol.io/">Model Context Protocol</a> (MCP), a new standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments. Its aim is to help frontier models produce better, more relevant responses.&#8221;</em> - from [15]</p></blockquote><p><strong>Model context protocol (MCP)</strong>&#8212;<em><a href="https://www.anthropic.com/news/model-context-protocol">proposed by Anthropic</a></em>&#8212;is a popular framework that extends upon the idea of allowing LLMs to interact with arbitrary tools. Put simply, MCP standardizes the format used by external systems to provide context into the prompt of an LLM. To solve complex problems, <em>LLMs will need to integrate with a progressively larger set of external tools over time</em>. To streamline this process, MCP proposes a standard format for these integrations and allows developers to create pre-built integrations, called MCP servers, that can be used by any LLM to connect with a variety of custom data sources; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EpEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EpEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 424w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 848w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EpEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png" width="1456" height="1002" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1002,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136265,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EpEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 424w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 848w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 1272w, https://substackcdn.com/image/fetch/$s_!EpEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc4f982-4113-4f06-ad44-7d41235e6c4e_1610x1108.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Depiction of the general architecture for MCP (<a href="https://modelcontextprotocol.io/introduction">source</a>)</figcaption></figure></div><p>For those who are interested in digging deeper into tool usage, please see the following series of overview on this topic:</p><ul><li><p>Finetuning LLMs to use tools [<a href="https://cameronrwolfe.substack.com/p/teaching-language-models-to-use-tools">link</a>]</p></li><li><p>Prompt-based tool usage [<a href="https://cameronrwolfe.substack.com/p/language-models-and-friends-gorilla">link</a>]</p></li><li><p>Integrating LLMs with code interpreters [<a href="https://cameronrwolfe.substack.com/p/program-aided-language-models">link</a>]</p></li><li><p>Allowing LLMs to create their own tools [<a href="https://cameronrwolfe.substack.com/p/can-language-models-make-their-own">link</a>]</p></li></ul><p><strong>Limitations of tool usage.</strong> Despite the power of the tool usage paradigm, the capabilities of tool-use LLMs are limited by their reasoning capabilities. To effectively leverage tools, our LLM must be able to:</p><ul><li><p>Decompose complex problems into smaller sub-tasks.</p></li><li><p>Determine what tools should be used to solve a problem.</p></li><li><p>Reliably craft calls to relevant tools with the correct format. </p></li></ul><p>Complex tool usage requires the LLM to be an effective orchestrator, which is very dependent upon the model&#8217;s reasoning capabilities and overall reliability.</p><h4>Reasoning Models</h4><p>Given the relationship between agency and reasoning, reasoning capabilities have been a core focus of LLM research for several years. For a more in-depth overview of current reasoning research, please see the overview below. However, we will briefly cover the key ideas behind reasoning models here for completeness.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c1c0f66e-6bc6-481c-b54f-adc1779afaa1&quot;,&quot;caption&quot;:&quot;Reasoning models approach problem solving differently than standard LLMs. In particular, they spend a variable amount of time &#8220;thinking&#8221; prior to providing an answer. This post outlines key concepts behind reasoning models, how they are trained and best practices for using them. &quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Demystifying Reasoning Models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-18T10:33:55.513Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23d9c87e-b238-4fdd-996e-4ed4465b9931_2334x1282.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/demystifying-reasoning-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153722335,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:216,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>CoT prompting.</strong> When LLMs first became popular, one of the most common criticisms of these models was that they could not perform complex reasoning. However, research on <a href="https://cameronrwolfe.substack.com/p/chain-of-thought-prompting-for-llms">Chain of Thought (CoT) prompting</a> [6, 7] revealed that vanilla LLMs are better at reasoning than we initially realized. The idea behind CoT prompting is simple. Instead of directly prompting an LLM for output, we ask it to generate a rationale or explanation prior to its final output; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NPw_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NPw_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 424w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 848w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1272w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NPw_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 424w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 848w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1272w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Interestingly, this approach drastically improves the performance of vanilla LLMs on reasoning tasks, indicating that LLMs are capable of complex reasoning&#8212;<em>to a reasonable extent</em>&#8212;if we can find the correct approach to elicit these capabilities.</p><p><strong>Reasoning models.</strong> CoT prompting is incredibly effective and is a core part of all modern LLMs; e.g., ChatGPT usually outputs a CoT with its answers by default. However, this approach to reasoning is also somewhat naive. The entire reasoning process revolves around the CoT generated by the LLM and there is no dynamic adaptation based on the complexity of the problem being solved. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJH6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" width="476" height="283.7692307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:476,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://openai.com/index/learning-to-reason-with-llms/">source</a>)</figcaption></figure></div><p>To solve these issues, recent research has introduced new training strategies to create LLMs that specialize in reasoning (i.e., reasoning models). These models approach problem solving differently compared to standard LLMs&#8212;<em>they spend a variable amount of time &#8220;thinking&#8221; prior to providing an answer to a question</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" width="1456" height="359" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:359,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>The thoughts of a reasoning model are just standard chains of thought, but a CoT from a reasoning model is much longer than that of a standard LLM (i.e., can be several thousands of tokens), tends to exhibit complex reasoning behavior (e.g., backtracking and self-refinement) and can dynamically adapt based on the difficulty of the problem being solved&#8212;<em>harder problems warrant a longer CoT</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mzxO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" width="1456" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The key advancement that made reasoning models possible was large-scale post-training with <a href="https://cameronrwolfe.substack.com/i/153722335/reinforcement-learning-with-verifiable-rewards">reinforcement learning from verifiable rewards (RLVR)</a>; see above. If we have a dataset of ground truth solutions to verifiable problems (e.g., Math or coding), we can simply check whether the answer generated by the LLM is correct and use this signal to train a model with RL. During this training process, reasoning models naturally learn how to generate long chains of thought to solve verifiable reasoning problems via RL-powered self-evolution. </p><blockquote><p><em>&#8220;We explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process.&#8221; </em>- from [8]</p></blockquote><p><strong>Reasoning trajectories.</strong> In summary, reasoning models, which are trained via large-scale post-training with RLVR, change the behavior of a standard LLM as shown below. Instead of directly generating output, the reasoning model first generates an arbitrarily long CoT<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> that decomposes and solves the reasoning task&#8212;<em>this is the &#8220;thinking&#8221; process</em>. We can change how much the model thinks by controlling the length of this reasoning trace; e.g., the <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o-series</a> of reasoning models from OpenAI provide low, medium and high levels of reasoning effort.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iThv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iThv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 424w, https://substackcdn.com/image/fetch/$s_!iThv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 848w, https://substackcdn.com/image/fetch/$s_!iThv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 1272w, https://substackcdn.com/image/fetch/$s_!iThv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iThv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png" width="1456" height="419" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:419,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iThv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 424w, https://substackcdn.com/image/fetch/$s_!iThv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 848w, https://substackcdn.com/image/fetch/$s_!iThv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 1272w, https://substackcdn.com/image/fetch/$s_!iThv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff8c2a7d-e62b-4ed7-bb2c-99de79b0ad96_2390x688.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The input-output signature of a reasoning model</figcaption></figure></div><p>Although the model still generates a single output given a prompt, the reasoning trajectory implicitly demonstrates a variety of advanced behaviors; e.g., planning, backtracking, monitoring, evaluation and more. For examples of these reasoning trajectories and their properties, see the <a href="https://www.primeintellect.ai/blog/synthetic-1-release">Synthetic-1 dataset</a>, which contains over 2M examples of reasoning traces generated by <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a>.</p><p><strong>Reasoning + agents.</strong> Given recent advancements in reasoning,<em> </em>a sufficiently capable LLM that can plan and effectively reason over its instructions should be able to decompose a problem, solve each component of the problem and arrive at a final solution itself. Providing LLMs with more autonomy and relying on their capabilities&#8212;<em>rather than human intervention</em>&#8212;to solve complex problems is a key idea behind agent systems. To make the idea of an agent more clear, let&#8217;s now discuss a framework that can be used to design these types of systems. </p><h2><strong><a href="https://arxiv.org/abs/2210.03629">The ReAct Framework</a> [1]</strong></h2><blockquote><p><em>&#8220;It is becoming more evident that with the help of LLMs, language as a fundamental cognitive mechanism will play a critical role in interaction and decision making.&#8221;</em> - from [1]</p></blockquote><p>ReAct [1]&#8212;<em>short for <strong>RE</strong>asoning and <strong>ACT</strong>ion</em>&#8212;is one of the first general frameworks to be proposed for autonomously decomposing and solving complex problems with an LLM agent. We can think of ReAct as a sequential, multi-step problem-solving process powered by an LLM at its core. At each time step <code>t</code>, the LLM incorporates any feedback that is available and considers the current state of the problem it is trying to solve, allowing it to effectively reason over and select the best possible course of action for the future. Given that (nearly) any LLM system can be modeled sequentially, ReAct is a generic and powerful framework. </p><h4>Creating a Framework for Agents</h4><p>At a particular time step <code>t</code>, our agent is given an observation from its environment <code>o_t</code>. Based upon this observation, our agent will decide to take some action <code>a_t</code>, which may be intermediate&#8212;<em>such as searching the web to find data that is needed to solve a problem</em>&#8212;or terminal (i.e., the final action that &#8220;solves&#8221; the problem of interest). We define the function that our agent uses to produce this action as a policy <code>&#960;</code><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The policy takes the context&#8212;<em>a concatenated list of prior actions and observations from the agent</em>&#8212;as input and predicts the next action <code>a_t</code> as output, either deterministically or stochastically<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. As depicted below, this loop of observations and actions continues until our agent outputs a terminal action.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNLJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNLJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 424w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 848w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 1272w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png" width="1456" height="519" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XNLJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 424w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 848w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 1272w, https://substackcdn.com/image/fetch/$s_!XNLJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c731b2f-9a36-45a8-b016-89691850dc88_2134x760.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The observation-action loop for agents</figcaption></figure></div><p>ReAct [1] makes one key modification to the observation-action loop shown above. The space of potential actions that can be outputted by the policy <code>A</code> typically includes the set of intermediate and terminal actions that can be taken by the agent; e.g., searching for data on the web or outputting a final solution to a problem. However, ReAct expands the action space to include language, allowing the agent to produce a textual output as an action instead of taking a traditional action. In other words, <em>the agent can choose to &#8220;think&#8221;</em>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7PpC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7PpC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 424w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 848w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 1272w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7PpC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png" width="1456" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e159d88e-5792-4106-bff8-044361ced6fb_2136x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191215,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7PpC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 424w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 848w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 1272w, https://substackcdn.com/image/fetch/$s_!7PpC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe159d88e-5792-4106-bff8-044361ced6fb_2136x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The ReAct framework</figcaption></figure></div><p>Formally, we can define a thought as a special kind of action as shown above. As one might infer from the name of the framework, the primary motivation behind ReAct is finding a balance between reasoning and action. Similarly to a human, the agent should be able to think and plan the actions that it takes in an environment&#8212;<em>reasoning and action have a symbiotic relationship</em>. </p><blockquote><p><em>&#8220;Reasoning traces help the model induce, track, and update action plans, while actions allow it to interface with and gather additional information from external sources such as knowledge bases or environments.&#8221;</em> - from [1]</p></blockquote><h4>How do agents think?</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAb3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAb3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 424w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 848w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 1272w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CAb3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png" width="1456" height="757" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CAb3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 424w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 848w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 1272w, https://substackcdn.com/image/fetch/$s_!CAb3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8a68b2e-7283-4b8d-90ab-399d58bee163_1920x998.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example action space for an agent</figcaption></figure></div><p>The traditional action space for an agent is discrete and&#8212;<em>in most cases</em>&#8212;relatively small. For example, an agent specialized in question-answering could have the following options for actions (depicted above):</p><ul><li><p>Perform a Google search to retrieve relevant webpages.</p></li><li><p>Grab relevant information from a particular webpage.</p></li><li><p>Return a final answer.</p></li></ul><p>There are only so many actions that this agent can take while working towards a solution. In contrast, the space of language is virtually unlimited. As a result, the ReAct framework requires the use of a strong language model as its policy. In order to produce useful thoughts that benefit performance, the LLM backend of our agent system must possess advanced reasoning and planning capabilities!</p><blockquote><p><em>&#8220;Learning in this augmented action space is difficult and requires strong language priors&#8230; we mainly focus on the setup where a frozen large language model&#8230; is prompted with few-shot in-context examples to generate both domain-specific actions and free-form language thoughts for task solving.&#8221;</em> - from [1]</p></blockquote><p><strong>Thought patterns.</strong> Common examples of useful thought patterns that can be produced by an agent include decomposing tasks, creating itemized action plans, tracking progress toward a final solution, or simply outputting information&#8212;<em>from the implicit knowledge base of the LLM</em>&#8212;that may be relevant to solving a problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bj0c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bj0c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 424w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 848w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 1272w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bj0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png" width="1456" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:574519,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bj0c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 424w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 848w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 1272w, https://substackcdn.com/image/fetch/$s_!Bj0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F527e2e86-7b4f-4919-b1f7-cbdf92e99148_1576x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Agents use their thinking ability to explicitly describe how a problem should be solved and then execute&#8212;<em>and monitor the execution of</em>&#8212;this plan. In both of the examples above, the agent explicitly writes out the next steps that it needs to perform when solving a problem; e.g., <em>&#8220;Next, I need to&#8230;&#8221;</em> or <em>&#8220;I need to search&#8230;&#8221;</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vp5a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vp5a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 424w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 848w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vp5a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png" width="518" height="432.5337159253946" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:1394,&quot;resizeWidth&quot;:518,&quot;bytes&quot;:395236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vp5a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 424w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 848w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!vp5a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6c6a76b-e84c-4a9c-b5a8-79cdfe7d2a4d_1394x1164.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A few-shot example given to a ReAct agent (from [1])</figcaption></figure></div><p>In most cases, the thoughts produced by an agent&#8212;<em>commonly referred to as a problem or task-solving trajectory</em>&#8212;mimic that of a human trying to solve a problem. In fact, experiments with ReAct in [1] guide the agent&#8217;s approach to a problem by providing <a href="https://cameronrwolfe.substack.com/i/117151147/few-shot-learning">in-context examples</a> of task-solving trajectories (i.e., actions, thoughts and observations) used by humans to solve similar problems. Agents prompted in this fashion are likely adopt a human-like reasoning process.</p><blockquote><p><em>&#8220;We let the language model decide the asynchronous occurrence of thoughts and actions for itself.&#8221;</em> - from [1]</p></blockquote><p><strong>When should the agent think?</strong> Depending on the problem we are solving, the ReAct framework can be setup differently. For reasoning heavy tasks, thoughts are typically interleaved with actions&#8212;<em>we can hard-code the agent such that it produces a single thought before every action</em>. However, the agent can also be given the ability to determine for itself whether thinking is necessary. For tasks that require a lot of actions (i.e., decision-making tasks), the agent may choose to use thoughts more sparsely within its problem-solving trajectory. </p><h4>Concrete Use Cases</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5QLQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5QLQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 424w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 848w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5QLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png" width="1456" height="1420" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1420,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:750455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5QLQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 424w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 848w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!5QLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F999ee665-a74a-47d3-8498-b3c7db099edc_1606x1566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Two use cases are considered for applications of the ReAct framework in [1]:</p><ol><li><p><em>Knowledge-intensive reasoning</em>: using ReAct for question answering and fact verification tasks (e.g., <a href="https://huggingface.co/datasets/hotpotqa/hotpot_qa">HotpotQA</a> and <a href="https://huggingface.co/datasets/fever/fever">FEVER</a>).</p></li><li><p><em>Decision making</em>: applying ReAct to interactive (language-based) decision-making tasks; e.g., <a href="https://alfworld.github.io/">ALFWorld</a> for navigating simulated households or <a href="https://webshop-pnlp.github.io/">WebShop</a> for completing autonomous shopping tasks.</p></li></ol><p>Examples of ReAct being applied in each use case are provided above. The ReAct framework is implemented with an LLM&#8212;<em><a href="https://cameronrwolfe.substack.com/p/palm-efficiently-training-massive">PaLM-540B</a> in particular</em>&#8212;that is prompted with several in-context examples that outline the problem solving process. The LLM&#8217;s prompt provides human-crafted thought-action-observation trajectories that are followed to arrive at a final solution to a question. </p><blockquote><p><em>&#8220;By interacting with a Wikipedia API, ReAct is able to retrieve information to support reasoning, while also use reasoning to target what to retrieve next, demonstrating a synergy of reasoning and acting.&#8221;</em> - from [1]</p></blockquote><p><strong>Knowledge-intensive reasoning.</strong> In this domain, the LLM agent is provided only a question (and optionally a claim) as input. To answer a question or evaluate the correctness of a claim, the LLM must either rely upon its internal knowledge base or retrieve necessary information from an external environment. Specifically, the agent&#8217;s action space is outlined below. Here, we see that authors in [1] expose basic information retrieval functionality&#8212;<em>reflective of how a typical human would lookup information on Wikipedia</em>&#8212;to the LLM agent via its action space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8CG-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8CG-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 424w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 848w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 1272w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8CG-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png" width="618" height="244.05906593406593" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/803f9b98-970a-4145-b03b-2855a904c047_1868x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1456,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:169804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8CG-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 424w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 848w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 1272w, https://substackcdn.com/image/fetch/$s_!8CG-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F803f9b98-970a-4145-b03b-2855a904c047_1868x738.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">ReAct action space for knowledge-intensive reasoning</figcaption></figure></div><p>To teach the LLM how to properly think and act in this environment, humans manually craft ReAct-style problem-solving trajectories for each application dataset, including six trajectories for HotpotQA and three trajectories for FEVER. These trajectories are then provided as in-context examples in the prompt provided to the LLM agent, allowing the trajectories to be used as a demonstration of how to solve a problem. A full example of a prompt&#8212;<em>with three in-context examples</em>&#8212;used for a ReAct agent on HotpotQA is provided below. </p><pre><code><strong>Question:</strong> Musician and satirist Allie Goertz wrote a song about the "The Simpsons" character Milhouse, who Matt Groening named after who?
<strong>Thought 1:</strong> The question simplifies to "The Simpsons" character Milhouse is named after who. I only need to search Milhouse and find who it is named after.
<strong>Action 1:</strong> Search[Milhouse]
<strong>Observation 1:</strong> Milhouse Mussolini Van Houten is a recurring character in the Fox animated television series The Simpsons voiced by Pamela Hayden and created by Matt Groening.
<strong>Thought 2:</strong> The paragraph does not tell who Milhouse is named after, maybe I can look up "named after".
<strong>Action 2:</strong> Lookup[named after]
<strong>Observation 2:</strong> (Result 1 / 1) Milhouse was named after U.S. president Richard Nixon, whose middle name was Milhous.
<strong>Thought 3:</strong> Milhouse was named after U.S. president Richard Nixon, so the answer is Richard Nixon.
<strong>Action 3:</strong> Finish[Richard Nixon]

<strong>Question:</strong> Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
<strong>Thought 1:</strong> I need to search Adam Clayton Powell and The Saimaa Gesture, and find which documentary is about Finnish rock groups.
<strong>Action 1:</strong> Search[Adam Clayton Powell]
<strong>Observation 1:</strong> Could not find [Adam Clayton Powell]. Similar: [&#8217;Adam Clayton Powell III&#8217;, &#8217;Seventh Avenue (Manhattan)&#8217;, &#8217;Adam Clayton Powell Jr. State Office Building&#8217;, &#8217;Isabel Washington Powell&#8217;, &#8217;Adam Powell&#8217;, &#8217;Adam Clayton Powell (film)&#8217;, &#8217;Giancarlo Esposito&#8217;].
<strong>Thought 2:</strong> To find the documentary, I can search Adam Clayton Powell (film).
<strong>Action 2:</strong> Search[Adam Clayton Powell (film)]
<strong>Observation 2:</strong> Adam Clayton Powell is a 1989 American documentary film directed by Richard Kilberg. The film is about the rise and fall of influential African-American politician Adam Clayton Powell Jr.[3][4] It was later aired as part of the PBS series The American Experience.
<strong>Thought 3:</strong> Adam Clayton Powell (film) is a documentary about an African-American politician, not Finnish rock groups. So the documentary about Finnish rock groups must instead be The Saimaa Gesture.
<strong>Action 3:</strong> Finish[The Saimaa Gesture]

<strong>Question:</strong> What profession does Nicholas Ray and Elia Kazan have in common?
<strong>Thought 1:</strong> I need to search Nicholas Ray and Elia Kazan, find their professions, then find the profession they have in common.
<strong>Action 1:</strong> Search[Nicholas Ray]
<strong>Observation 1:</strong> Nicholas Ray (born Raymond Nicholas Kienzle Jr., August 7, 1911 - June 16, 1979) was an American film director, screenwriter, and actor best known for the 1955 film Rebel Without a Cause.
<strong>Thought 2:</strong> Professions of Nicholas Ray are director, screenwriter, and actor. I need to search Elia Kazan next and find his professions.
<strong>Action 2:</strong> Search[Elia Kazan]
<strong>Observation 2:</strong> Elia Kazan was an American film and theatre director, producer, screenwriter and actor.
<strong>Thought 3:</strong> Professions of Elia Kazan are director, producer, screenwriter, and actor. So profession Nicholas Ray and Elia Kazan have in common is director, screenwriter, and actor.
<strong>Action 3:</strong> Finish[director, screenwriter, actor]

<strong>Question:</strong> Which magazine was started first Arthur&#8217;s Magazine or First for Women?</code></pre><p>In this example, the ReAct agent is explicitly prompted to output a thought prior to every concrete action that it takes. Unlike a traditional LLM, the ReAct agent does not produce a single output per prompt. Rather, the agent generates output sequentially as follows:</p><ol><li><p>Selects an action to perform (either a concrete action or a thought).</p></li><li><p>Gets feedback from the environment based on this action (e.g., the information retrieved from a search query).</p></li><li><p>Continues on to the next action with this new context.</p></li></ol><p>Eventually, the terminal action is reached, triggering the end of the problem solving process; see below. This stateful, sequential problem solving approach is characteristic of agents and helps to distinguish them from standard LLMs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hg2e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hg2e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 424w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 848w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hg2e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hg2e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 424w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 848w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!Hg2e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aadeb9-3ce5-489d-8e0c-8490981161e5_2320x1244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sequentially solving a problem with ReAct</figcaption></figure></div><p><strong>Decision making.</strong> The setup for ReAct on decision making tasks in very similar to that of knowledge-intensive reasoning tasks. For both decision making tasks, humans manually annotate several reasoning trajectories that are used as in-context examples for the ReAct agent. Unlike knowledge-intensive reasoning tasks, however, the thought patterns used by ReAct for decision making tasks are sparse&#8212;<em>the model is prompted to use discretion in determining when and how it should think</em>. Additionally, the ReAct agent is provided with a wider variety of tools and actions to use for the WebShop dataset; e.g., search, filter, choose a product, choose product attributes, buy a product and more. This application serves as a good test of ReAct when interacting with a more complex environment. </p><p><strong>Does ReAct perform well?</strong> The ReAct agents described above are compared to several baselines:</p><ul><li><p><em>Prompting</em>: few-shot prompt that removes thoughts, actions and observations from example trajectories, leaving only questions and answers.</p></li><li><p><em>CoT prompting</em>: same as above, but the model is prompted to produce a chain of thought before outputting a final solution<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p></li><li><p><em>Act (action-only)</em>: removes thoughts from ReAct trajectories, leaving only observations and actions.  </p></li><li><p><em>Imitation</em>: agents trained via imitation and / or reinforcement learning to mimic human reasoning trajectories (e.g., <a href="https://arxiv.org/abs/2010.03768">BUTLER</a>).  </p></li></ul><p>As shown below, the ReAct framework consistently outperforms the Act setup, <em>revealing that the ability of an agent to think as it acts is incredibly important</em>. Going further, we see that CoT prompting is a strong baseline that outperforms ReAct in some cases but struggles in scenarios where the LLM is prone to hallucination&#8212;<em>ReAct is able to leverage external sources of information to avoid hallucinating in these cases</em>.  Finally, we see that there is much room to improve the performance of ReAct agents. In fact, the agents explored in [1] are quite brittle; e.g., authors note that simply retrieving non-informative information can lead to failure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t7P-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t7P-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 424w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 848w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 1272w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t7P-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png" width="1456" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:378818,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t7P-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 424w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 848w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 1272w, https://substackcdn.com/image/fetch/$s_!t7P-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e6e7533-bd8d-4787-821d-c92b081c3e16_1932x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>ReAct + CoT.</strong> ReAct is factual and grounded in its approach to solving problems. Although CoT prompting may suffer from hallucinated facts due to not being grounded in external knowledge, this approach still excels at formulating a structure for solving complex reasoning tasks. ReAct imposes a strict structure of observations, thoughts and actions onto the agent&#8217;s reasoning trajectory, while CoT has more flexibility in formulating the reasoning process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ainZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ainZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 424w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 848w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ainZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png" width="417" height="253.17857142857142" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1456,&quot;resizeWidth&quot;:417,&quot;bytes&quot;:302732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ainZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 424w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 848w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 1272w, https://substackcdn.com/image/fetch/$s_!ainZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb14d0a2-8b3c-43a9-b31d-e34b3a4b5c47_1986x1206.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>To reap the benefits of both approaches<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, <em>we can toggle between them</em>! For example, we can default to CoT prompting if ReAct fails to return an answer after <code>N</code> steps (i.e., ReAct &#8594; CoT) or take several CoT samples and use ReAct if disagreement exists among the answers (i.e., CoT &#8594; ReAct). As shown above, such a backoff approach&#8212;<em>in either direction</em>&#8212;boosts the agent&#8217;s problem solving capabilities.</p><h4>Prior Attempts at Agents</h4><p>Although ReAct was (arguably) the first lasting framework to be proposed for AI agents, there were a variety of impactful papers and ideas previously proposed within the agents space. Here, we will quickly outline some of these key proposals and how they compare, allowing us to understand how the ReAct framework builds upon prior work to create a more useful and popular framework.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u-Fr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u-Fr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 424w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 848w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 1272w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u-Fr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png" width="1456" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1063941,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!u-Fr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 424w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 848w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 1272w, https://substackcdn.com/image/fetch/$s_!u-Fr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae13e457-28b8-40bf-b3ee-c1497e99c8a0_2146x1176.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p><strong>Inner monologue (IM) [10]</strong> was one of the most comparable works to ReAct and is applied to planning in a robotics setting. As shown above, IM integrates an LLM with several domain-specific feedback mechanisms; e.g., scene descriptors or success detectors. Somewhat similarly to ReAct, the LLM is used to generate a plan and monitor the solution of a task&#8212;<em>like picking up an object</em>&#8212;by iteratively acting, thinking and receiving feedback from the external environment. </p><blockquote><p><em>&#8220;We investigate to what extent LLMs used in embodied contexts can reason over sources of feedback provided through natural language&#8230; We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.&#8221;</em> - from [10]</p></blockquote><p>IM demonstrates the feasibility of leveraging LLMs as a general tool for problem solving in domains beyond natural language. Relative to ReAct, however, the ability of the LLM to &#8220;think&#8221; within IM is limited&#8212;<em>the model can only observe feedback from the environment and decide what needs to be done next</em>. ReAct solves this problem by empowering the agent to output extensive, free-form thoughts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hmBA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hmBA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 424w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 848w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 1272w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hmBA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png" width="1456" height="603" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:603,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:339750,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hmBA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 424w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 848w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 1272w, https://substackcdn.com/image/fetch/$s_!hmBA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabdedf5c-f94b-41e4-8628-8f5bf7ca7df5_1690x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [14])</figcaption></figure></div><p><strong>LLMs for interactive decision making (LID) [14]</strong> uses language as a general medium for planning and action by proposing a language-based framework for solving sequential problems. We can formulate the context and action space for a wide variety of tasks as a sequence of tokens, thus converting arbitrary tasks into a standardized format that is LLM-compatible. Then, this data can be ingested by an LLM, allowing powerful foundation models to incorporate feedback from the environment and make decisions; see above. In [14], authors finetune LID using imitation learning to correctly predict actions across a variety of domains.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GqwP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GqwP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 424w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 848w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GqwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png" width="1456" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:560495,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GqwP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 424w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 848w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!GqwP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cb9686e-bbd4-472b-9d3f-94c36adf91ef_2134x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p><strong>WebGPT [11]</strong> explores integrating an LLM (<a href="https://arxiv.org/abs/2005.14165">GPT-3</a>) with a text-based web browser to more effectively answer questions. This work is any early pioneer of open-ended tool use and teaches the LLM how to openly search and navigate the web. However, WebGPT is explicitly finetuned over a large dataset of task solutions from humans (i.e., behavior cloning or imitation learning). Therefore, this system&#8212;<em>despite being very forward-looking and effective (i.e., produces answers preferred to those of a human in &gt;50% of cases)</em>&#8212;requires a large amount of human intervention. Nonetheless, finetuning LLM agents with human feedback is a hot research topic even today, and WebGPT is a foundational work in this space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cMs-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cMs-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 424w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 848w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cMs-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png" width="1252" height="1002" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1002,&quot;width&quot;:1252,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:912460,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cMs-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 424w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 848w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!cMs-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb280ff9b-b2c5-4c67-b96a-e45d7b341ff0_1252x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>Inspired by the broad capabilities of LLMs, <strong>Gato [12]</strong> is a single &#8220;generalist&#8221; agent that is capable of acting across many modalities, tasks and domains. For example, Gato is used for playing Atari, captioning images, manipulating robotic arms and more. As described in the report, Gato is capable of <em>&#8220;deciding based on its context whether to output text, joint torques, button presses, or other tokens.&#8221;</em> This model truly works towards the goal of creating an autonomous system that can solve almost any problem. Similarly to WebGPT, however, Gato is trained via an imitation learning approach that collects a massive dataset of context and actions&#8212;<em>all represented as flat sequences of tokens</em>&#8212;across many problem scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aZGa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aZGa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 424w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 848w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 1272w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aZGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png" width="1456" height="786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:786,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320190,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!aZGa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 424w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 848w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 1272w, https://substackcdn.com/image/fetch/$s_!aZGa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7a99d91-2746-43bf-b5be-b05ce8a6e26a_1834x990.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p><strong>Reasoning via Planning (RAP) [13]</strong> aims to endow LLMs with a better world model&#8212;<em>or an understanding of the environment in which they act and the rewards that come from it</em>&#8212;with the goal of improving the LLM&#8217;s ability to plan solutions to complex, multi-step problems. In particular, the LLM is used to build a reasoning tree that can be explored via <a href="https://www.geeksforgeeks.org/ml-monte-carlo-tree-search-mcts/">Monte Carlo Tree Search (MCTS)</a> to find a solution that achieves high reward. Here, the LLM itself is also used to evaluate solutions&#8212;<em>the LLM serves as both an agent and a world model in RAP</em>!</p><blockquote><p><em>&#8220;The LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and rewards, and efficiently obtains a high-reward reasoning path with a proper balance between exploration vs. exploitation.&#8221;</em> - from [13]</p></blockquote><p>RAP is a useful and effective framework, but it is applied purely to text-based reasoning problems in [13]&#8212;<em>it is not a general problem-solving framework like ReAct</em>. There are many such works that bear a high level of resemblance to agent systems but are applied mostly to improving LLM reasoning capabilities:</p><ul><li><p><a href="https://arxiv.org/abs/2205.09712">Selection-Inference</a> improves LLM reasoning capabilities by separating the problem solving process into alternating steps of selection (or planning) and solving. A similar approach is pioneered by <a href="https://arxiv.org/abs/2208.14271">Creswell et al</a>. </p></li><li><p><a href="https://arxiv.org/abs/2309.06275">Re2</a> is a prompting strategy that improves LLM reasoning capabilities by asking the LLM to re-read the question prior to deriving an answer. </p></li><li><p><a href="https://arxiv.org/abs/2302.12813">LLM-Augmenter</a> combines an LLM with databases or sources of domain-specific information that provide useful external knowledge to the LLM, thus improving groundedness in question-answering tasks.</p></li></ul><p>For a more complete survey of research on the intersection of agents and reasoning for LLMs (and much more), see <a href="https://arxiv.org/abs/2504.09037">this incredible writeup</a>. </p><h2>What is an &#8220;agent&#8221;?</h2><blockquote><p><em>&#8220;The simplest way to view the starting points for language model-based agents is any tool-use language model. The spectrum of agents increases in complexity from here.&#8221;</em> - <a href="https://www.interconnects.ai/p/the-ai-agent-spectrum">Nathan Lambert</a></p></blockquote><p>Despite their popularity in the industry, agents do not have a clear definition&#8212;<em>there is a lot of discussion about what qualifies as an &#8220;agent&#8221;</em>. Lack of clarity on the definition of agents arises from the fact that we encounter a variety of agents in today&#8217;s world that lie on a wide spectrum of complexity. At a high level, the functionality of an agent may appear similar to that of an LLM in some cases, but an agent typically has a wider scope of strategies and tools available for solving a problem. Using the information we have learned so far, we will now create a framework for understanding the spectrum of capabilities an AI agent may possess, as well as how these capabilities differ from a standard LLM. </p><h4>From LLMs to Agents</h4><p>We have learned about a variety of concepts in this overview, including <em>i)</em> standard LLMs, <em>ii)</em> tool usage, <em>iii)</em> reasoning models and <em>iv)</em> autonomous systems for problem solving. Starting with the standard definition of an LLM, we will now explain how each of these ideas can be used to build upon the standard LLM&#8217;s capabilities, creating a system that is more agentic in nature.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GApJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GApJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 424w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 848w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 1272w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GApJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png" width="452" height="185.0919881305638" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:552,&quot;width&quot;:1348,&quot;resizeWidth&quot;:452,&quot;bytes&quot;:70758,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GApJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 424w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 848w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 1272w, https://substackcdn.com/image/fetch/$s_!GApJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a7db9c5-fdc1-4ac5-be2d-85a56cda348d_1348x552.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>[Level 0] Standard LLMs.</strong> As a starting point, we can consider the standard setup for an LLM (depicted above), which receives a textual prompt as input and generates a textual response as output. To solve problems, this system purely relies upon the internal knowledge base of the LLM without introducing external systems or imposing any structure upon the problem-solving process. To solve more complex reasoning problems, we may also use a reasoning-style LLM or a CoT prompting approach to elicit a reasoning trajectory; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lHTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lHTC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 424w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 848w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 1272w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lHTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png" width="652" height="193.45054945054946" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1456,&quot;resizeWidth&quot;:652,&quot;bytes&quot;:137461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lHTC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 424w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 848w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 1272w, https://substackcdn.com/image/fetch/$s_!lHTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20d63a60-9328-4e1e-bfce-3234b7cdfc6a_1846x548.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>[Level 1] Tool usage.</strong> Relying upon an LLM&#8217;s internal knowledge base is risky&#8212;<em>LLMs have a fixed knowledge cutoff date and a tendency to hallucinate</em>. To mitigate this problem, we can teach an LLM how to make API calls for the purpose of retrieving useful information and solving sub-tasks with specialized tools. Using this approach, the LLM can more robustly solve problems by delegating the solution of sub-tasks to more specialized systems; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l6Hv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l6Hv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 424w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 848w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 1272w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l6Hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png" width="508" height="348.1294117647059" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1360,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:125367,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l6Hv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 424w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 848w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 1272w, https://substackcdn.com/image/fetch/$s_!l6Hv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfd4ac4a-6040-40b0-852d-80daeac6bf00_1360x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>[Level 2] Decomposing problems.</strong> Expecting an LLM to solve a complex problem in a single step may be unreasonable. Instead, we can create a framework that plans how a problem should be solved and iteratively derives a solution. Such an LLM system can be handcrafted; e.g., by chaining multiple prompts or executing several prompts in parallel and aggregating their results. Alternatively, we can avoid this manual effort by using a framework like ReAct that relies upon an LLM to sequentially derive and execute a problem-solving strategy; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NrQa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NrQa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 424w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 848w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 1272w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NrQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png" width="634" height="338.33653846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:634,&quot;bytes&quot;:219084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/164903679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NrQa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 424w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 848w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 1272w, https://substackcdn.com/image/fetch/$s_!NrQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3765945-9ea8-47de-a7f7-802a009e9c67_1856x990.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Of course, the problem of decomposing and solving complex problems with an LLM is intricately related to tool usage and reasoning. The LLM may rely upon various tools throughout the problem solving process, and reasoning capabilities are essential for formulating detailed and correct plans for solving a problem. Going further, this LLM-centric approach to problem solving introduces the notion of control flow to inference with an LLM&#8212;<em>the agent&#8217;s output is sequentially built as it statefully moves through a sequence of problem-solving steps</em>.  </p><p><strong>[Level 3] Increasing autonomy.</strong> The above framework outlines most key functionalities of AI agents today. However, we can also make such a system more capable by providing it with a greater level of autonomy. For example, we can include within the agent&#8217;s action space the ability to take concrete actions (e.g., buying an item, sending an email or opening a pull request) on our behalf.</p><blockquote><p><em>&#8220;An agent is anything that can perceive its environment and act upon that environment&#8230; This means that an agent is characterized by the environment it operates in and the set of actions it can perform.&#8221;</em> - <a href="https://huyenchip.com/2025/01/07/agents.html">Chip Huyen</a> </p></blockquote><p>So far, the agents we have outlined always take a prompt from a human user as input. When given this prompt, they begin the process of thinking, acting and formulating an appropriate response. In other words, <em>these agents only take action when triggered by a prompt from a human user</em>. However, this does not have to be the case! We can build agents that continuously operate in the background. For example, a lot of research has been done on <a href="https://openai.com/index/universe/">open-ended computer use agents</a>, and OpenAI recently announced <a href="https://openai.com/index/introducing-codex/">Codex</a>&#8212;<em>a cloud-based software engineering agent that can work on many tasks in parallel and even make PRs to codebases on its own</em>.</p><p><strong>AI agent spectrum.</strong> Combining all of the concepts we have discussed throughout this overview, we could create an agent system that:</p><ul><li><p>Runs asynchronously without any human input.</p></li><li><p>Uses reasoning LLMs to formulate plans for solving complex tasks.</p></li><li><p>Uses a standard LLM to produce basic thoughts or synthesize information.</p></li><li><p>Takes actions in the external world (e.g., booking a plane ticket or adding an event to our calendar) on our behalf.</p></li><li><p>Retrieves up-to-date info via the Google search API (or any other tool).</p></li></ul><p>Each style of LLM&#8212;<em>as well as any other tool or model</em>&#8212;has both strengths and weaknesses. These components provide agent systems with many capabilities that are useful for various aspects of problem solving. <em>The crux of agent systems is orchestrating these components in a way that is seamless and reliable</em>. However, agents <a href="https://www.interconnects.ai/p/the-ai-agent-spectrum">lie on a spectrum</a> and may or may not use all of these functionalities; e.g., the system described above, a basic tool-use LLM and a chain of prompts for solving a particular class of problems all fall under the umbrella of an agent system. </p><h2>The Future of AI Agents</h2><p>Although AI agents are incredibly popular, work in this space&#8212;<em>both from a research and application perspective</em>&#8212;is nascent. As we have learned, agents operate via a sequential problem solving process. If any step in this process goes wrong, then the agent is likely to fail. As such, <em>reliability is a prerequisite for building effective agents in complex environments</em>. In other words, building robust agent systems will require creating LLMs with more <a href="https://en.wikipedia.org/wiki/High_availability#%22Nines%22">nines of reliability</a>; see below. </p><blockquote><p><em>&#8220;Last year, you said the thing that was holding [agents] back was the extra nines of reliability&#8230; that's the way you would still describe the way in which these software agents aren't able to do a full day of work, but are able to help you out with a couple minutes.&#8221;</em> - <a href="https://www.dwarkesh.com/p/sholto-trenton-2">Dwarkesh Podcast</a></p></blockquote><p>Many agents today <a href="https://arxiv.org/abs/2405.13966">are (arguably) brittle</a> due to a lack of reliability. However, progress is being made quickly, both on LLMs in general (i.e., better reasoning and new generations of models) and agents in particular. Recent research has focused especially on <a href="https://arxiv.org/abs/2410.10934">effectively evaluating agents</a>, <a href="https://arxiv.org/abs/2402.03578">creating multi-agent systems</a> and <a href="https://arxiv.org/abs/2410.07706">finetuning agent systems</a> to improve reliability in specialized domains. Given the pace of research in this area, <em>we are likely to see a significant increase in the capabilities and generality of these agent systems in the near future</em>. </p><h4><strong>New to the newsletter?</strong></h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4><strong>Bibliography</strong></h4><p>[1] Yao, Shunyu, et al. "React: Synergizing reasoning and acting in language models." <em>International Conference on Learning Representations (ICLR)</em>. 2023.</p><p>[2] Schick, Timo, et al. "Toolformer: Language models can teach themselves to use tools." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 68539-68551.</p><p>[3] Thoppilan, Romal, et al. "Lamda: Language models for dialog applications." <em>arXiv preprint arXiv:2201.08239</em> (2022).</p><p>[4] Shen, Yongliang, et al. "Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 38154-38180.</p><p>[5] Patil, Shishir G., et al. "Gorilla: Large language model connected with massive apis." <em>Advances in Neural Information Processing Systems</em> 37 (2024): 126544-126565.</p><p>[6] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." <em>Advances in neural information processing systems</em> 35 (2022): 24824-24837.</p><p>[7] Kojima, Takeshi, et al. "Large language models are zero-shot reasoners." <em>Advances in neural information processing systems</em> 35 (2022): 22199-22213.</p><p>[8] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[9] Lambert, Nathan, et al. "T\" ulu 3: Pushing frontiers in open language model post-training." <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[10] Huang, Wenlong, et al. "Inner monologue: Embodied reasoning through planning with language models." <em>arXiv preprint arXiv:2207.05608</em> (2022).</p><p>[11] Nakano, Reiichiro, et al. "Webgpt: Browser-assisted question-answering with human feedback." <em>arXiv preprint arXiv:2112.09332</em> (2021).</p><p>[12] Reed, Scott, et al. "A generalist agent." <em>arXiv preprint arXiv:2205.06175</em> (2022).</p><p>[13] Hao, Shibo, et al. "Reasoning with language model is planning with world model." <em>arXiv preprint arXiv:2305.14992</em> (2023).</p><p>[14] Li, Shuang, et al. "Pre-trained language models for interactive decision-making." <em>Advances in Neural Information Processing Systems</em> 35 (2022): 31199-31212.</p><p>[15] Anthropic. &#8220;Introducing the Model Context Protocol&#8221; <a href="https://www.anthropic.com/news/model-context-protocol">https://www.anthropic.com/news/model-context-protocol</a> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In the context of reasoning models, these chains of thought are also referred to as reasoning trajectories or traces. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is quite similar to the definition of a policy in reinforcement learning (RL); see <a href="https://cameronrwolfe.substack.com/i/137266538/markov-decision-process-mdp">here</a> for details. In both cases, the policy is implemented as a language model and produces an action as output. The main difference between the agent and RL definition of a policy is the policy&#8217;s input. For agents, the input is the current observation. For RL, the the policy&#8217;s input is the current state of the environment. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>See <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#policies">here</a> for details on the difference between a deterministic and stochastic policy. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>CoT prompting can also be extended with <a href="https://arxiv.org/abs/2203.11171">self-consistency</a> with a majority vote to further improve performance. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Notably, ReAct (or any other agentic framework) is not guaranteed to outperform standard CoT prompting! The relative performance of these techniques is highly related to the complexity of problems being solved&#8212;<em>CoT prompting performs very well in cases where hallucination is unlikely to be a problem for the LLM being used</em>. </p></div></div>]]></content:encoded></item><item><title><![CDATA[A Guide for Debugging LLM Training Data]]></title><description><![CDATA[Data-centric techniques and tools that anyone should use when training an LLM...]]></description><link>https://cameronrwolfe.substack.com/p/llm-debugging</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/llm-debugging</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 19 May 2025 09:33:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/725f69ce-2f6f-4914-a797-01ace5b67332_2484x1380.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EX0M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EX0M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 424w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 848w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 1272w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EX0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png" width="2516" height="1357" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1357,&quot;width&quot;:2516,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1890879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2464274f-93c9-4320-aba2-978f6ae93fa2_2516x1376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EX0M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 424w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 848w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 1272w, https://substackcdn.com/image/fetch/$s_!EX0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37c17277-3f8e-4ff3-8e90-2dd7c2bb5565_2516x1357.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Most discussions of LLM training focus heavily on models and algorithms. We enjoy experimenting with new frameworks like <a href="https://arxiv.org/abs/2402.03300">GRPO</a> and anticipate the release of next-generation models like <a href="https://arxiv.org/abs/2503.19786">Gemma-3</a> and <a href="https://arxiv.org/abs/2505.09388">Qwen-3</a>. However, the primary factor distinguishing success from failure in LLM training is the quality of the training dataset. Unfortunately, this topic receives far less attention compared to other popular research areas. In this overview, we will offer a data-centric guide to debugging and optimizing LLM training, <em>emphasizing practical strategies that we can use to iteratively enhance our data and develop more powerful LLMs</em>.</p><h2>The LLM Development Lifecycle</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xSnb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xSnb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 424w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 848w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 1272w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xSnb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png" width="1456" height="513" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa43347c-41d6-473e-8a46-095192476264_1934x682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:258543,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xSnb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 424w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 848w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 1272w, https://substackcdn.com/image/fetch/$s_!xSnb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa43347c-41d6-473e-8a46-095192476264_1934x682.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The key steps of LLM development</figcaption></figure></div><p>When training an LLM, we follow an iterative and empirically-driven process that is comprised of two primary steps (shown above):</p><ol><li><p>Training an LLM.</p></li><li><p>Evaluating the LLM.</p></li></ol><p>To develop an LLM, we simply repeat these steps, eventually yielding an LLM that performs well on evaluations relevant to our application of interest. </p><p><strong>LLM evaluation.</strong> We will not discuss the topic of evaluating LLMs in detail, as this topic is extremely complex. At a high level, however, we evaluate an LLM in two ways&#8212;<em>either manually (i.e., with humans) or automatically</em>. Human evaluation can be setup in several ways; e.g., picking the better of two model responses or scoring a model response along several quality dimensions; see below. As with any other data annotation project, we must invest effort to make sure that these <a href="https://lilianweng.github.io/posts/2024-02-05-human-data-quality/">human evaluations are high-quality</a> and align with what we are trying to measure. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hAhC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hAhC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 424w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 848w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 1272w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hAhC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png" width="1456" height="435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:598283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hAhC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 424w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 848w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 1272w, https://substackcdn.com/image/fetch/$s_!hAhC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f6ef46e-8ff1-4af4-96df-f4c08514ddf7_2486x742.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5, 12])</figcaption></figure></div><p>When developing an LLM, human evaluation is the gold standard for measuring quality&#8212;<em>we should always depend on human evaluation to provide a definitive signal of whether our LLM is getting better or not</em>. However, human evaluation is also time intensive (i.e., takes several days or weeks)! To avoid slowing down our iteration speed, we must develop automatic evaluation metrics to provide a more efficient proxy measure of model quality. Using these automatic metrics, we can perform a much larger number of model iterations between each human evaluation trial, allowing us to improve model quality more quickly; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rRSF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rRSF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 424w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 848w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 1272w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rRSF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png" width="472" height="227.98227474150664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1354,&quot;resizeWidth&quot;:472,&quot;bytes&quot;:89211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rRSF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 424w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 848w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 1272w, https://substackcdn.com/image/fetch/$s_!rRSF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f8440c-a1aa-4252-af9f-e8047f7dcf09_1354x654.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>In terms of automatic evaluation, two main techniques that are typically used&#8212;<em>benchmark-style evaluation and LLM judges</em>; see below. These two strategies test the model&#8217;s performance on closed and open-ended tasks, respectively. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FCJd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FCJd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 424w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 848w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 1272w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FCJd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png" width="414" height="184.4904051172708" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:938,&quot;resizeWidth&quot;:414,&quot;bytes&quot;:102191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FCJd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 424w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 848w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 1272w, https://substackcdn.com/image/fetch/$s_!FCJd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00b3ca29-cb6d-472e-b4f4-69ad7875d503_938x418.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://www.databricks.com/blog/limit-less-more-instruction-tuning">source</a>)</figcaption></figure></div><p>Benchmark-style evaluations (e.g., multiple-choice style questions or question-answer pairs) have been used throughout the history of NLP research. Modern examples of such benchmarks for LLMs include <a href="https://arxiv.org/abs/2009.03300">MMLU</a> or <a href="https://arxiv.org/abs/2311.12022">GPQA Diamond</a>. These benchmarks have closed-ended solutions, but LLMs produce open-ended outputs that can be difficult to evaluate. The most popular technique for open-ended evaluation is LLM-as-a-Judge, or other related techniques (e.g., <a href="https://arxiv.org/abs/2403.13787">reward models</a>, <a href="https://cameronrwolfe.substack.com/p/finetuned-judge">finetuned judges</a> or <a href="https://arxiv.org/abs/2408.15240">verifiers</a>); see the article below for details. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b7838361-e797-487d-85a2-a7fb82b5825f&quot;,&quot;caption&quot;:&quot;This post begins with an introduction to LLM-as-a-Judge and how it can be used to evaluate open-ended LLM outputs. Once these concepts are established, the overview covers several popular research papers in this space, providing a practical view of how LLM-as-a-Judge is used and implemented. &quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Using LLMs for Evaluation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-22T09:34:01.735Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cca744e-8ad5-4266-9680-7da4fe94f497_1878x1052.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/llm-as-a-judge&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:141159804,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:104,&quot;comment_count&quot;:14,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Tweaking the data.</strong> Once we have an evaluation setup, we can begin to train new models and measure their performance. For each new model, we perform some intervention that will (hopefully) benefit the LLM&#8217;s performance. Traditionally, AI researchers are very interested in algorithms and architectures<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, and sometimes we do tweak these details! For example, Llama 4 made significant changes to its post-training pipeline<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, and many LLMs are incorporating new algorithms&#8212;<em>such as <a href="https://arxiv.org/abs/2411.15124">RLVR</a></em>&#8212;into their training pipelines to improve reasoning capabilities. Despite these recent developments, however, <em>the majority of interventions are data-related</em>. We tweak our training data, leave everything else fixed, retrain (or keep training) our model, and see if the new data improves the model&#8217;s performance. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XUX1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XUX1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 424w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 848w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 1272w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XUX1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png" width="1456" height="677" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:677,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204382,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XUX1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 424w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 848w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 1272w, https://substackcdn.com/image/fetch/$s_!XUX1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24592711-8eae-4089-a3e2-2299479d1fcf_1644x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>The most conceptually straightforward data intervention is just collecting more training data. Collecting more data as an LLM is being developed is common. For example, the Llama 2 report [3] notes that models are post-trained in several stages, where more data is collected for further post-training at each stage; see above. Collecting data might seem simple conceptually, but data annotation is an incredibly complex and nuanced topic that requires the correct strategy&#8212;<em>and usually prior experience</em>&#8212;to execute successfully; see <a href="https://lilianweng.github.io/posts/2024-02-05-human-data-quality/">here</a> and <a href="https://eugeneyan.com/writing/labeling-guidelines/">here</a> for more details.</p><blockquote><p><em>&#8220;Getting the most out of human data involves iterative training of models, evolving and highly detailed data instructions, translating through data foundry businesses, and other challenges that add up.&#8221;</em> - <a href="https://rlhfbook.com/c/06-preference-data.html">RLHF book</a></p></blockquote><p><strong>Curating data.</strong> In this report, we will not focus on collecting more data. Instead, we will focus on curating (or debugging) the data we have available. This is an orthogonal approach to human data collection; see below. To do this, we use a variety of techniques to identify high or low-quality data so that we can fix issues in our dataset and focus the training process on the highest-quality data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2flt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2flt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 424w, https://substackcdn.com/image/fetch/$s_!2flt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 848w, https://substackcdn.com/image/fetch/$s_!2flt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 1272w, https://substackcdn.com/image/fetch/$s_!2flt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2flt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png" width="500" height="258.58516483516485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1456,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:138506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2flt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 424w, https://substackcdn.com/image/fetch/$s_!2flt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 848w, https://substackcdn.com/image/fetch/$s_!2flt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 1272w, https://substackcdn.com/image/fetch/$s_!2flt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a8502ae-348d-47be-bd17-888dceb16c60_1598x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two directions to approaching high-quality data (<a href="https://lilianweng.github.io/posts/2024-02-05-human-data-quality/">source</a>)</figcaption></figure></div><p>Given that most interventions to LLM quality are data-related, data curation is a pivotally important topic; e.g., there are <a href="https://www.datologyai.com/">several</a> <a href="https://github.com/bespokelabsai/curator">startups</a> and a <a href="https://arxiv.org/abs/2305.11206">swath</a> <a href="https://arxiv.org/abs/2406.03476">of</a> <a href="https://arxiv.org/abs/2502.03387">great</a> <a href="https://arxiv.org/abs/2305.13169">papers</a> focused on this topic. Despite being so fundamental to the LLM training process, however, data-related topics are usually underrepresented in AI research. Optimizing data is simply not a flashy or popular topic, <em>but it is more often than not the key differentiator between success and failure when training LLMs.</em></p><h4>How do we curate data?</h4><p>Put simply, there are two ways we can curate data:</p><ol><li><p>Directly looking at the data.</p></li><li><p>Using model outputs to debug the training data. </p></li></ol><p>For example, we can curate and debug our data via manual inspection or basic searches and heuristics. Additionally, we can use another model to analyze our data; e.g., tagging, classification, assigning a quality score and more. All of these strategies are unrelated to the downstream model we are creating&#8212;<em>we are directly looking at the training data</em>. Once we have trained a model, however, we can further fuel the data curation process by debugging the LLM&#8217;s outputs as follows:</p><ul><li><p>Identifying poor model outputs.</p></li><li><p>Finding data issues that (potentially) contributed to these outputs. </p></li><li><p>Fixing the data via some intervention.</p></li><li><p>Re-training the model.</p></li></ul><p><strong>A strategy for debugging.</strong> In this overview, we will refer to the two strategies outlined above as data and model-focused curation. There are many terms one could use to refer to these ideas, and this nomenclature is definitely not perfect; e.g., data-focused curation can still involve the use of a model, we just use models to analyze data instead of using the data to train a model. However, we will use this terminology throughout to keep our discussion clear and consistent. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!80-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!80-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 424w, https://substackcdn.com/image/fetch/$s_!80-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 848w, https://substackcdn.com/image/fetch/$s_!80-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 1272w, https://substackcdn.com/image/fetch/$s_!80-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!80-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png" width="592" height="139.46153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1456,&quot;resizeWidth&quot;:592,&quot;bytes&quot;:153325,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!80-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 424w, https://substackcdn.com/image/fetch/$s_!80-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 848w, https://substackcdn.com/image/fetch/$s_!80-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 1272w, https://substackcdn.com/image/fetch/$s_!80-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd00ea7-16a3-49ba-9cd8-de68803dc606_1596x376.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>As we discuss these ideas, we should keep in mind that data and model-focused debugging are <strong>NOT</strong> mutually exclusive. In fact, we should almost always leverage them both. Data-focused curation does not require training any models, which is incredibly useful in the early stages of LLM development. <em>Experienced scientists spend a lot of time analyzing and understanding their data prior to doing any modeling</em>. </p><p>We continue to perform such data-focused analysis over time, but new avenues of analysis become possible once we&#8217;ve trained a model. To debug and improve our LLM, we must develop a multi-faceted approach that allows us to gain a deeper understanding of our model, our data and the connection between them.</p><h2>Data-Focused Curation: Looking at the Data</h2><p>To gain a deep understanding of our data, we will start by simply looking at our data manually. As we manually inspect data, we will begin to notice&#8212;<em>and fix in some cases</em>&#8212;important issues and patterns in our data. To scale this curation process beyond our own judgement, however, we will need to use automated techniques based either upon heuristics or other machine learning models. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O22R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O22R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 424w, https://substackcdn.com/image/fetch/$s_!O22R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 848w, https://substackcdn.com/image/fetch/$s_!O22R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 1272w, https://substackcdn.com/image/fetch/$s_!O22R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O22R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png" width="516" height="157.54373927958832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1166,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:81366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O22R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 424w, https://substackcdn.com/image/fetch/$s_!O22R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 848w, https://substackcdn.com/image/fetch/$s_!O22R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 1272w, https://substackcdn.com/image/fetch/$s_!O22R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e974e72-5226-4ece-8f07-f7f41f7b11c9_1166x356.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://x.com/gdb/status/1622683988736479232">source</a>)</figcaption></figure></div><p><strong>Manual inspection.</strong> The first step in debugging an LLM is simply looking at the model&#8217;s training data. <em>This should occur before we begin to train any models and should continue throughout the lifetime of model development</em>. Manual data inspection is very time consuming (and not always the most fun!), but it is an important part of LLM development. By taking time to manually inspect the data, we gain a better understanding of this data and, in turn, a better understanding of our model. If you ask any LLM researcher, they will likely confirm that they spend a large portion of their time manually inspecting data. This unpopular activity is a key contributor to success in training LLMs&#8212;<em>it cannot (and should not) be avoided</em>! </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jw1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jw1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 424w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 848w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 1272w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jw1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png" width="455" height="332.11678832116786" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:685,&quot;resizeWidth&quot;:455,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jw1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 424w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 848w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 1272w, https://substackcdn.com/image/fetch/$s_!jw1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5bc4f6c-d99c-404c-a61c-2fa78071066c_685x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Original credit goes to <a href="https://x.com/code_star">@code_star</a> for this hilarious (and accurate) meme</figcaption></figure></div><p>The main limitation of manual data inspection is the simple fact that it is not scalable&#8212;<em>there is only so much data that we as researchers can manually inspect</em>. Once we have performed enough manual inspection<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> to understand our data well, we need to develop better strategies for scaling our data inspection efforts. </p><p><strong>Heuristic filtering.</strong> Manual inspection will uncover many issues and interesting patterns in our data. For example, we might notice that certain words are re-used very frequently; see below. To make sure our model does not reflect these sub-optimal patterns in the data, we can use heuristics to find training examples that match these patterns and filter (or modify) them. For example, finding data that re-uses the same set of words can be done via a simple string match. Here, we are using basic heuristics to solve noticeable limitations in our data. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OEJA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OEJA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 424w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 848w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 1272w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OEJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png" width="1456" height="831" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-05-01 at 5.00.30&#8239;PM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-05-01 at 5.00.30&#8239;PM.png" srcset="https://substackcdn.com/image/fetch/$s_!OEJA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 424w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 848w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 1272w, https://substackcdn.com/image/fetch/$s_!OEJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd86002e5-2eda-44be-a4fe-ecea571daa10_1604x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://www.reddit.com/r/ClaudeAI/comments/1fyk8ql/claude_ignores_its_own_system_prompts_with/">source</a>)</figcaption></figure></div><p>There are many other heuristics for data inspection and filtering that we might consider. For example, we might notice that certain sources of data are of higher quality or have useful properties compared to other data sources. To act on this, we can emphasize this data during training<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> or even obtain more data from this source. Similarly, we might notice a formatting issue in a subset of our data that can be identified or fixed with a regex statement. Depending on our observations during the manual inspection phase, there are an almost infinite number of heuristic checks or fixes that might need to be applied to our training dataset.</p><p><strong>Model-based filtering.</strong> If observed issues cannot be fixed heuristically, then we can fix them with the help of a machine learning model. <a href="https://github.com/facebookresearch/fastText">fastText classifiers</a> are heavily used for LLM data filtering due to their efficiency&#8212;<em>they can operate even at pretraining scale</em>. Concrete examples of fastText models being used for LLM data filtering include language identification (e.g., filtering out non-English data) or <a href="https://arxiv.org/abs/2402.00159">identifying toxic content</a>. However, <a href="https://fasttext.cc/docs/en/python-module.html">custom fastText models can be easily trained</a> to handle a variety of bespoke filtering tasks. We just <em>i)</em> train the model on examples of the data we want to identify, <em>ii)</em> use the model to identify such data and <em>iii)</em> either remove or keep the data that is identified; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OCL8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OCL8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 424w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 848w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 1272w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OCL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png" width="569" height="203.21428571428572" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:1456,&quot;resizeWidth&quot;:569,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-05-01 at 4.53.05&#8239;PM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-05-01 at 4.53.05&#8239;PM.png" srcset="https://substackcdn.com/image/fetch/$s_!OCL8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 424w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 848w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 1272w, https://substackcdn.com/image/fetch/$s_!OCL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffef41e5d-d721-460b-97c1-78d9cd3baaed_2048x731.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(<a href="https://docs.google.com/presentation/d/179dpzWSQ9G7EAUlvaJdeE0av9PLuk9Rl33nfhHSJ4xI/edit?usp=sharing">source</a>)</figcaption></figure></div><p>We can also use other kinds of models for the purpose of data filtering. For example, <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM-as-a-Judge</a>-style models are commonly used both for filtering data and creating synthetic data. <a href="https://arxiv.org/pdf/2212.08073">Constitutional AI</a> is a popular example of using LLM judges to create synthetic preference pairs and Llama 4 uses an LLM judge to remove easier examples from their <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning</a> dataset. We can apply similar approaches to identify arbitrary properties and patterns&#8212;<em>usually with reasonably-high accuracy</em>&#8212;within our data for the purpose of filtering. </p><blockquote><p><em>&#8220;We removed more than 50% of our data tagged as easy by using Llama models as a judge and did lightweight SFT on the remaining harder set.&#8221;</em> - from [13]</p></blockquote><p>Such larger models are much less efficient relative to a fastText model, which limits them to smaller-scale use cases (usually post-training). If we compare <a href="https://cameronrwolfe.substack.com/p/language-understanding-with-bert">BERT-base</a>, which is ~10,000&#215; smaller than some of the largest modern LLMs, to a fastText model, the difference in efficiency and required hardware is massive; see below. Nonetheless, developing more sophisticated approaches and models for data curation is one of the most impactful topics in AI research right now. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hcuw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hcuw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 424w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 848w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 1272w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hcuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png" width="504" height="192.80769230769232" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:557,&quot;width&quot;:1456,&quot;resizeWidth&quot;:504,&quot;bytes&quot;:136954,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hcuw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 424w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 848w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 1272w, https://substackcdn.com/image/fetch/$s_!hcuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74956760-c648-499a-8c96-b6b9e8d2d027_1694x648.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Using fastText vs. BERT-base for data filtering (<a href="https://docs.google.com/presentation/d/179dpzWSQ9G7EAUlvaJdeE0av9PLuk9Rl33nfhHSJ4xI/edit?usp=sharing">source</a>)</figcaption></figure></div><h2>Model-Focused Curation: Debugging the LLM&#8217;s Outputs</h2><p>Once we have started training LLMs over our data, we can begin to use these LLMs to debug issues within the training dataset. The idea of model-focused curation is simple, we just:</p><ol><li><p>Identify problematic or incorrect outputs produced by our model.</p></li><li><p>Search for instances of training data that may lead to these outputs.</p></li></ol><p>The identification of problematic outputs is handled through our evaluation system. We can either have humans (even ourselves!) identify poor outputs via manual inspection or efficiently find incorrect or low-scoring outputs via our automatic evaluation setup. Once these problematic outputs have been identified, debugging our LLM becomes a search problem&#8212;<em>we want to find training examples that may be related to these poor outputs</em>. In this section, we will go over several common approaches for this, culminating with a low-cost and efficient method for tracing data called OLMoTrace [2] that was recently developed by Ai2. </p><h4>Searching over Training Data</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3e1n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3e1n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 424w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 848w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 1272w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3e1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png" width="568" height="177.1098901098901" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:1456,&quot;resizeWidth&quot;:568,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3e1n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 424w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 848w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 1272w, https://substackcdn.com/image/fetch/$s_!3e1n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F407d89cc-a925-4b68-9d9b-5c0a2f563fec_1654x516.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Searching for relevant training data is similar to any other search problem; see above. The only difference is that our query is an output from our LLM, rather than something that we input into a search bar. But, all of the same techniques for search can be applied to solving this problem. For a deep dive on this topic, check out the overview below. In this section, we will briefly cover the key concepts of search and how they can be applied to tracing training data.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;982aecee-3e58-4aeb-a257-dd907978e333&quot;,&quot;caption&quot;:&quot;An introduction to modern search system and the role that LLMs play in making these systems more accurate. &quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Basics of AI-Powered (Vector) Search&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-01-08T10:19:35.010Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aee9216b-3a99-4432-8a6c-ce97bf9ad073_2394x1338.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:140061921,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:68,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Lexical search.</strong> For many years prior to the popularization of deep learning, most search engines were <a href="https://huggingface.co/blog/xhluca/bm25s">purely lexical</a>, meaning that they rely on keyword (or n-gram) matches to find documents relevant to a query. To find these matches efficiently, we use a data structure called an <a href="https://www.geeksforgeeks.org/inverted-index/">inverted index</a>. By counting matches between each query and document, as well as considering the uniqueness of each n-gram that is matched, we can derive a relevance score for each document. The most common algorithm for this is <a href="https://en.wikipedia.org/wiki/Okapi_BM25">BM25</a>, which is computed as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5f9T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5f9T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 424w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 848w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 1272w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5f9T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png" width="1456" height="417" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:417,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:234125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5f9T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 424w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 848w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 1272w, https://substackcdn.com/image/fetch/$s_!5f9T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82315c90-36c0-46e7-92b9-7b77a34a5280_2250x644.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Equation for computing BM25 scores</figcaption></figure></div><p>Although these details might seem complex, we can easily implement BM25-powered search via Python packages like <a href="https://github.com/dorianbrown/rank_bm25">rank_bm25</a> or <a href="https://github.com/xhluca/bm25s">bm25s</a>. With these packages, we can build a search index over our data in Python and start running searches as shown in the code example below. As we can see, this functionality is easy to prototype and begin using without too much effort!</p><pre><code>from transformers import AutoTokenizer
from rank_bm25 import BM25Okapi

tok = AutoTokenizer.from_pretrained(&lt;your tokenizer&gt;)

corpus = [
    "Here is a training example",
    "Here is another training example...",
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)</code></pre><p><strong>Semantic search.</strong> Despite the power and efficiency of lexical search, this technique is still dependent upon keyword matching&#8212;<em>semantic matches (i.e., different words with similar meaning) are not captured by this framework</em>. If we want to handle semantic matches, we need to use some form of vector search; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pf1M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pf1M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 424w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 848w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 1272w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pf1M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png" width="1456" height="245" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:245,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pf1M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 424w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 848w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 1272w, https://substackcdn.com/image/fetch/$s_!pf1M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd19f706e-65a8-4236-a652-d1bd5958e61c_2124x358.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A simple vector search pipeline</figcaption></figure></div><p>In vector search, we use an <a href="https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/">embedding model</a> to produce and embedding for each document we want to search. Then, we store all of these embeddings in a vector database, which allows us to efficiently search for similar embeddings using algorithms like <a href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">hierarchical navigable small worlds (HNSW)</a>. From here, we can simply embed our query and search for similar embeddings within the index, allowing us to find documents that are semantically similar to our query! This is exactly what is done by retrieval augmented generation (RAG) to retrieve relevant text chunks to add into the context of an LLM; see <a href="https://cameronrwolfe.substack.com/p/a-practitioners-guide-to-retrieval">here</a> for details. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qHxF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qHxF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 424w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 848w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 1272w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qHxF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png" width="1456" height="635" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qHxF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 424w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 848w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 1272w, https://substackcdn.com/image/fetch/$s_!qHxF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635a986a-9e9a-4a3a-8fd8-860017df9770_1802x786.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Difference between bi-encoders and cross-encoders</figcaption></figure></div><p>The semantic search system outlined above uses bi-encoders, which produce separate embeddings&#8212;<em>that are matched together via <a href="https://www.geeksforgeeks.org/cosine-similarity/">cosine similarity scores</a></em>&#8212;for each document and query. However, we can also use cross-encoders, which take both the document and query as input and output a single similarity score. The difference between these two strategies is illustrated in the figure above. A variety of pretrained bi-encoders and cross-encoders are available in public repos and can be either finetuned or used out-of-the-box; see <a href="https://sbert.net/">here</a> for more details. </p><p>Modern search systems combine all of these techniques. A hybrid of both bi-encoders and (BM25) lexical search are first used to efficiently retrieve documents that are most relevant to our query. Then, we perform a fine-grained ranking of retrieved documents using a cross-encoder, <em>bringing the most relevant documents to the top of the list</em>; see below. All of these components can be finetuned over data collected as the search engine is used to improve their accuracy over time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Alf3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Alf3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 424w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 848w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 1272w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Alf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png" width="1456" height="378" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Alf3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 424w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 848w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 1272w, https://substackcdn.com/image/fetch/$s_!Alf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb72ab1d3-a9ee-42ea-ba81-8e03fa5f841e_1720x446.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Modern AI-powered search framework</figcaption></figure></div><p><strong>Applying search to debugging.</strong> Now that we understand the basics of search systems, we can also apply these ideas to debugging LLM outputs. However, there are two unique considerations for debugging LLM outputs that make this use case different from a standard search application:</p><ul><li><p>LLM training datasets can be massive (tens of trillions of tokens), which can prohibit the use of some techniques.</p></li><li><p>Depending on the use case, the output of an LLM, as well as the documents over which the LLM is trained, can be very long.</p></li></ul><p>If we are tracing a large dataset, using techniques like vector search&#8212;<em>although not impossible</em>&#8212;can be both time consuming and expensive. We have to first produce embeddings for our entire dataset, then store these embeddings in a vector database to make them searchable. This process requires a lot of setup (including the creation of large-scale data pipelines!), which makes the barrier to entry high. </p><p>Going further, the fact that our LLM&#8217;s outputs and training documents can be very long means that we need to approach this search problem differently. Instead of using the entire output as a search query, we need to consider shorter spans in this output and search for similar spans in the training data. Ideally, we want to develop a technique for tracing our training data that is:</p><ul><li><p>Relatively simple to setup.</p></li><li><p>Efficient on large-scale datasets. </p></li><li><p>Able to operate on a (shorter) span level.</p></li></ul><h4><a href="https://arxiv.org/abs/2401.17377">Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens</a> [1]</h4><blockquote><p><em>&#8220;Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram&#8212;powered by suffix arrays&#8212;that can compute &#8734;-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency.&#8221;</em> - from [1]</p></blockquote><p>To understand how we can efficiently trace a massive dataset, we need to first understand the concept of an infini-gram [1]. Put simply, infini-grams are the generalization of <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> to arbitrarily large values of <code>N</code>. As we will see, the data structure that we use to compute the probability of an infini-gram can also be used to (very efficiently) locate and count text spans of arbitrary length within a massive dataset. <em>This property is very useful for model-focused curation and debugging!</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v-fq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v-fq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 424w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 848w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 1272w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v-fq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png" width="526" height="220.37087912087912" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8879a41a-9544-4403-b544-0b66338a90be_2000x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:142664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v-fq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 424w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 848w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 1272w, https://substackcdn.com/image/fetch/$s_!v-fq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8879a41a-9544-4403-b544-0b66338a90be_2000x838.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Creating n-grams from a sequence of text</figcaption></figure></div><p><strong>What are n-gram LMs? </strong>An n-gram is simply an ordered set of <code>N</code> tokens (or words). Given a sequence of text, we can break it into n-grams as shown above, where we choose <code>N = 3</code>. If we break an entire dataset of text into n-grams, we can actually compute the probability of a given n-gram by simply counting the number of times that it occurs within the dataset; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ouM8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ouM8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 424w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 848w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 1272w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ouM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png" width="1456" height="383" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:383,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ouM8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 424w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 848w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 1272w, https://substackcdn.com/image/fetch/$s_!ouM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e7f4c-5367-4a52-a0e7-53de8f0b45f6_2366x622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing n-gram probabilities</figcaption></figure></div><p>All of these counts are usually pre-computed and stored in a <a href="https://web.stanford.edu/~jurafsky/slp3/3.pdf">count table</a>, allowing us to quickly lookup n-gram probabilities and evaluate the expression shown above. We can actually form a simple language model using n-gram probabilities! To predict the next token in a sequence using n-grams, we just:</p><ol><li><p>Look at the last <code>N - 1</code> tokens in the sequence.</p></li><li><p>Get the probability of each possible n-gram given the prior <code>N - 1</code> tokens.</p></li><li><p><a href="https://huggingface.co/blog/mlabonne/decoding-strategies">Sample the next token</a> similarly to any other language model. </p></li></ol><p><strong>Limitations of n-grams.</strong> Practically speaking, n-gram LMs are not great at generating text&#8212;<em>you will not be able to make a powerful chatbot by counting n-grams</em>. Although this is true for any value of <code>N</code>, one of the key issues that limits the performance of n-gram LMs is the fact that n-gram count tables grow (almost) exponentially in size with respect to <code>N</code>. As a result, most n-gram LMs are limited to small values of <code>N</code>&#8212;<em>e.g., </em><code>N = 5</code><em> is a common setting</em>&#8212;and have a low capacity for capturing meaningful, long-context language distributions; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLAq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 424w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 848w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 1272w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:476459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nLAq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 424w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 848w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 1272w, https://substackcdn.com/image/fetch/$s_!nLAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2286afc0-b048-419f-95f1-2e152ff94137_1884x942.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Additionally, n-gram LMs struggle with sparsity. Some n-grams may not appear in our data, forcing us to fall back to smaller n-grams to compute a probability&#8212;<em>this concept is typically referred to as n-gram &#8220;backoff&#8221;</em>. Forming a valid probability estimate when backing off to smaller n-grams is actually <a href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">quite complicated</a>. </p><p><strong>Making n-grams relevant again.</strong> In [1], authors propose a variant of n-gram LMs&#8212;<em>called infini-grams (or &#8734;-grams)</em>&#8212;that mesh better with modern LLMs. Relative to standard n-grams, infini-grams make two key changes:</p><ol><li><p>They are trained over a massive text dataset (trillions of tokens) like any other modern LLM, thus mitigating issues with sparsity.</p></li><li><p>The value of <code>N</code> can be made arbitrarily large when computing the probability of an n-gram, which captures more meaningful distributions in the data.</p></li></ol><p><strong>What are &#8734;-grams?</strong> By making these changes, infini-grams solve the two biggest issues with n-gram LMs that we covered above. <em>How does this work?</em> Assume we have a textual sequence <code>w</code>. To compute the infini-gram of token <code>i</code>, we consider all tokens that precede token <code>i</code> in the sequence; see below.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2kGf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2kGf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 424w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 848w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 1272w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2kGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png" width="1456" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2004290-7717-47cd-896a-46acc539811f_1882x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2kGf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 424w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 848w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 1272w, https://substackcdn.com/image/fetch/$s_!2kGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2004290-7717-47cd-896a-46acc539811f_1882x698.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing infini-gram probabilities</figcaption></figure></div><p>On the left side of this equation, the infini-gram probability is conditioned on the entire prior context of the sequence, which is different from before. However, the right side of this equation exactly matches that of the n-gram probability! <em>The key difference between n-grams and infini-grams lies in how we select the value of </em><code>N</code>.</p><p>For n-grams, <code>N</code> is a (fixed) hyperparameter. In contrast, infini-grams use a backoff procedure to dynamically select <code>N</code>. More specifically, we test the denominator of this expression with the largest possible <code>N</code>&#8212;<em>all preceding tokens in the sequence</em>&#8212;and continually decrease <code>N</code> by one until the denominator is non-zero; see below. </p><blockquote><p><em>&#8220;We stop backing off as soon as the denominator becomes positive, upon which the numerator might still be zero&#8230; the effective n is equal to one plus the length of the prompt&#8217;s longest suffix that appears in the training data.&#8221;</em> - from [1]</p></blockquote><p>If we define <code>w&#8217;</code>as the subsequence of <code>w</code> up to (and including) token <code>i - 1</code>, then this backoff procedure is simply finding the longest suffix of <code>w&#8217;</code> that exists in our dataset. From here, we use the value of <code>N</code> found via backoff to compute the infini-gram probability using the standard n-gram probability expression from before.</p><p><strong>Computing &#8734;-gram probabilities.</strong> To compute infini-gram probabilities, we cannot just precompute counts and store them in a table like before. The value of <code>N</code> is unbounded and infini-grams are trained over LLM-scale datasets in [1]&#8212;<em>the size of such a count table would be massive</em>. Instead, we use a data structure called a <a href="https://en.wikipedia.org/wiki/Suffix_array">suffix array</a> to create an engine for efficiently computing infini-gram probabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nERk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nERk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 424w, https://substackcdn.com/image/fetch/$s_!nERk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 848w, https://substackcdn.com/image/fetch/$s_!nERk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 1272w, https://substackcdn.com/image/fetch/$s_!nERk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nERk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png" width="354" height="283.1028806584362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:729,&quot;resizeWidth&quot;:354,&quot;bytes&quot;:88372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8ebc48d-5830-43d6-933c-ee6eb637e1de_2090x734.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nERk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 424w, https://substackcdn.com/image/fetch/$s_!nERk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 848w, https://substackcdn.com/image/fetch/$s_!nERk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 1272w, https://substackcdn.com/image/fetch/$s_!nERk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069cd044-9bb4-4bcf-bb11-511d82c54341_729x583.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Suffix array on a toy sequence of six characters (from [1])</figcaption></figure></div><p>The concept of a suffix array is depicted above. Given a sequence of text <code>w</code> with length <code>L</code>, a suffix array is constructed by:</p><ol><li><p>Extracting every suffix of this sequence (there are <code>L</code> of them).</p></li><li><p>Sorting the suffixes <a href="https://en.wikipedia.org/wiki/Lexicographic_order">lexicographically</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p></li><li><p>Storing the original index (prior to sorting) of each sorted suffix within a list&#8212;<em>this is the suffix array</em>!</p></li></ol><p>Consider <code>w&#8217;</code> to be an arbitrary subarray of <code>w</code> running from token <code>i</code> to token <code>j</code>, where <code>i &lt; j</code>.  Any suffix that begins with <code>w&#8217;</code> is stored consecutively in the suffix array due to the array being sorted lexicographically. Using this property, we can efficiently compute the count of <code>w&#8217;</code> in <code>w</code>. We just find the index of the first and last suffix in the array for which <code>w&#8217;</code> is a prefix, and the count of <code>w&#8217;</code> in <code>w</code> is the difference between these two indices. If we can compute the count of <code>w&#8217;</code>, we can compute arbitrary infini-gram probabilities&#8212;<em>this operation can be used to find </em><code>N</code><em> and compute both counts within the infini-gram probability expression</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kw-x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kw-x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 424w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 848w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 1272w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kw-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png" width="478" height="261.01614434947766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:1053,&quot;resizeWidth&quot;:478,&quot;bytes&quot;:139806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a240ac1-ed47-4ae4-ae6d-7619ae30f204_2090x734.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kw-x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 424w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 848w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 1272w, https://substackcdn.com/image/fetch/$s_!kw-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25d05a53-38fd-4c7d-a477-0b8bd256443d_1053x575.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Suffix array on textual tokens (from [1])</figcaption></figure></div><p><strong>&#8734;-grams for LLMs.</strong> In the context of LLMs, our sequence <code>w</code> is the LLM&#8217;s entire tokenized training dataset, where document boundaries are marked with fixed separator token(s)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>; see above. This sequence will be large&#8212;<em>modern LLMs are trained on tens of trillion of tokens</em>&#8212;but suffix arrays can handle data of this scale<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><blockquote><p><em>&#8220;During inference, the entire infini-gram index can stay on-disk, which minimizes the compute resources needed (no GPU, and minimal CPU / RAM)&#8230; Our most optimized infini-gram engine can count a given n-gram with an average latency of less than 20 milliseconds. It can compute the probability and next-token distribution in 40 milliseconds for n-gram LMs, and in 200 milliseconds for the &#8734;-gram.&#8221;</em> - from [1]</p></blockquote><p>For example, the suffix array built over a 5T token dataset in [1] consumes ~35Tb of memory. Building this suffix array takes ~48 hours, and the entire suffix array can be stored on disk&#8212;<em>even when computing infini-gram probabilties</em>&#8212;after it is created. The resulting infini-gram engine can be used to compute probabilities for over two <em>quadrillion</em> unique n-grams. However, retrieving the count of a given n-gram on a dataset of this size still takes only ~20 milliseconds!</p><p><strong>Using &#8734;-grams in practice. </strong>Fully grasping the ideas behind infini-grams will take some time. Luckily, the entire infini-gram project&#8212;<em>similarly to any other project from <a href="https://allenai.org/">Ai2</a></em>&#8212;is fully open-source! There are plenty of open-source tools available for working with infini-grams in Python. See the <a href="https://infini-gram.io/">project website</a> for full details. </p><pre><code>%pip install infini_gram 
python -m infini_gram.indexing 
    --data_dir &lt;path to data&gt;
    --save_dir &lt;path to save index&gt;
    --tokenizer llama  # also supports gpt2 and olmo
    --cpus &lt;cpus available&gt;
    --mem &lt;memory available (in Gb)&gt;
    --shards 1  # increase if N &gt; 500B
    --add_metadata 
    --ulimit 1048576</code></pre><p>The tool that is most relevant to this overview is the <a href="https://infini-gram.readthedocs.io/en/latest/pkg.html">inifini-gram Python package</a>. Several open LLM training datasets have already been <a href="https://infini-gram.readthedocs.io/en/latest/pkg.html#pre-built-indexes">pre-indexed within this package</a>, but we can also use the package to build an infini-gram index over our custom dataset using the command above. Once the index is available, we can use the infini-gram Python package to efficiently run a variety of different search and counting operations; see below for examples and <a href="https://infini-gram.readthedocs.io/en/latest/pkg.html#query-types">here</a> for more details. </p><pre><code>from infini_gram.engine import InfiniGramEngine
from transformers import AutoTokenizer

# instantiate tokenizer (must match tokenizer used for indexing)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    add_bos_token=False,
    add_eos_token=False,
)

# connect to infini-gram engine
engine = InfiniGramEngine(
    index_dir=&lt;path to index&gt;,
    eos_token_id=tokenizer.eos_token_id,
)

# sample n-gram / sequence
inp = "This is my sample n-gram sequence."
inp_ids = tokenizer.encode(inp)

# find matching n-grams in dataset
result = engine.find(input_ids=input_ids)

# n-gram count
result = engine.count(input_ids=inp_ids)

# n-gram probability
result = engine.prob(
    prompt_ids=inp_ids[:-1],
    cont_id=inp_ids[-1],
)

# next token distribution
result = engine.ntd(prompt_ids=inp_ids)

# infini-gram probability
result = engine.infgram_prob(
    prompt_ids=inp_ids[:-1],
    cont_id=inp_ids[-1],
)</code></pre><h4><a href="https://arxiv.org/abs/2504.07096">OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</a> [2]</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sFmj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sFmj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 424w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 848w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 1272w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sFmj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png" width="1232" height="810" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:554364,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sFmj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 424w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 848w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 1272w, https://substackcdn.com/image/fetch/$s_!sFmj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5aedfd0f-700f-41bd-85b3-3eff8c3ae7dd_1232x810.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>OLMoTrace [2] pioneers a novel approach for efficiently attributing the output of an LLM to examples within its training data. This approach is deployed within the <a href="https://playground.allenai.org/">Ai2 playground</a> (shown above) and can perform a trace to retrieve training documents that are relevant to an LLM&#8217;s output in seconds. Given that LLMs are trained over massive datasets, we might wonder how such a real-time trace would be possible. Luckily, we have already learned the answer: <em>infini-grams</em>! </p><blockquote><p><em>&#8220;The purpose of OLMOTRACE is to give users a tool to explore where LMs may have learned to generate certain word sequences, focusing on verbatim matching as the most direct connection between LM outputs and the training data.&#8221;</em> - from [2]</p></blockquote><p><strong>Tracing strategy.</strong> The key idea behind OLMoTrace is to find examples of long and unique token sequences that are present both in the model&#8217;s output and its training dataset. Given a prompt and LLM response as input, OLMoTrace will return the following:</p><ul><li><p>A set of notable textual spans found in the LLM&#8217;s response.</p></li><li><p>A list of the most relevant document spans from the LLM&#8217;s training data associated with each response span. </p></li></ul><p>Unlike vector search, these matches between the model&#8217;s output and training data must be verbatim. Exact token matches can be quickly identified with a suffix array, as discussed in the last section. However, ensuring that the best possible matching documents are identified and returned requires a four-step algorithm that is built on top of the standard infini-gram functionality.</p><p><strong>(Step 1) Maximal Matching Spans. </strong>After tokenizing the LLM&#8217;s response, we find all text spans in this response that satisfy three properties:</p><ol><li><p><em>Existence</em>: the span has an exact match in the training data. </p></li><li><p><em>Maximality</em>: the span is not a sub-span of another matching span. </p></li><li><p><em>Self-contained</em>: the span is not incomplete; e.g., beginning or ending with incomplete words or containing punctuation in the middle of the span. </p></li></ol><p>These properties are illustrated within the figure below. Here, we see that there are three matching spans. However, all spans except for one&#8212;<em>outlined in green</em>&#8212;are removed due to either not being <em>i)</em> maximal or <em>ii)</em> self-contained.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xZn3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xZn3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 424w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 848w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 1272w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xZn3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png" width="1456" height="429" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xZn3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 424w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 848w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 1272w, https://substackcdn.com/image/fetch/$s_!xZn3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57e984b7-bba7-44d3-a877-c3d74693180d_2178x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Illustration of maximal and self-contained spans</figcaption></figure></div><p>Computing maximal spans naively is inefficient, but authors in [2] propose a more efficient algorithm that relies upon the <code>find</code> operation in the infini-gram index. Given a sequence of tokens as input, the <code>find</code> operation returns:</p><ul><li><p>The count of matching spans in the index.</p></li><li><p>A range of segments<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> that can be used to look up matching data spans. </p></li></ul><p>However, if the returned count is zero&#8212;<em>indicating that our data has no exact matches for this sequence</em>&#8212;the <code>find</code> operation will still return an (empty) segment range. Because the suffix array is sorted lexicographically, the index of this range corresponds to the longest matching prefix of the sequence in our dataset.</p><pre><code># run find operation with infini-gram engine
result = engine.find(input_ids=inp_ids)

"""
### .find() output example (match): 
    {
        'cnt': 10,
        'segment_by_shard': [(13693395, 13693405)],
    }

### .find() output example (no match):
    {
        'cnt': 0,
        'segment_by_shard': [(85267640, 85267640)],
    }
"""

# lookup training documents from .find()
rank_start, rank_end = result['segment_by_shard'][0]
ranks = [r for r in range(rank_start, rank_end)]
for r in ranks:
    docs = engine.get_doc_by_rank(
        s=0,  # assumes suffix array has a single shard
        rank=r,
        max_disp_len=len(inp_ids) * 5,  # size of doc chunk
    )
    doc_text = [tokenizer.decode(d['token_ids']) for d in docs]
    print(f'Number of documents: {len(docs)}')
    print(f'Matching document: {doc_text[0]}')</code></pre><p>This property of the <code>find</code> operation is leveraged in [2] to create an efficient algorithm for span matching. As shown in the figure below, this algorithm operates by running a single <code>find</code> operation for every suffix of the input sequence, <em>yielding the longest matching prefix for each suffix</em>. Once all of these matching spans have been identified, we can make another pass through this list to remove any matching spans that are not maximal or self-contained. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ofmw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ofmw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 424w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 848w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 1272w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ofmw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png" width="1456" height="506" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:509793,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ofmw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 424w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 848w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 1272w, https://substackcdn.com/image/fetch/$s_!ofmw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb435fcab-0df9-4a07-b6f4-fc7c07e646d7_2314x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>(Step 2) Span Filtering.</strong> If our list of maximal spans computed as described above is long, we need some strategy to identify the most useful and relevant of these spans. To do this, authors in [2] score spans according to their span unigram probability (lower is better)&#8212;<em>or the product of unigram probabilities for each token in the span.</em> The unigram probability of a given token, which is usually precomputed for all tokens and stored in a cache, can be computed as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QtrX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QtrX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 424w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 848w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 1272w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QtrX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png" width="360" height="153.90825688073394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:1090,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:95615,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QtrX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 424w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 848w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 1272w, https://substackcdn.com/image/fetch/$s_!QtrX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb6f7bb6-632c-40d1-bfbf-db4f585a4969_1090x466.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Computing a token&#8217;s unigram probability</figcaption></figure></div><p>In [2], authors sort spans by their span unigram probability and keep only the first <code>K</code> spans in this list, where <code>K = ceil(0.05 x L)</code> for a sequence of length <code>L</code>.</p><p><strong>(Step 3-4) Merge Spans and Get Documents.</strong> To avoid clutter, overlapping spans are merged together in OLMoTrace. Documents for each of these final spans are retrieved. But, the number of documents associated with each span can be large, so we must sub-select documents; e.g., authors in [2] retain ten documents per span. To find the most relevant documents, we can rank them according to the <a href="https://pypi.org/project/rank-bm25/">BM25 score</a> between the LLM&#8217;s output and the retrieved document.</p><blockquote><p><em>&#8220;To prioritize showing the most relevant documents, in the document panel we rank all documents by a BM25 score in descending order. The per-document BM25 score is computed by treating the collection of retrieved documents as a corpus, and the concatenation of user prompt and LM response as the query.&#8221;</em> - from [2]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rZWx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rZWx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 424w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 848w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rZWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png" width="1456" height="719" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:719,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1064812,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rZWx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 424w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 848w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!rZWx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F250f798c-fb39-46d7-83c4-da76cbbeccda_2150x1062.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Example implementation.</strong> The inference pipeline for OLMoTrace is shown in the figure above. To better understand how this works, let&#8217;s (quickly) implement the core functionality using the infini-gram package in Python. To build an infini-gram index, we need to put all of our LLM&#8217;s training data into a single directory. The infini-gram package expects the data to be formatted as one or more <code>.jsonl</code> files, where each file contains <code>text</code> and <code>metadata</code> fields; see below. Each line of the <code>.jsonl</code> file corresponds to a single document in our training dataset.</p><pre><code>{
    'text': 'This is a training sequence for our LLM...',
    'metadata': {
        'source': &lt;url&gt;,
        'category': 'general',
        'year': 2025,
        ...
    },
}</code></pre><p>Once our data has been formatted as such, we can build the infini-gram index as outlined before. Additionally, OLMoTrace requires us to pre-compute unigram probabilities for all tokens. Both of these steps are implemented below. This code assumes that we use the <a href="https://huggingface.co/meta-llama/Llama-2-7b-hf">Llama 2 tokenizer</a> to perform tracing and that we only require a single shard for our infini-gram index. The underlying tokenizer <a href="https://infini-gram.readthedocs.io/en/latest/indexing.html">can be modified</a>, and support for multiple shards in the index may be required when working with very large datasets (i.e., more than 500B tokens).</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist138083793\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-olmo_trace_index-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;olmo_trace_index.py content, created by wolfecameron on 04:30AM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;olmo_trace_index.py\&quot;>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>os</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>json</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>collections</span> <span class=pl-k>import</span> <span class=pl-v>Counter</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>tempfile</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>transformers</span> <span class=pl-k>import</span> <span class=pl-v>AutoTokenizer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># load tokenizer / data</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>enc</span> <span class=pl-c1>=</span> <span class=pl-v>AutoTokenizer</span>.<span class=pl-c1>from_pretrained</span>(<span class=pl-s>&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;</span>, <span class=pl-s1>add_bos_token</span><span class=pl-c1>=</span><span class=pl-c1>False</span>, <span class=pl-s1>add_eos_token</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>data_rows</span> <span class=pl-c1>=</span> [{<span class=pl-s>&amp;#39;text&amp;#39;</span>: <span class=pl-s>&amp;#39;here is some training data&amp;#39;</span>}, ...]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># compute / save unigram probabilities</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>all_toks</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>x</span> <span class=pl-c1>in</span> <span class=pl-s1>data_rows</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>all_toks</span>.<span class=pl-c1>extend</span>(<span class=pl-s1>enc</span>.<span class=pl-c1>encode</span>(<span class=pl-s1>x</span>[<span class=pl-s>&amp;#39;text&amp;#39;</span>]))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>total_toks</span> <span class=pl-c1>=</span> <span class=pl-en>len</span>(<span class=pl-s1>all_toks</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>tok_count</span> <span class=pl-c1>=</span> <span class=pl-en>Counter</span>(<span class=pl-s1>all_toks</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>unigram_probs</span> <span class=pl-c1>=</span> {}</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>tid</span> <span class=pl-c1>in</span> <span class=pl-s1>tok_count</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>cnt</span> <span class=pl-c1>=</span> <span class=pl-s1>tok_count</span>[<span class=pl-s1>tid</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>unigram_probs</span>[<span class=pl-s1>tid</span>] <span class=pl-c1>=</span> <span class=pl-s1>cnt</span> <span class=pl-c1>/</span> <span class=pl-s1>total_toks</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>with</span> <span class=pl-s1>open</span>(<span class=pl-c1>&amp;lt;</span><span class=pl-s1>save</span> <span class=pl-s1>path</span><span class=pl-c1>&amp;gt;</span>, <span class=pl-s>&amp;#39;w&amp;#39;</span>) <span class=pl-k>as</span> <span class=pl-s1>json_file</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>json</span>.<span class=pl-c1>dump</span>(<span class=pl-s1>unigram_probs</span>, <span class=pl-s1>json_file</span>, <span class=pl-s1>indent</span><span class=pl-c1>=</span><span class=pl-c1>4</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># build infinigram index</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>data_dir</span> <span class=pl-c1>=</span> <span class=pl-c1>&amp;lt;</span><span class=pl-s1>path</span> <span class=pl-s1>to</span> <span class=pl-s1>data</span><span class=pl-c1>&amp;gt;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>save_dir</span> <span class=pl-c1>=</span> <span class=pl-c1>&amp;lt;</span><span class=pl-s1>save</span> <span class=pl-s1>index</span> <span class=pl-s1>here</span><span class=pl-c1>&amp;gt;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>temp_dir</span> <span class=pl-c1>=</span> <span class=pl-s1>tempfile</span>.<span class=pl-c1>TemporaryDirectory</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>command</span> <span class=pl-c1>=</span> (</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s>f&amp;quot;python -m infini_gram.indexing --data_dir <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>data_dir</span><span class=pl-kos>}</span></span> &amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s>f&amp;quot;--temp_dir <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>temp_dir</span>.<span class=pl-c1>name</span><span class=pl-kos>}</span></span> --save_dir <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>save_dir</span><span class=pl-kos>}</span></span> &amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s>f&amp;quot;--tokenizer llama --cpus 12 --mem 64  --shards 1 &amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s>f&amp;quot;--add_metadata --ulimit 100000 &amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s1>command</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>os</span>.<span class=pl-c1>system</span>(<span class=pl-s1>command</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace_index-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-olmo_trace_index-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>temp_dir</span>.<span class=pl-c1>cleanup</span>()</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/6120678a88bf52d7be524266c82c409a/raw/b3f5df3886bcbad62089db0f476ba02e6bdaa7c0/olmo_trace_index.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/6120678a88bf52d7be524266c82c409a#file-olmo_trace_index-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          olmo_trace_index.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-b1ee75c43dbe.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-b1ee75c43dbe.css"><div id="gist138083793" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-olmo_trace_index-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="olmo_trace_index.py">
        <tbody><tr>
          <td id="file-olmo_trace_index-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-olmo_trace_index-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">os</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-olmo_trace_index-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">json</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-olmo_trace_index-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">collections</span> <span class="pl-k">import</span> <span class="pl-v">Counter</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-olmo_trace_index-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">tempfile</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-olmo_trace_index-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-olmo_trace_index-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">transformers</span> <span class="pl-k">import</span> <span class="pl-v">AutoTokenizer</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-olmo_trace_index-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-olmo_trace_index-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># load tokenizer / data</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-olmo_trace_index-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">enc</span> <span class="pl-c1">=</span> <span class="pl-v">AutoTokenizer</span>.<span class="pl-c1">from_pretrained</span>(<span class="pl-s">"meta-llama/Llama-2-7b-hf"</span>, <span class="pl-s1">add_bos_token</span><span class="pl-c1">=</span><span class="pl-c1">False</span>, <span class="pl-s1">add_eos_token</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-olmo_trace_index-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">data_rows</span> <span class="pl-c1">=</span> [{<span class="pl-s">'text'</span>: <span class="pl-s">'here is some training data'</span>}, ...]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-olmo_trace_index-py-LC11" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-olmo_trace_index-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># compute / save unigram probabilities</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-olmo_trace_index-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">all_toks</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-olmo_trace_index-py-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">x</span> <span class="pl-c1">in</span> <span class="pl-s1">data_rows</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-olmo_trace_index-py-LC15" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">all_toks</span>.<span class="pl-c1">extend</span>(<span class="pl-s1">enc</span>.<span class="pl-c1">encode</span>(<span class="pl-s1">x</span>[<span class="pl-s">'text'</span>]))</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-olmo_trace_index-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">total_toks</span> <span class="pl-c1">=</span> <span class="pl-en">len</span>(<span class="pl-s1">all_toks</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-olmo_trace_index-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">tok_count</span> <span class="pl-c1">=</span> <span class="pl-en">Counter</span>(<span class="pl-s1">all_toks</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-olmo_trace_index-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">unigram_probs</span> <span class="pl-c1">=</span> {}</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-olmo_trace_index-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">tid</span> <span class="pl-c1">in</span> <span class="pl-s1">tok_count</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-olmo_trace_index-py-LC20" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">cnt</span> <span class="pl-c1">=</span> <span class="pl-s1">tok_count</span>[<span class="pl-s1">tid</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-olmo_trace_index-py-LC21" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">unigram_probs</span>[<span class="pl-s1">tid</span>] <span class="pl-c1">=</span> <span class="pl-s1">cnt</span> <span class="pl-c1">/</span> <span class="pl-s1">total_toks</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-olmo_trace_index-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-k">with</span> <span class="pl-s1">open</span>(<span class="pl-c1">&lt;</span><span class="pl-s1">save</span> <span class="pl-s1">path</span><span class="pl-c1">&gt;</span>, <span class="pl-s">'w'</span>) <span class="pl-k">as</span> <span class="pl-s1">json_file</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-olmo_trace_index-py-LC23" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">json</span>.<span class="pl-c1">dump</span>(<span class="pl-s1">unigram_probs</span>, <span class="pl-s1">json_file</span>, <span class="pl-s1">indent</span><span class="pl-c1">=</span><span class="pl-c1">4</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-olmo_trace_index-py-LC24" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-olmo_trace_index-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># build infinigram index</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-olmo_trace_index-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">data_dir</span> <span class="pl-c1">=</span> <span class="pl-c1">&lt;</span><span class="pl-s1">path</span> <span class="pl-s1">to</span> <span class="pl-s1">data</span><span class="pl-c1">&gt;</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-olmo_trace_index-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">save_dir</span> <span class="pl-c1">=</span> <span class="pl-c1">&lt;</span><span class="pl-s1">save</span> <span class="pl-s1">index</span> <span class="pl-s1">here</span><span class="pl-c1">&gt;</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-olmo_trace_index-py-LC28" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">temp_dir</span> <span class="pl-c1">=</span> <span class="pl-s1">tempfile</span>.<span class="pl-c1">TemporaryDirectory</span>()</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-olmo_trace_index-py-LC29" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">command</span> <span class="pl-c1">=</span> (</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-olmo_trace_index-py-LC30" class="blob-code blob-code-inner js-file-line">    <span class="pl-s">f"python -m infini_gram.indexing --data_dir <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">data_dir</span><span class="pl-kos">}</span></span> "</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-olmo_trace_index-py-LC31" class="blob-code blob-code-inner js-file-line">    <span class="pl-s">f"--temp_dir <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">temp_dir</span>.<span class="pl-c1">name</span><span class="pl-kos">}</span></span> --save_dir <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">save_dir</span><span class="pl-kos">}</span></span> "</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-olmo_trace_index-py-LC32" class="blob-code blob-code-inner js-file-line">    <span class="pl-s">f"--tokenizer llama --cpus 12 --mem 64  --shards 1 "</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-olmo_trace_index-py-LC33" class="blob-code blob-code-inner js-file-line">    <span class="pl-s">f"--add_metadata --ulimit 100000 "</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-olmo_trace_index-py-LC34" class="blob-code blob-code-inner js-file-line">)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-olmo_trace_index-py-LC35" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s1">command</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-olmo_trace_index-py-LC36" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">os</span>.<span class="pl-c1">system</span>(<span class="pl-s1">command</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace_index-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-olmo_trace_index-py-LC37" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">temp_dir</span>.<span class="pl-c1">cleanup</span>()</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/6120678a88bf52d7be524266c82c409a/raw/b3f5df3886bcbad62089db0f476ba02e6bdaa7c0/olmo_trace_index.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/6120678a88bf52d7be524266c82c409a#file-olmo_trace_index-py" class="Link--inTextBlock">
          olmo_trace_index.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>Now that the infini-gram index has been built, we can trace a sequence of text over our training dataset&#8212;<em>following the algorithm proposed by OLMoTrace in [2]</em>&#8212;as shown in the code below. This code returns both a set of spans and their associated documents with metadata from the training corpus.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist138084024\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-olmo_trace-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;olmo_trace.py content, created by wolfecameron on 04:50AM today.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-hidden-unicode js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;olmo_trace.py\&quot;>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>ast</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>math</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>random</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>infini_gram</span>.<span class=pl-s1>engine</span> <span class=pl-k>import</span> <span class=pl-v>InfiniGramEngine</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>transformers</span> <span class=pl-k>import</span> <span class=pl-v>AutoTokenizer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>def</span> <span class=pl-en>compute_longest_prefix</span>(<span class=pl-s1>query</span>, <span class=pl-s1>doc</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;helper function for computing longest prefix of query that exists</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>    within a document&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>shared_prefix_length</span>(<span class=pl-s1>list1</span>, <span class=pl-s1>list2</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>prefix_length</span> <span class=pl-c1>=</span> <span class=pl-c1>0</span>    </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>elem1</span>, <span class=pl-s1>elem2</span> <span class=pl-c1>in</span> <span class=pl-en>zip</span>(<span class=pl-s1>list1</span>, <span class=pl-s1>list2</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>if</span> <span class=pl-s1>elem1</span> <span class=pl-c1>==</span> <span class=pl-s1>elem2</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>prefix_length</span> <span class=pl-c1>+=</span> <span class=pl-c1>1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>else</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-k>break</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>prefix_length</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>first_id</span> <span class=pl-c1>=</span> <span class=pl-s1>query</span>[<span class=pl-c1>0</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>start_idx</span> <span class=pl-c1>=</span> [<span class=pl-s1>index</span> <span class=pl-k>for</span> <span class=pl-s1>index</span>, <span class=pl-s1>value</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>doc</span>) <span class=pl-k>if</span> <span class=pl-s1>value</span> <span class=pl-c1>==</span> <span class=pl-s1>first_id</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>longest_prefix</span> <span class=pl-c1>=</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>si</span> <span class=pl-c1>in</span> <span class=pl-s1>start_idx</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>longest_prefix</span> <span class=pl-c1>=</span> <span class=pl-en>max</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>longest_prefix</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-en>shared_prefix_length</span>(<span class=pl-s1>query</span>, <span class=pl-s1>doc</span>[<span class=pl-s1>si</span>:]),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>return</span> <span class=pl-s1>longest_prefix</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># setup</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>enc</span> <span class=pl-c1>=</span> <span class=pl-v>AutoTokenizer</span>.<span class=pl-c1>from_pretrained</span>(<span class=pl-s>&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;</span>, <span class=pl-s1>add_bos_token</span><span class=pl-c1>=</span><span class=pl-c1>False</span>, <span class=pl-s1>add_eos_token</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>engine</span> <span class=pl-c1>=</span> <span class=pl-en>InfiniGramEngine</span>(<span class=pl-s1>index_dir</span><span class=pl-c1>=</span><span class=pl-c1>&amp;lt;</span><span class=pl-s1>path</span> <span class=pl-s1>to</span> <span class=pl-s1>index</span><span class=pl-c1>&amp;gt;</span>, <span class=pl-s1>eos_token_id</span><span class=pl-c1>=</span><span class=pl-s1>enc</span>.<span class=pl-c1>eos_token_id</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>unigram_probs</span> <span class=pl-c1>=</span> {<span class=pl-c1>1</span>: <span class=pl-c1>0.5</span>, <span class=pl-c1>2</span>: <span class=pl-c1>0.5</span>} <span class=pl-c># load pre-computed probabilities</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># LLM output / query to search</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>generation</span> <span class=pl-c1>=</span> <span class=pl-s>&amp;#39;Here is the output of the LLM that we want to search for in our data.&amp;#39;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>gen_ids</span> <span class=pl-c1>=</span> <span class=pl-s1>enc</span>.<span class=pl-c1>encode</span>(<span class=pl-s1>generation</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Step One: find maximal matching spans</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>L</span> <span class=pl-c1>=</span> <span class=pl-en>len</span>(<span class=pl-s1>gen_ids</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>max_doc_toks</span> <span class=pl-c1>=</span> <span class=pl-en>len</span>(<span class=pl-s1>gen_ids</span>) <span class=pl-c1>*</span> <span class=pl-c1>2</span>  <span class=pl-c># size of spans to retrieve in documents</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># find longest prefix match for every suffix in the query</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>spans</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>start</span> <span class=pl-c1>in</span> <span class=pl-en>range</span>(<span class=pl-en>len</span>(<span class=pl-s1>gen_ids</span>) <span class=pl-c1>-</span> <span class=pl-c1>1</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>_suffix</span> <span class=pl-c1>=</span> <span class=pl-s1>gen_ids</span>[<span class=pl-s1>start</span>:]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>_suff_res</span> <span class=pl-c1>=</span> <span class=pl-s1>engine</span>.<span class=pl-c1>find</span>(<span class=pl-s1>input_ids</span><span class=pl-c1>=</span><span class=pl-s1>_suffix</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># if no match, get the longest matching prefix using find result</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>_suff_res</span>[<span class=pl-s>&amp;#39;cnt&amp;#39;</span>] <span class=pl-c1>==</span> <span class=pl-c1>0</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_shards</span> <span class=pl-c1>=</span> <span class=pl-s1>_suff_res</span>[<span class=pl-s>&amp;#39;segment_by_shard&amp;#39;</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-en>len</span>(<span class=pl-s1>_shards</span>) <span class=pl-c1>==</span> <span class=pl-c1>1</span>  <span class=pl-c># assume only one shard</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L57\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_doc_ids</span> <span class=pl-c1>=</span> <span class=pl-s1>engine</span>.<span class=pl-c1>get_doc_by_rank</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>s</span><span class=pl-c1>=</span><span class=pl-c1>0</span>,  <span class=pl-c># assume only one shard</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>rank</span><span class=pl-c1>=</span><span class=pl-s1>_shards</span>[<span class=pl-c1>0</span>][<span class=pl-c1>0</span>],</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>max_disp_len</span><span class=pl-c1>=</span><span class=pl-s1>max_doc_toks</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )[<span class=pl-s>&amp;#39;token_ids&amp;#39;</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L62\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;62\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC62\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>matched_toks</span> <span class=pl-c1>=</span> <span class=pl-en>compute_longest_prefix</span>(<span class=pl-s1>_suffix</span>, <span class=pl-s1>_doc_ids</span>)  <span class=pl-c># get longest matching prefix</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L63\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;63\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC63\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>elif</span> <span class=pl-s1>_suff_res</span>[<span class=pl-s>&amp;#39;cnt&amp;#39;</span>] <span class=pl-c1>&amp;gt;</span> <span class=pl-c1>0</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L64\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;64\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC64\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>matched_toks</span> <span class=pl-c1>=</span> <span class=pl-en>len</span>(<span class=pl-s1>_suffix</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L65\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;65\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC65\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>spans</span>.<span class=pl-c1>append</span>((<span class=pl-s1>start</span>, <span class=pl-s1>start</span> <span class=pl-c1>+</span> <span class=pl-s1>matched_toks</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L66\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;66\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC66\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L67\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;67\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC67\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># remove partial and non-self-contained spans</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L68\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;68\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC68\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>full_spans</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L69\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;69\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC69\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>start</span>, <span class=pl-s1>end</span> <span class=pl-c1>in</span> <span class=pl-s1>spans</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L70\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;70\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC70\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>span_ids</span> <span class=pl-c1>=</span> <span class=pl-s1>gen_ids</span>[<span class=pl-s1>start</span>: <span class=pl-s1>end</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L71\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;71\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC71\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>span_text</span> <span class=pl-c1>=</span> <span class=pl-s1>enc</span>.<span class=pl-c1>decode</span>(<span class=pl-s1>span_ids</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L72\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;72\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC72\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L73\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;73\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC73\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># check for internal punctuation</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L74\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;74\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC74\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>has_internal_punc</span> <span class=pl-c1>=</span> <span class=pl-c1>False</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L75\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;75\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC75\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>punc_chars</span> <span class=pl-c1>=</span> <span class=pl-s>&amp;quot;!.?<span class=pl-cce>\\n</span>&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L76\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;76\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC76\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>ch</span> <span class=pl-c1>in</span> <span class=pl-s1>span_text</span>[:<span class=pl-c1>-</span><span class=pl-c1>1</span>]:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L77\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;77\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC77\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>ch</span> <span class=pl-c1>in</span> <span class=pl-s1>punc_chars</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L78\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;78\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC78\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>has_internal_punc</span> <span class=pl-c1>=</span> <span class=pl-c1>True</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L79\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;79\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC79\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>break</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L80\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;80\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC80\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>has_internal_punc</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L81\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;81\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC81\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>continue</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L82\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;82\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC82\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L83\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;83\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC83\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># check if first token is a continuation of a word</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L84\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;84\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC84\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>first_tok_id</span> <span class=pl-c1>=</span> <span class=pl-s1>span_ids</span>[<span class=pl-c1>0</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L85\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;85\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC85\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>first_tok</span> <span class=pl-c1>=</span> <span class=pl-s1>enc</span>.<span class=pl-c1>convert_ids_to_tokens</span>(<span class=pl-s1>first_tok_id</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L86\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;86\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC86\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>first_tok</span>[<span class=pl-c1>0</span>] <span class=pl-c1>!=</span> <span class=pl-s>&amp;#39;&#9601;&amp;#39;</span>:  <span class=pl-c># assumes Llama 2 token format</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L87\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;87\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC87\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>continue</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L88\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;88\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC88\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L89\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;89\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC89\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># no sub-token follows the last token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L90\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;90\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC90\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>end</span> <span class=pl-c1>&amp;lt;</span> <span class=pl-en>len</span>(<span class=pl-s1>gen_ids</span>) <span class=pl-c1>and</span> <span class=pl-s1>tokenizer</span>.<span class=pl-c1>convert_ids_to_tokens</span>(<span class=pl-s1>gen_ids</span>[<span class=pl-s1>end</span>])[<span class=pl-c1>0</span>] <span class=pl-c1>!=</span> <span class=pl-s>&amp;quot;&#9601;&amp;quot;</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L91\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;91\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC91\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>continue</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L92\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;92\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC92\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>full_spans</span>.<span class=pl-c1>append</span>((<span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>span_ids</span>, <span class=pl-s1>span_text</span>))    </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L93\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;93\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC93\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L94\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;94\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC94\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># remove non-maximal spans</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L95\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;95\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC95\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>maximal_spans</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L96\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;96\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC96\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>max_end_pos</span> <span class=pl-c1>=</span> <span class=pl-c1>-</span><span class=pl-c1>1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L97\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;97\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC97\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>full_spans</span> <span class=pl-c1>=</span> <span class=pl-en>sorted</span>(<span class=pl-s1>full_spans</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L98\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;98\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC98\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>ids</span>, <span class=pl-s1>text</span> <span class=pl-c1>in</span> <span class=pl-s1>full_spans</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L99\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;99\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC99\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>end</span> <span class=pl-c1>&amp;gt;</span> <span class=pl-s1>max_end_pos</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L100\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;100\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC100\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>maximal_spans</span>.<span class=pl-c1>append</span>((<span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>ids</span>, <span class=pl-s1>text</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L101\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;101\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC101\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>max_end_pos</span> <span class=pl-c1>=</span> <span class=pl-s1>end</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L102\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;102\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC102\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L103\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;103\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC103\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L104\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;104\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC104\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L105\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;105\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC105\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Step Two: filter to keep long / unique spans</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L106\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;106\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC106\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L107\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;107\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC107\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>K</span> <span class=pl-c1>=</span> <span class=pl-s1>math</span>.<span class=pl-c1>ceil</span>(<span class=pl-c1>0.05</span> <span class=pl-c1>*</span> <span class=pl-c1>L</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L108\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;108\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC108\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>assert</span> <span class=pl-c1>K</span> <span class=pl-c1>&amp;gt;</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L109\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;109\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC109\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>filt_spans</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L110\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;110\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC110\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>ids</span>, <span class=pl-s1>text</span> <span class=pl-c1>in</span> <span class=pl-s1>maximal_spans</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L111\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;111\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC111\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>span_uni_prob</span> <span class=pl-c1>=</span> [<span class=pl-s1>unigram_probs</span>.<span class=pl-c1>get</span>(<span class=pl-s1>_id</span>) <span class=pl-k>for</span> <span class=pl-s1>_id</span> <span class=pl-c1>in</span> <span class=pl-s1>ids</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L112\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;112\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC112\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>span_uni_prob</span> <span class=pl-c1>=</span> <span class=pl-s1>math</span>.<span class=pl-c1>prod</span>(<span class=pl-s1>span_uni_prob</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L113\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;113\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC113\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>filt_spans</span>.<span class=pl-c1>append</span>((<span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>ids</span>, <span class=pl-s1>text</span>, <span class=pl-s1>span_uni_prob</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L114\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;114\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC114\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>filt_spans</span> <span class=pl-c1>=</span> <span class=pl-en>sorted</span>(<span class=pl-s1>filt_spans</span>, <span class=pl-s1>key</span><span class=pl-c1>=</span><span class=pl-k>lambda</span> <span class=pl-s1>x</span>: <span class=pl-s1>x</span>[<span class=pl-c1>-</span><span class=pl-c1>1</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L115\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;115\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC115\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>filt_spans</span> <span class=pl-c1>=</span> <span class=pl-s1>filt_spans</span>[:<span class=pl-c1>K</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L116\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;116\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC116\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>filt_spans</span> <span class=pl-c1>=</span> <span class=pl-en>sorted</span>(<span class=pl-s1>filt_spans</span>)  <span class=pl-c># sort based on start position again</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L117\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;117\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC117\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L118\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;118\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC118\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L119\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;119\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC119\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L120\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;120\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC120\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Step Three: retrieve Enclosing Docs</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L121\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;121\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC121\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L122\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;122\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC122\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>docs_per_span</span> <span class=pl-c1>=</span> <span class=pl-c1>10</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L123\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;123\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC123\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>span_to_docs</span> <span class=pl-c1>=</span> <span class=pl-en>defaultdict</span>(<span class=pl-s1>list</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L124\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;124\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC124\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>i</span>, (<span class=pl-s1>start</span>, <span class=pl-s1>end</span>, <span class=pl-s1>ids</span>, <span class=pl-s1>text</span>, <span class=pl-s1>uni_prob</span>) <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>filt_spans</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L125\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;125\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC125\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># run retrieval in infinigram index to get documents</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L126\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;126\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC126\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>span_res</span> <span class=pl-c1>=</span> <span class=pl-s1>engine</span>.<span class=pl-c1>find</span>(<span class=pl-s1>input_ids</span><span class=pl-c1>=</span><span class=pl-s1>ids</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L127\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;127\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC127\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>assert</span> <span class=pl-s1>span_res</span>[<span class=pl-s>&amp;#39;cnt&amp;#39;</span>] <span class=pl-c1>&amp;gt;</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L128\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;128\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC128\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>assert</span> <span class=pl-en>len</span>(<span class=pl-s1>span_res</span>[<span class=pl-s>&amp;#39;segment_by_shard&amp;#39;</span>]) <span class=pl-c1>==</span> <span class=pl-c1>1</span>  <span class=pl-c># assume only one shard</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L129\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;129\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC129\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L130\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;130\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC130\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>rank_start</span>, <span class=pl-s1>rank_end</span> <span class=pl-c1>=</span> <span class=pl-s1>span_res</span>[<span class=pl-s>&amp;#39;segment_by_shard&amp;#39;</span>][<span class=pl-c1>0</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L131\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;131\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC131\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>ranks</span> <span class=pl-c1>=</span> [<span class=pl-s1>r</span> <span class=pl-k>for</span> <span class=pl-s1>r</span> <span class=pl-c1>in</span> <span class=pl-en>range</span>(<span class=pl-s1>rank_start</span>, <span class=pl-s1>rank_end</span>)]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L132\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;132\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC132\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-en>len</span>(<span class=pl-s1>ranks</span>) <span class=pl-c1>&amp;gt;</span> <span class=pl-s1>docs_per_span</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L133\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;133\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC133\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># retrieve fixed number of documents for each span</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L134\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;134\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC134\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>ranks</span> <span class=pl-c1>=</span> <span class=pl-en>sorted</span>(<span class=pl-s1>random</span>.<span class=pl-c1>sample</span>(<span class=pl-s1>ranks</span>, <span class=pl-s1>docs_per_span</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L135\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;135\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC135\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L136\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;136\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC136\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-c># NOTE: we can instead rank documents by BM25 score here!</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L137\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;137\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC137\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>r</span> <span class=pl-c1>in</span> <span class=pl-s1>ranks</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L138\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;138\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC138\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_doc</span> <span class=pl-c1>=</span> <span class=pl-s1>engine</span>.<span class=pl-c1>get_doc_by_rank</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L139\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;139\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC139\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>s</span><span class=pl-c1>=</span><span class=pl-c1>0</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L140\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;140\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC140\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>rank</span><span class=pl-c1>=</span><span class=pl-s1>r</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L141\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;141\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC141\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>max_disp_len</span><span class=pl-c1>=</span><span class=pl-s1>max_doc_toks</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L142\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;142\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC142\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L143\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;143\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC143\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_doc_meta</span> <span class=pl-c1>=</span> <span class=pl-s1>ast</span>.<span class=pl-c1>literal_eval</span>(<span class=pl-s1>_doc</span>[<span class=pl-s>&amp;#39;metadata&amp;#39;</span>])[<span class=pl-s>&amp;#39;metadata&amp;#39;</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L144\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;144\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC144\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_doc_text</span> <span class=pl-c1>=</span> <span class=pl-s1>enc</span>.<span class=pl-c1>decode</span>(<span class=pl-s1>_doc</span>[<span class=pl-s>&amp;#39;token_ids&amp;#39;</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L145\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;145\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC145\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_doc_data</span> <span class=pl-c1>=</span> {</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L146\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;146\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC146\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s>&amp;quot;text&amp;quot;</span>: <span class=pl-s1>_doc_text</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L147\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;147\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC147\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c1>**</span><span class=pl-s1>_doc_meta</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L148\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;148\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC148\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        }</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L149\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;149\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC149\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>span_to_docs</span>[<span class=pl-s1>i</span>].<span class=pl-c1>append</span>(<span class=pl-s1>_doc_data</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L150\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;150\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC150\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L151\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;151\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC151\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L152\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;152\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC152\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L153\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;153\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC153\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Step Four: merge overlapping spans</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L154\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;154\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC154\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L155\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;155\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC155\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># get indices of spans to merge together</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L156\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;156\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC156\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>merged_spans</span> <span class=pl-c1>=</span> [[<span class=pl-c1>0</span>]]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L157\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;157\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC157\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>curr_idx</span> <span class=pl-c1>=</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L158\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;158\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC158\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>curr_start</span> <span class=pl-c1>=</span> <span class=pl-s1>filt_spans</span>[<span class=pl-c1>0</span>][<span class=pl-c1>0</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L159\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;159\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC159\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>curr_end</span> <span class=pl-c1>=</span> <span class=pl-s1>filt_spans</span>[<span class=pl-c1>0</span>][<span class=pl-c1>1</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L160\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;160\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC160\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>i</span>, <span class=pl-s1>next_span</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>filt_spans</span>[<span class=pl-c1>1</span>:]):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L161\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;161\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC161\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>start</span> <span class=pl-c1>=</span> <span class=pl-s1>next_span</span>[<span class=pl-c1>0</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L162\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;162\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC162\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>end</span> <span class=pl-c1>=</span> <span class=pl-s1>next_span</span>[<span class=pl-c1>1</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L163\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;163\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC163\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>if</span> <span class=pl-s1>start</span> <span class=pl-c1>&amp;lt;</span> <span class=pl-s1>curr_end</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L164\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;164\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC164\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>curr_end</span> <span class=pl-c1>=</span> <span class=pl-en>max</span>(<span class=pl-s1>curr_end</span>, <span class=pl-s1>end</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L165\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;165\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC165\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>merged_spans</span>[<span class=pl-s1>curr_idx</span>].<span class=pl-c1>append</span>(<span class=pl-s1>i</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L166\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;166\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC166\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>else</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L167\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;167\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC167\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>curr_start</span>, <span class=pl-s1>curr_end</span> <span class=pl-c1>=</span> <span class=pl-s1>start</span>, <span class=pl-s1>end</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L168\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;168\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC168\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>curr_idx</span> <span class=pl-c1>+=</span> <span class=pl-c1>1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L169\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;169\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC169\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>merged_spans</span>.<span class=pl-c1>append</span>([<span class=pl-s1>i</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L170\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;170\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC170\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-en>len</span>(<span class=pl-s1>merged_spans</span>) <span class=pl-c1>==</span> <span class=pl-s1>curr_idx</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L171\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;171\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC171\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L172\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;172\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC172\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># merge spans into a final set</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L173\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;173\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC173\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>final_spans</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L174\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;174\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC174\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>ms</span> <span class=pl-c1>in</span> <span class=pl-s1>merged_spans</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L175\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;175\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC175\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>all_docs</span> <span class=pl-c1>=</span> []</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L176\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;176\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC176\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>docs_per_merged_span</span> <span class=pl-c1>=</span> <span class=pl-s1>math</span>.<span class=pl-c1>ceil</span>(<span class=pl-s1>docs_per_span</span> <span class=pl-c1>/</span> <span class=pl-en>float</span>(<span class=pl-en>len</span>(<span class=pl-s1>ms</span>)))  <span class=pl-c># subsample docs for spans being merged</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L177\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;177\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC177\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>i</span> <span class=pl-c1>in</span> <span class=pl-s1>ms</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L178\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;178\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC178\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># take top docs from each span being merged</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L179\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;179\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC179\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>all_docs</span>.<span class=pl-c1>extend</span>(<span class=pl-s1>span_to_docs</span>[<span class=pl-s1>i</span>][:<span class=pl-s1>docs_per_merged_span</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L180\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;180\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC180\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>_spans</span> <span class=pl-c1>=</span> [<span class=pl-s1>filt_spans</span>[<span class=pl-s1>i</span>] <span class=pl-k>for</span> <span class=pl-s1>i</span> <span class=pl-c1>in</span> <span class=pl-s1>ms</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L181\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;181\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC181\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>start</span> <span class=pl-c1>=</span> <span class=pl-en>min</span>([<span class=pl-s1>x</span>[<span class=pl-c1>0</span>] <span class=pl-k>for</span> <span class=pl-s1>x</span> <span class=pl-c1>in</span> <span class=pl-s1>_spans</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L182\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;182\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC182\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>end</span> <span class=pl-c1>=</span> <span class=pl-en>max</span>([<span class=pl-s1>x</span>[<span class=pl-c1>1</span>] <span class=pl-k>for</span> <span class=pl-s1>x</span> <span class=pl-c1>in</span> <span class=pl-s1>_spans</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L183\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;183\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC183\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>text</span> <span class=pl-c1>=</span> <span class=pl-s1>enc</span>.<span class=pl-c1>decode</span>(<span class=pl-s1>gen_ids</span>[<span class=pl-s1>start</span>: <span class=pl-s1>end</span>])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L184\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;184\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC184\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>final_spans</span>.<span class=pl-c1>append</span>({</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L185\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;185\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC185\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;start&amp;quot;</span>: <span class=pl-s1>start</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L186\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;186\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC186\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;end&amp;quot;</span>: <span class=pl-s1>end</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L187\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;187\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC187\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;text&amp;quot;</span>: <span class=pl-s1>text</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L188\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;188\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC188\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;docs&amp;quot;</span>: <span class=pl-s1>all_docs</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L189\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;189\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC189\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    })</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L190\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;190\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC190\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L191\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;191\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC191\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L192\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;192\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC192\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L193\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;193\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC193\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Step Five: observe tracing results</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L194\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;194\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC194\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L195\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;195\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC195\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>docs_to_print</span> <span class=pl-c1>=</span> <span class=pl-c1>5</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L196\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;196\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC196\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s>f&amp;#39;Query Text: <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>enc</span>.<span class=pl-c1>decode</span>(<span class=pl-s1>gen_ids</span>)<span class=pl-kos>}</span></span>&amp;#39;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L197\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;197\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC197\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>for</span> <span class=pl-s1>i</span>, <span class=pl-s1>sp</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>final_spans</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L198\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;198\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC198\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-en>print</span>(<span class=pl-s>&amp;quot;<span class=pl-cce>\\n</span>&amp;quot;</span> <span class=pl-c1>+</span> <span class=pl-s>&amp;quot;=&amp;quot;</span><span class=pl-c1>*</span><span class=pl-c1>20</span> <span class=pl-c1>+</span> <span class=pl-s>f&amp;quot; SPAN <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>i</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span><span class=pl-kos>}</span></span> / <span class=pl-s1><span class=pl-kos>{</span><span class=pl-en>len</span>(<span class=pl-s1>final_spans</span>)<span class=pl-kos>}</span></span> &amp;quot;</span> <span class=pl-c1>+</span> <span class=pl-s>&amp;quot;=&amp;quot;</span><span class=pl-c1>*</span><span class=pl-c1>20</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L199\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;199\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC199\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-en>print</span>(<span class=pl-s>f&amp;quot;Span Text: <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>sp</span>[<span class=pl-s>&amp;#39;text&amp;#39;</span>]<span class=pl-kos>}</span></span><span class=pl-cce>\\n</span>&amp;quot;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L200\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;200\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC200\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>for</span> <span class=pl-s1>j</span>, <span class=pl-s1>doc</span> <span class=pl-c1>in</span> <span class=pl-en>enumerate</span>(<span class=pl-s1>sp</span>[<span class=pl-s>&amp;#39;docs&amp;#39;</span>]):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L201\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;201\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC201\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>print</span>(<span class=pl-s>&amp;quot;-&amp;quot;</span><span class=pl-c1>*</span><span class=pl-c1>10</span> <span class=pl-c1>+</span> <span class=pl-s>f&amp;quot; Document <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>j</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span><span class=pl-kos>}</span></span> / <span class=pl-s1><span class=pl-kos>{</span><span class=pl-en>len</span>(<span class=pl-s1>sp</span>[<span class=pl-s>&amp;#39;docs&amp;#39;</span>])<span class=pl-kos>}</span></span> &amp;quot;</span> <span class=pl-c1>+</span> <span class=pl-s>&amp;quot;-&amp;quot;</span><span class=pl-c1>*</span><span class=pl-c1>10</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L202\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;202\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC202\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>k</span> <span class=pl-c1>in</span> [<span class=pl-s>&amp;#39;text&amp;#39;</span>, <span class=pl-s>&amp;#39;movie_id&amp;#39;</span>, <span class=pl-s>&amp;#39;src_lang&amp;#39;</span>, <span class=pl-s>&amp;#39;start_frame&amp;#39;</span>, <span class=pl-s>&amp;#39;end_frame&amp;#39;</span>]:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L203\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;203\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC203\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>if</span> <span class=pl-s1>k</span> <span class=pl-c1>==</span> <span class=pl-s>&amp;#39;text&amp;#39;</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L204\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;204\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC204\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>v</span> <span class=pl-c1>=</span> <span class=pl-s1>doc</span>[<span class=pl-s1>k</span>].<span class=pl-c1>replace</span>(<span class=pl-s>&amp;#39;<span class=pl-cce>\\n</span>&amp;#39;</span>, <span class=pl-s>&amp;#39; &amp;#39;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L205\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;205\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC205\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-k>else</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L206\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;206\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC206\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>v</span> <span class=pl-c1>=</span> <span class=pl-s1>doc</span>[<span class=pl-s1>k</span>]</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-olmo_trace-py-L207\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;207\&quot;></td>\n          <td id=\&quot;file-olmo_trace-py-LC207\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-en>print</span>(<span class=pl-s>f&amp;quot;- <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>k</span><span class=pl-kos>}</span></span> --&amp;gt; <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>v</span><span class=pl-kos>}</span></span>&amp;quot;</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/306aa72a0c5095db460e2ccea9b06777/raw/e1040a0e8198f9d82bbe20bcc7246416ed80bb0f/olmo_trace.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/306aa72a0c5095db460e2ccea9b06777#file-olmo_trace-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          olmo_trace.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-b1ee75c43dbe.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-b1ee75c43dbe.css"><div id="gist138084024" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-olmo_trace-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-hidden-unicode js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="olmo_trace.py">
        <tbody><tr>
          <td id="file-olmo_trace-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-olmo_trace-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">ast</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-olmo_trace-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">math</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-olmo_trace-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">random</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-olmo_trace-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-olmo_trace-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">infini_gram</span>.<span class="pl-s1">engine</span> <span class="pl-k">import</span> <span class="pl-v">InfiniGramEngine</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-olmo_trace-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">transformers</span> <span class="pl-k">import</span> <span class="pl-v">AutoTokenizer</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-olmo_trace-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-olmo_trace-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-k">def</span> <span class="pl-en">compute_longest_prefix</span>(<span class="pl-s1">query</span>, <span class="pl-s1">doc</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-olmo_trace-py-LC9" class="blob-code blob-code-inner js-file-line">    <span class="pl-s">"""helper function for computing longest prefix of query that exists</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-olmo_trace-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-s">    within a document"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-olmo_trace-py-LC11" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-olmo_trace-py-LC12" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">shared_prefix_length</span>(<span class="pl-s1">list1</span>, <span class="pl-s1">list2</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-olmo_trace-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">prefix_length</span> <span class="pl-c1">=</span> <span class="pl-c1">0</span>    </td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-olmo_trace-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">elem1</span>, <span class="pl-s1">elem2</span> <span class="pl-c1">in</span> <span class="pl-en">zip</span>(<span class="pl-s1">list1</span>, <span class="pl-s1">list2</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-olmo_trace-py-LC15" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">if</span> <span class="pl-s1">elem1</span> <span class="pl-c1">==</span> <span class="pl-s1">elem2</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-olmo_trace-py-LC16" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">prefix_length</span> <span class="pl-c1">+=</span> <span class="pl-c1">1</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-olmo_trace-py-LC17" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">else</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-olmo_trace-py-LC18" class="blob-code blob-code-inner js-file-line">                <span class="pl-k">break</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-olmo_trace-py-LC19" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">prefix_length</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-olmo_trace-py-LC20" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-olmo_trace-py-LC21" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">first_id</span> <span class="pl-c1">=</span> <span class="pl-s1">query</span>[<span class="pl-c1">0</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-olmo_trace-py-LC22" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">start_idx</span> <span class="pl-c1">=</span> [<span class="pl-s1">index</span> <span class="pl-k">for</span> <span class="pl-s1">index</span>, <span class="pl-s1">value</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">doc</span>) <span class="pl-k">if</span> <span class="pl-s1">value</span> <span class="pl-c1">==</span> <span class="pl-s1">first_id</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-olmo_trace-py-LC23" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">longest_prefix</span> <span class="pl-c1">=</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-olmo_trace-py-LC24" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">si</span> <span class="pl-c1">in</span> <span class="pl-s1">start_idx</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-olmo_trace-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">longest_prefix</span> <span class="pl-c1">=</span> <span class="pl-en">max</span>(</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-olmo_trace-py-LC26" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">longest_prefix</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-olmo_trace-py-LC27" class="blob-code blob-code-inner js-file-line">            <span class="pl-en">shared_prefix_length</span>(<span class="pl-s1">query</span>, <span class="pl-s1">doc</span>[<span class="pl-s1">si</span>:]),</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-olmo_trace-py-LC28" class="blob-code blob-code-inner js-file-line">        )</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-olmo_trace-py-LC29" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">return</span> <span class="pl-s1">longest_prefix</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-olmo_trace-py-LC30" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-olmo_trace-py-LC31" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># setup</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-olmo_trace-py-LC32" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">enc</span> <span class="pl-c1">=</span> <span class="pl-v">AutoTokenizer</span>.<span class="pl-c1">from_pretrained</span>(<span class="pl-s">"meta-llama/Llama-2-7b-hf"</span>, <span class="pl-s1">add_bos_token</span><span class="pl-c1">=</span><span class="pl-c1">False</span>, <span class="pl-s1">add_eos_token</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-olmo_trace-py-LC33" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">engine</span> <span class="pl-c1">=</span> <span class="pl-en">InfiniGramEngine</span>(<span class="pl-s1">index_dir</span><span class="pl-c1">=</span><span class="pl-c1">&lt;</span><span class="pl-s1">path</span> <span class="pl-s1">to</span> <span class="pl-s1">index</span><span class="pl-c1">&gt;</span>, <span class="pl-s1">eos_token_id</span><span class="pl-c1">=</span><span class="pl-s1">enc</span>.<span class="pl-c1">eos_token_id</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-olmo_trace-py-LC34" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">unigram_probs</span> <span class="pl-c1">=</span> {<span class="pl-c1">1</span>: <span class="pl-c1">0.5</span>, <span class="pl-c1">2</span>: <span class="pl-c1">0.5</span>} <span class="pl-c"># load pre-computed probabilities</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-olmo_trace-py-LC35" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-olmo_trace-py-LC36" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># LLM output / query to search</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-olmo_trace-py-LC37" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">generation</span> <span class="pl-c1">=</span> <span class="pl-s">'Here is the output of the LLM that we want to search for in our data.'</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-olmo_trace-py-LC38" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">gen_ids</span> <span class="pl-c1">=</span> <span class="pl-s1">enc</span>.<span class="pl-c1">encode</span>(<span class="pl-s1">generation</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-olmo_trace-py-LC39" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-olmo_trace-py-LC40" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-olmo_trace-py-LC41" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-olmo_trace-py-LC42" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Step One: find maximal matching spans</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-olmo_trace-py-LC43" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-olmo_trace-py-LC44" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">L</span> <span class="pl-c1">=</span> <span class="pl-en">len</span>(<span class="pl-s1">gen_ids</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-olmo_trace-py-LC45" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">max_doc_toks</span> <span class="pl-c1">=</span> <span class="pl-en">len</span>(<span class="pl-s1">gen_ids</span>) <span class="pl-c1">*</span> <span class="pl-c1">2</span>  <span class="pl-c"># size of spans to retrieve in documents</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-olmo_trace-py-LC46" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-olmo_trace-py-LC47" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># find longest prefix match for every suffix in the query</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-olmo_trace-py-LC48" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">spans</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-olmo_trace-py-LC49" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">start</span> <span class="pl-c1">in</span> <span class="pl-en">range</span>(<span class="pl-en">len</span>(<span class="pl-s1">gen_ids</span>) <span class="pl-c1">-</span> <span class="pl-c1">1</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-olmo_trace-py-LC50" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">_suffix</span> <span class="pl-c1">=</span> <span class="pl-s1">gen_ids</span>[<span class="pl-s1">start</span>:]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-olmo_trace-py-LC51" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">_suff_res</span> <span class="pl-c1">=</span> <span class="pl-s1">engine</span>.<span class="pl-c1">find</span>(<span class="pl-s1">input_ids</span><span class="pl-c1">=</span><span class="pl-s1">_suffix</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-olmo_trace-py-LC52" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-olmo_trace-py-LC53" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># if no match, get the longest matching prefix using find result</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-olmo_trace-py-LC54" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">_suff_res</span>[<span class="pl-s">'cnt'</span>] <span class="pl-c1">==</span> <span class="pl-c1">0</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-olmo_trace-py-LC55" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_shards</span> <span class="pl-c1">=</span> <span class="pl-s1">_suff_res</span>[<span class="pl-s">'segment_by_shard'</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-olmo_trace-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-en">len</span>(<span class="pl-s1">_shards</span>) <span class="pl-c1">==</span> <span class="pl-c1">1</span>  <span class="pl-c"># assume only one shard</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-olmo_trace-py-LC57" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_doc_ids</span> <span class="pl-c1">=</span> <span class="pl-s1">engine</span>.<span class="pl-c1">get_doc_by_rank</span>(</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-olmo_trace-py-LC58" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">s</span><span class="pl-c1">=</span><span class="pl-c1">0</span>,  <span class="pl-c"># assume only one shard</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-olmo_trace-py-LC59" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">rank</span><span class="pl-c1">=</span><span class="pl-s1">_shards</span>[<span class="pl-c1">0</span>][<span class="pl-c1">0</span>],</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-olmo_trace-py-LC60" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">max_disp_len</span><span class="pl-c1">=</span><span class="pl-s1">max_doc_toks</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-olmo_trace-py-LC61" class="blob-code blob-code-inner js-file-line">        )[<span class="pl-s">'token_ids'</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L62" class="blob-num js-line-number js-blob-rnum" data-line-number="62"></td>
          <td id="file-olmo_trace-py-LC62" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">matched_toks</span> <span class="pl-c1">=</span> <span class="pl-en">compute_longest_prefix</span>(<span class="pl-s1">_suffix</span>, <span class="pl-s1">_doc_ids</span>)  <span class="pl-c"># get longest matching prefix</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L63" class="blob-num js-line-number js-blob-rnum" data-line-number="63"></td>
          <td id="file-olmo_trace-py-LC63" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">elif</span> <span class="pl-s1">_suff_res</span>[<span class="pl-s">'cnt'</span>] <span class="pl-c1">&gt;</span> <span class="pl-c1">0</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L64" class="blob-num js-line-number js-blob-rnum" data-line-number="64"></td>
          <td id="file-olmo_trace-py-LC64" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">matched_toks</span> <span class="pl-c1">=</span> <span class="pl-en">len</span>(<span class="pl-s1">_suffix</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L65" class="blob-num js-line-number js-blob-rnum" data-line-number="65"></td>
          <td id="file-olmo_trace-py-LC65" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">spans</span>.<span class="pl-c1">append</span>((<span class="pl-s1">start</span>, <span class="pl-s1">start</span> <span class="pl-c1">+</span> <span class="pl-s1">matched_toks</span>))</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L66" class="blob-num js-line-number js-blob-rnum" data-line-number="66"></td>
          <td id="file-olmo_trace-py-LC66" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L67" class="blob-num js-line-number js-blob-rnum" data-line-number="67"></td>
          <td id="file-olmo_trace-py-LC67" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># remove partial and non-self-contained spans</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L68" class="blob-num js-line-number js-blob-rnum" data-line-number="68"></td>
          <td id="file-olmo_trace-py-LC68" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">full_spans</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L69" class="blob-num js-line-number js-blob-rnum" data-line-number="69"></td>
          <td id="file-olmo_trace-py-LC69" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">start</span>, <span class="pl-s1">end</span> <span class="pl-c1">in</span> <span class="pl-s1">spans</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L70" class="blob-num js-line-number js-blob-rnum" data-line-number="70"></td>
          <td id="file-olmo_trace-py-LC70" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">span_ids</span> <span class="pl-c1">=</span> <span class="pl-s1">gen_ids</span>[<span class="pl-s1">start</span>: <span class="pl-s1">end</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L71" class="blob-num js-line-number js-blob-rnum" data-line-number="71"></td>
          <td id="file-olmo_trace-py-LC71" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">span_text</span> <span class="pl-c1">=</span> <span class="pl-s1">enc</span>.<span class="pl-c1">decode</span>(<span class="pl-s1">span_ids</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L72" class="blob-num js-line-number js-blob-rnum" data-line-number="72"></td>
          <td id="file-olmo_trace-py-LC72" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L73" class="blob-num js-line-number js-blob-rnum" data-line-number="73"></td>
          <td id="file-olmo_trace-py-LC73" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># check for internal punctuation</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L74" class="blob-num js-line-number js-blob-rnum" data-line-number="74"></td>
          <td id="file-olmo_trace-py-LC74" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">has_internal_punc</span> <span class="pl-c1">=</span> <span class="pl-c1">False</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L75" class="blob-num js-line-number js-blob-rnum" data-line-number="75"></td>
          <td id="file-olmo_trace-py-LC75" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">punc_chars</span> <span class="pl-c1">=</span> <span class="pl-s">"!.?<span class="pl-cce">\n</span>"</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L76" class="blob-num js-line-number js-blob-rnum" data-line-number="76"></td>
          <td id="file-olmo_trace-py-LC76" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">ch</span> <span class="pl-c1">in</span> <span class="pl-s1">span_text</span>[:<span class="pl-c1">-</span><span class="pl-c1">1</span>]:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L77" class="blob-num js-line-number js-blob-rnum" data-line-number="77"></td>
          <td id="file-olmo_trace-py-LC77" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">ch</span> <span class="pl-c1">in</span> <span class="pl-s1">punc_chars</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L78" class="blob-num js-line-number js-blob-rnum" data-line-number="78"></td>
          <td id="file-olmo_trace-py-LC78" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">has_internal_punc</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L79" class="blob-num js-line-number js-blob-rnum" data-line-number="79"></td>
          <td id="file-olmo_trace-py-LC79" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">break</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L80" class="blob-num js-line-number js-blob-rnum" data-line-number="80"></td>
          <td id="file-olmo_trace-py-LC80" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">has_internal_punc</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L81" class="blob-num js-line-number js-blob-rnum" data-line-number="81"></td>
          <td id="file-olmo_trace-py-LC81" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">continue</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L82" class="blob-num js-line-number js-blob-rnum" data-line-number="82"></td>
          <td id="file-olmo_trace-py-LC82" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L83" class="blob-num js-line-number js-blob-rnum" data-line-number="83"></td>
          <td id="file-olmo_trace-py-LC83" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># check if first token is a continuation of a word</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L84" class="blob-num js-line-number js-blob-rnum" data-line-number="84"></td>
          <td id="file-olmo_trace-py-LC84" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">first_tok_id</span> <span class="pl-c1">=</span> <span class="pl-s1">span_ids</span>[<span class="pl-c1">0</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L85" class="blob-num js-line-number js-blob-rnum" data-line-number="85"></td>
          <td id="file-olmo_trace-py-LC85" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">first_tok</span> <span class="pl-c1">=</span> <span class="pl-s1">enc</span>.<span class="pl-c1">convert_ids_to_tokens</span>(<span class="pl-s1">first_tok_id</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L86" class="blob-num js-line-number js-blob-rnum" data-line-number="86"></td>
          <td id="file-olmo_trace-py-LC86" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">first_tok</span>[<span class="pl-c1">0</span>] <span class="pl-c1">!=</span> <span class="pl-s">'&#9601;'</span>:  <span class="pl-c"># assumes Llama 2 token format</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L87" class="blob-num js-line-number js-blob-rnum" data-line-number="87"></td>
          <td id="file-olmo_trace-py-LC87" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">continue</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L88" class="blob-num js-line-number js-blob-rnum" data-line-number="88"></td>
          <td id="file-olmo_trace-py-LC88" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L89" class="blob-num js-line-number js-blob-rnum" data-line-number="89"></td>
          <td id="file-olmo_trace-py-LC89" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># no sub-token follows the last token</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L90" class="blob-num js-line-number js-blob-rnum" data-line-number="90"></td>
          <td id="file-olmo_trace-py-LC90" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">end</span> <span class="pl-c1">&lt;</span> <span class="pl-en">len</span>(<span class="pl-s1">gen_ids</span>) <span class="pl-c1">and</span> <span class="pl-s1">tokenizer</span>.<span class="pl-c1">convert_ids_to_tokens</span>(<span class="pl-s1">gen_ids</span>[<span class="pl-s1">end</span>])[<span class="pl-c1">0</span>] <span class="pl-c1">!=</span> <span class="pl-s">"&#9601;"</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L91" class="blob-num js-line-number js-blob-rnum" data-line-number="91"></td>
          <td id="file-olmo_trace-py-LC91" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">continue</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L92" class="blob-num js-line-number js-blob-rnum" data-line-number="92"></td>
          <td id="file-olmo_trace-py-LC92" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">full_spans</span>.<span class="pl-c1">append</span>((<span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">span_ids</span>, <span class="pl-s1">span_text</span>))    </td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L93" class="blob-num js-line-number js-blob-rnum" data-line-number="93"></td>
          <td id="file-olmo_trace-py-LC93" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L94" class="blob-num js-line-number js-blob-rnum" data-line-number="94"></td>
          <td id="file-olmo_trace-py-LC94" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># remove non-maximal spans</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L95" class="blob-num js-line-number js-blob-rnum" data-line-number="95"></td>
          <td id="file-olmo_trace-py-LC95" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">maximal_spans</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L96" class="blob-num js-line-number js-blob-rnum" data-line-number="96"></td>
          <td id="file-olmo_trace-py-LC96" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">max_end_pos</span> <span class="pl-c1">=</span> <span class="pl-c1">-</span><span class="pl-c1">1</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L97" class="blob-num js-line-number js-blob-rnum" data-line-number="97"></td>
          <td id="file-olmo_trace-py-LC97" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">full_spans</span> <span class="pl-c1">=</span> <span class="pl-en">sorted</span>(<span class="pl-s1">full_spans</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L98" class="blob-num js-line-number js-blob-rnum" data-line-number="98"></td>
          <td id="file-olmo_trace-py-LC98" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">ids</span>, <span class="pl-s1">text</span> <span class="pl-c1">in</span> <span class="pl-s1">full_spans</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L99" class="blob-num js-line-number js-blob-rnum" data-line-number="99"></td>
          <td id="file-olmo_trace-py-LC99" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">end</span> <span class="pl-c1">&gt;</span> <span class="pl-s1">max_end_pos</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L100" class="blob-num js-line-number js-blob-rnum" data-line-number="100"></td>
          <td id="file-olmo_trace-py-LC100" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">maximal_spans</span>.<span class="pl-c1">append</span>((<span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">ids</span>, <span class="pl-s1">text</span>))</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L101" class="blob-num js-line-number js-blob-rnum" data-line-number="101"></td>
          <td id="file-olmo_trace-py-LC101" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">max_end_pos</span> <span class="pl-c1">=</span> <span class="pl-s1">end</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L102" class="blob-num js-line-number js-blob-rnum" data-line-number="102"></td>
          <td id="file-olmo_trace-py-LC102" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L103" class="blob-num js-line-number js-blob-rnum" data-line-number="103"></td>
          <td id="file-olmo_trace-py-LC103" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L104" class="blob-num js-line-number js-blob-rnum" data-line-number="104"></td>
          <td id="file-olmo_trace-py-LC104" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L105" class="blob-num js-line-number js-blob-rnum" data-line-number="105"></td>
          <td id="file-olmo_trace-py-LC105" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Step Two: filter to keep long / unique spans</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L106" class="blob-num js-line-number js-blob-rnum" data-line-number="106"></td>
          <td id="file-olmo_trace-py-LC106" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L107" class="blob-num js-line-number js-blob-rnum" data-line-number="107"></td>
          <td id="file-olmo_trace-py-LC107" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">K</span> <span class="pl-c1">=</span> <span class="pl-s1">math</span>.<span class="pl-c1">ceil</span>(<span class="pl-c1">0.05</span> <span class="pl-c1">*</span> <span class="pl-c1">L</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L108" class="blob-num js-line-number js-blob-rnum" data-line-number="108"></td>
          <td id="file-olmo_trace-py-LC108" class="blob-code blob-code-inner js-file-line"><span class="pl-k">assert</span> <span class="pl-c1">K</span> <span class="pl-c1">&gt;</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L109" class="blob-num js-line-number js-blob-rnum" data-line-number="109"></td>
          <td id="file-olmo_trace-py-LC109" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">filt_spans</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L110" class="blob-num js-line-number js-blob-rnum" data-line-number="110"></td>
          <td id="file-olmo_trace-py-LC110" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">ids</span>, <span class="pl-s1">text</span> <span class="pl-c1">in</span> <span class="pl-s1">maximal_spans</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L111" class="blob-num js-line-number js-blob-rnum" data-line-number="111"></td>
          <td id="file-olmo_trace-py-LC111" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">span_uni_prob</span> <span class="pl-c1">=</span> [<span class="pl-s1">unigram_probs</span>.<span class="pl-c1">get</span>(<span class="pl-s1">_id</span>) <span class="pl-k">for</span> <span class="pl-s1">_id</span> <span class="pl-c1">in</span> <span class="pl-s1">ids</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L112" class="blob-num js-line-number js-blob-rnum" data-line-number="112"></td>
          <td id="file-olmo_trace-py-LC112" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">span_uni_prob</span> <span class="pl-c1">=</span> <span class="pl-s1">math</span>.<span class="pl-c1">prod</span>(<span class="pl-s1">span_uni_prob</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L113" class="blob-num js-line-number js-blob-rnum" data-line-number="113"></td>
          <td id="file-olmo_trace-py-LC113" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">filt_spans</span>.<span class="pl-c1">append</span>((<span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">ids</span>, <span class="pl-s1">text</span>, <span class="pl-s1">span_uni_prob</span>))</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L114" class="blob-num js-line-number js-blob-rnum" data-line-number="114"></td>
          <td id="file-olmo_trace-py-LC114" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">filt_spans</span> <span class="pl-c1">=</span> <span class="pl-en">sorted</span>(<span class="pl-s1">filt_spans</span>, <span class="pl-s1">key</span><span class="pl-c1">=</span><span class="pl-k">lambda</span> <span class="pl-s1">x</span>: <span class="pl-s1">x</span>[<span class="pl-c1">-</span><span class="pl-c1">1</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L115" class="blob-num js-line-number js-blob-rnum" data-line-number="115"></td>
          <td id="file-olmo_trace-py-LC115" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">filt_spans</span> <span class="pl-c1">=</span> <span class="pl-s1">filt_spans</span>[:<span class="pl-c1">K</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L116" class="blob-num js-line-number js-blob-rnum" data-line-number="116"></td>
          <td id="file-olmo_trace-py-LC116" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">filt_spans</span> <span class="pl-c1">=</span> <span class="pl-en">sorted</span>(<span class="pl-s1">filt_spans</span>)  <span class="pl-c"># sort based on start position again</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L117" class="blob-num js-line-number js-blob-rnum" data-line-number="117"></td>
          <td id="file-olmo_trace-py-LC117" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L118" class="blob-num js-line-number js-blob-rnum" data-line-number="118"></td>
          <td id="file-olmo_trace-py-LC118" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L119" class="blob-num js-line-number js-blob-rnum" data-line-number="119"></td>
          <td id="file-olmo_trace-py-LC119" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L120" class="blob-num js-line-number js-blob-rnum" data-line-number="120"></td>
          <td id="file-olmo_trace-py-LC120" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Step Three: retrieve Enclosing Docs</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L121" class="blob-num js-line-number js-blob-rnum" data-line-number="121"></td>
          <td id="file-olmo_trace-py-LC121" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L122" class="blob-num js-line-number js-blob-rnum" data-line-number="122"></td>
          <td id="file-olmo_trace-py-LC122" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">docs_per_span</span> <span class="pl-c1">=</span> <span class="pl-c1">10</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L123" class="blob-num js-line-number js-blob-rnum" data-line-number="123"></td>
          <td id="file-olmo_trace-py-LC123" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">span_to_docs</span> <span class="pl-c1">=</span> <span class="pl-en">defaultdict</span>(<span class="pl-s1">list</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L124" class="blob-num js-line-number js-blob-rnum" data-line-number="124"></td>
          <td id="file-olmo_trace-py-LC124" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">i</span>, (<span class="pl-s1">start</span>, <span class="pl-s1">end</span>, <span class="pl-s1">ids</span>, <span class="pl-s1">text</span>, <span class="pl-s1">uni_prob</span>) <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">filt_spans</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L125" class="blob-num js-line-number js-blob-rnum" data-line-number="125"></td>
          <td id="file-olmo_trace-py-LC125" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># run retrieval in infinigram index to get documents</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L126" class="blob-num js-line-number js-blob-rnum" data-line-number="126"></td>
          <td id="file-olmo_trace-py-LC126" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">span_res</span> <span class="pl-c1">=</span> <span class="pl-s1">engine</span>.<span class="pl-c1">find</span>(<span class="pl-s1">input_ids</span><span class="pl-c1">=</span><span class="pl-s1">ids</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L127" class="blob-num js-line-number js-blob-rnum" data-line-number="127"></td>
          <td id="file-olmo_trace-py-LC127" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">assert</span> <span class="pl-s1">span_res</span>[<span class="pl-s">'cnt'</span>] <span class="pl-c1">&gt;</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L128" class="blob-num js-line-number js-blob-rnum" data-line-number="128"></td>
          <td id="file-olmo_trace-py-LC128" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">assert</span> <span class="pl-en">len</span>(<span class="pl-s1">span_res</span>[<span class="pl-s">'segment_by_shard'</span>]) <span class="pl-c1">==</span> <span class="pl-c1">1</span>  <span class="pl-c"># assume only one shard</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L129" class="blob-num js-line-number js-blob-rnum" data-line-number="129"></td>
          <td id="file-olmo_trace-py-LC129" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L130" class="blob-num js-line-number js-blob-rnum" data-line-number="130"></td>
          <td id="file-olmo_trace-py-LC130" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">rank_start</span>, <span class="pl-s1">rank_end</span> <span class="pl-c1">=</span> <span class="pl-s1">span_res</span>[<span class="pl-s">'segment_by_shard'</span>][<span class="pl-c1">0</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L131" class="blob-num js-line-number js-blob-rnum" data-line-number="131"></td>
          <td id="file-olmo_trace-py-LC131" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">ranks</span> <span class="pl-c1">=</span> [<span class="pl-s1">r</span> <span class="pl-k">for</span> <span class="pl-s1">r</span> <span class="pl-c1">in</span> <span class="pl-en">range</span>(<span class="pl-s1">rank_start</span>, <span class="pl-s1">rank_end</span>)]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L132" class="blob-num js-line-number js-blob-rnum" data-line-number="132"></td>
          <td id="file-olmo_trace-py-LC132" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-en">len</span>(<span class="pl-s1">ranks</span>) <span class="pl-c1">&gt;</span> <span class="pl-s1">docs_per_span</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L133" class="blob-num js-line-number js-blob-rnum" data-line-number="133"></td>
          <td id="file-olmo_trace-py-LC133" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># retrieve fixed number of documents for each span</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L134" class="blob-num js-line-number js-blob-rnum" data-line-number="134"></td>
          <td id="file-olmo_trace-py-LC134" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">ranks</span> <span class="pl-c1">=</span> <span class="pl-en">sorted</span>(<span class="pl-s1">random</span>.<span class="pl-c1">sample</span>(<span class="pl-s1">ranks</span>, <span class="pl-s1">docs_per_span</span>))</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L135" class="blob-num js-line-number js-blob-rnum" data-line-number="135"></td>
          <td id="file-olmo_trace-py-LC135" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L136" class="blob-num js-line-number js-blob-rnum" data-line-number="136"></td>
          <td id="file-olmo_trace-py-LC136" class="blob-code blob-code-inner js-file-line">    <span class="pl-c"># NOTE: we can instead rank documents by BM25 score here!</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L137" class="blob-num js-line-number js-blob-rnum" data-line-number="137"></td>
          <td id="file-olmo_trace-py-LC137" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">r</span> <span class="pl-c1">in</span> <span class="pl-s1">ranks</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L138" class="blob-num js-line-number js-blob-rnum" data-line-number="138"></td>
          <td id="file-olmo_trace-py-LC138" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_doc</span> <span class="pl-c1">=</span> <span class="pl-s1">engine</span>.<span class="pl-c1">get_doc_by_rank</span>(</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L139" class="blob-num js-line-number js-blob-rnum" data-line-number="139"></td>
          <td id="file-olmo_trace-py-LC139" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">s</span><span class="pl-c1">=</span><span class="pl-c1">0</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L140" class="blob-num js-line-number js-blob-rnum" data-line-number="140"></td>
          <td id="file-olmo_trace-py-LC140" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">rank</span><span class="pl-c1">=</span><span class="pl-s1">r</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L141" class="blob-num js-line-number js-blob-rnum" data-line-number="141"></td>
          <td id="file-olmo_trace-py-LC141" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">max_disp_len</span><span class="pl-c1">=</span><span class="pl-s1">max_doc_toks</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L142" class="blob-num js-line-number js-blob-rnum" data-line-number="142"></td>
          <td id="file-olmo_trace-py-LC142" class="blob-code blob-code-inner js-file-line">        )</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L143" class="blob-num js-line-number js-blob-rnum" data-line-number="143"></td>
          <td id="file-olmo_trace-py-LC143" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_doc_meta</span> <span class="pl-c1">=</span> <span class="pl-s1">ast</span>.<span class="pl-c1">literal_eval</span>(<span class="pl-s1">_doc</span>[<span class="pl-s">'metadata'</span>])[<span class="pl-s">'metadata'</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L144" class="blob-num js-line-number js-blob-rnum" data-line-number="144"></td>
          <td id="file-olmo_trace-py-LC144" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_doc_text</span> <span class="pl-c1">=</span> <span class="pl-s1">enc</span>.<span class="pl-c1">decode</span>(<span class="pl-s1">_doc</span>[<span class="pl-s">'token_ids'</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L145" class="blob-num js-line-number js-blob-rnum" data-line-number="145"></td>
          <td id="file-olmo_trace-py-LC145" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_doc_data</span> <span class="pl-c1">=</span> {</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L146" class="blob-num js-line-number js-blob-rnum" data-line-number="146"></td>
          <td id="file-olmo_trace-py-LC146" class="blob-code blob-code-inner js-file-line">            <span class="pl-s">"text"</span>: <span class="pl-s1">_doc_text</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L147" class="blob-num js-line-number js-blob-rnum" data-line-number="147"></td>
          <td id="file-olmo_trace-py-LC147" class="blob-code blob-code-inner js-file-line">            <span class="pl-c1">**</span><span class="pl-s1">_doc_meta</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L148" class="blob-num js-line-number js-blob-rnum" data-line-number="148"></td>
          <td id="file-olmo_trace-py-LC148" class="blob-code blob-code-inner js-file-line">        }</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L149" class="blob-num js-line-number js-blob-rnum" data-line-number="149"></td>
          <td id="file-olmo_trace-py-LC149" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">span_to_docs</span>[<span class="pl-s1">i</span>].<span class="pl-c1">append</span>(<span class="pl-s1">_doc_data</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L150" class="blob-num js-line-number js-blob-rnum" data-line-number="150"></td>
          <td id="file-olmo_trace-py-LC150" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L151" class="blob-num js-line-number js-blob-rnum" data-line-number="151"></td>
          <td id="file-olmo_trace-py-LC151" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L152" class="blob-num js-line-number js-blob-rnum" data-line-number="152"></td>
          <td id="file-olmo_trace-py-LC152" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L153" class="blob-num js-line-number js-blob-rnum" data-line-number="153"></td>
          <td id="file-olmo_trace-py-LC153" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Step Four: merge overlapping spans</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L154" class="blob-num js-line-number js-blob-rnum" data-line-number="154"></td>
          <td id="file-olmo_trace-py-LC154" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L155" class="blob-num js-line-number js-blob-rnum" data-line-number="155"></td>
          <td id="file-olmo_trace-py-LC155" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># get indices of spans to merge together</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L156" class="blob-num js-line-number js-blob-rnum" data-line-number="156"></td>
          <td id="file-olmo_trace-py-LC156" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">merged_spans</span> <span class="pl-c1">=</span> [[<span class="pl-c1">0</span>]]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L157" class="blob-num js-line-number js-blob-rnum" data-line-number="157"></td>
          <td id="file-olmo_trace-py-LC157" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">curr_idx</span> <span class="pl-c1">=</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L158" class="blob-num js-line-number js-blob-rnum" data-line-number="158"></td>
          <td id="file-olmo_trace-py-LC158" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">curr_start</span> <span class="pl-c1">=</span> <span class="pl-s1">filt_spans</span>[<span class="pl-c1">0</span>][<span class="pl-c1">0</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L159" class="blob-num js-line-number js-blob-rnum" data-line-number="159"></td>
          <td id="file-olmo_trace-py-LC159" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">curr_end</span> <span class="pl-c1">=</span> <span class="pl-s1">filt_spans</span>[<span class="pl-c1">0</span>][<span class="pl-c1">1</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L160" class="blob-num js-line-number js-blob-rnum" data-line-number="160"></td>
          <td id="file-olmo_trace-py-LC160" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">i</span>, <span class="pl-s1">next_span</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">filt_spans</span>[<span class="pl-c1">1</span>:]):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L161" class="blob-num js-line-number js-blob-rnum" data-line-number="161"></td>
          <td id="file-olmo_trace-py-LC161" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">start</span> <span class="pl-c1">=</span> <span class="pl-s1">next_span</span>[<span class="pl-c1">0</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L162" class="blob-num js-line-number js-blob-rnum" data-line-number="162"></td>
          <td id="file-olmo_trace-py-LC162" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">end</span> <span class="pl-c1">=</span> <span class="pl-s1">next_span</span>[<span class="pl-c1">1</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L163" class="blob-num js-line-number js-blob-rnum" data-line-number="163"></td>
          <td id="file-olmo_trace-py-LC163" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">if</span> <span class="pl-s1">start</span> <span class="pl-c1">&lt;</span> <span class="pl-s1">curr_end</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L164" class="blob-num js-line-number js-blob-rnum" data-line-number="164"></td>
          <td id="file-olmo_trace-py-LC164" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">curr_end</span> <span class="pl-c1">=</span> <span class="pl-en">max</span>(<span class="pl-s1">curr_end</span>, <span class="pl-s1">end</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L165" class="blob-num js-line-number js-blob-rnum" data-line-number="165"></td>
          <td id="file-olmo_trace-py-LC165" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">merged_spans</span>[<span class="pl-s1">curr_idx</span>].<span class="pl-c1">append</span>(<span class="pl-s1">i</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L166" class="blob-num js-line-number js-blob-rnum" data-line-number="166"></td>
          <td id="file-olmo_trace-py-LC166" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">else</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L167" class="blob-num js-line-number js-blob-rnum" data-line-number="167"></td>
          <td id="file-olmo_trace-py-LC167" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">curr_start</span>, <span class="pl-s1">curr_end</span> <span class="pl-c1">=</span> <span class="pl-s1">start</span>, <span class="pl-s1">end</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L168" class="blob-num js-line-number js-blob-rnum" data-line-number="168"></td>
          <td id="file-olmo_trace-py-LC168" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">curr_idx</span> <span class="pl-c1">+=</span> <span class="pl-c1">1</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L169" class="blob-num js-line-number js-blob-rnum" data-line-number="169"></td>
          <td id="file-olmo_trace-py-LC169" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">merged_spans</span>.<span class="pl-c1">append</span>([<span class="pl-s1">i</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L170" class="blob-num js-line-number js-blob-rnum" data-line-number="170"></td>
          <td id="file-olmo_trace-py-LC170" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-en">len</span>(<span class="pl-s1">merged_spans</span>) <span class="pl-c1">==</span> <span class="pl-s1">curr_idx</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L171" class="blob-num js-line-number js-blob-rnum" data-line-number="171"></td>
          <td id="file-olmo_trace-py-LC171" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L172" class="blob-num js-line-number js-blob-rnum" data-line-number="172"></td>
          <td id="file-olmo_trace-py-LC172" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># merge spans into a final set</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L173" class="blob-num js-line-number js-blob-rnum" data-line-number="173"></td>
          <td id="file-olmo_trace-py-LC173" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">final_spans</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L174" class="blob-num js-line-number js-blob-rnum" data-line-number="174"></td>
          <td id="file-olmo_trace-py-LC174" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">ms</span> <span class="pl-c1">in</span> <span class="pl-s1">merged_spans</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L175" class="blob-num js-line-number js-blob-rnum" data-line-number="175"></td>
          <td id="file-olmo_trace-py-LC175" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">all_docs</span> <span class="pl-c1">=</span> []</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L176" class="blob-num js-line-number js-blob-rnum" data-line-number="176"></td>
          <td id="file-olmo_trace-py-LC176" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">docs_per_merged_span</span> <span class="pl-c1">=</span> <span class="pl-s1">math</span>.<span class="pl-c1">ceil</span>(<span class="pl-s1">docs_per_span</span> <span class="pl-c1">/</span> <span class="pl-en">float</span>(<span class="pl-en">len</span>(<span class="pl-s1">ms</span>)))  <span class="pl-c"># subsample docs for spans being merged</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L177" class="blob-num js-line-number js-blob-rnum" data-line-number="177"></td>
          <td id="file-olmo_trace-py-LC177" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">i</span> <span class="pl-c1">in</span> <span class="pl-s1">ms</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L178" class="blob-num js-line-number js-blob-rnum" data-line-number="178"></td>
          <td id="file-olmo_trace-py-LC178" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># take top docs from each span being merged</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L179" class="blob-num js-line-number js-blob-rnum" data-line-number="179"></td>
          <td id="file-olmo_trace-py-LC179" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">all_docs</span>.<span class="pl-c1">extend</span>(<span class="pl-s1">span_to_docs</span>[<span class="pl-s1">i</span>][:<span class="pl-s1">docs_per_merged_span</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L180" class="blob-num js-line-number js-blob-rnum" data-line-number="180"></td>
          <td id="file-olmo_trace-py-LC180" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">_spans</span> <span class="pl-c1">=</span> [<span class="pl-s1">filt_spans</span>[<span class="pl-s1">i</span>] <span class="pl-k">for</span> <span class="pl-s1">i</span> <span class="pl-c1">in</span> <span class="pl-s1">ms</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L181" class="blob-num js-line-number js-blob-rnum" data-line-number="181"></td>
          <td id="file-olmo_trace-py-LC181" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">start</span> <span class="pl-c1">=</span> <span class="pl-en">min</span>([<span class="pl-s1">x</span>[<span class="pl-c1">0</span>] <span class="pl-k">for</span> <span class="pl-s1">x</span> <span class="pl-c1">in</span> <span class="pl-s1">_spans</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L182" class="blob-num js-line-number js-blob-rnum" data-line-number="182"></td>
          <td id="file-olmo_trace-py-LC182" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">end</span> <span class="pl-c1">=</span> <span class="pl-en">max</span>([<span class="pl-s1">x</span>[<span class="pl-c1">1</span>] <span class="pl-k">for</span> <span class="pl-s1">x</span> <span class="pl-c1">in</span> <span class="pl-s1">_spans</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L183" class="blob-num js-line-number js-blob-rnum" data-line-number="183"></td>
          <td id="file-olmo_trace-py-LC183" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">text</span> <span class="pl-c1">=</span> <span class="pl-s1">enc</span>.<span class="pl-c1">decode</span>(<span class="pl-s1">gen_ids</span>[<span class="pl-s1">start</span>: <span class="pl-s1">end</span>])</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L184" class="blob-num js-line-number js-blob-rnum" data-line-number="184"></td>
          <td id="file-olmo_trace-py-LC184" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">final_spans</span>.<span class="pl-c1">append</span>({</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L185" class="blob-num js-line-number js-blob-rnum" data-line-number="185"></td>
          <td id="file-olmo_trace-py-LC185" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"start"</span>: <span class="pl-s1">start</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L186" class="blob-num js-line-number js-blob-rnum" data-line-number="186"></td>
          <td id="file-olmo_trace-py-LC186" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"end"</span>: <span class="pl-s1">end</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L187" class="blob-num js-line-number js-blob-rnum" data-line-number="187"></td>
          <td id="file-olmo_trace-py-LC187" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"text"</span>: <span class="pl-s1">text</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L188" class="blob-num js-line-number js-blob-rnum" data-line-number="188"></td>
          <td id="file-olmo_trace-py-LC188" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"docs"</span>: <span class="pl-s1">all_docs</span>,</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L189" class="blob-num js-line-number js-blob-rnum" data-line-number="189"></td>
          <td id="file-olmo_trace-py-LC189" class="blob-code blob-code-inner js-file-line">    })</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L190" class="blob-num js-line-number js-blob-rnum" data-line-number="190"></td>
          <td id="file-olmo_trace-py-LC190" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L191" class="blob-num js-line-number js-blob-rnum" data-line-number="191"></td>
          <td id="file-olmo_trace-py-LC191" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L192" class="blob-num js-line-number js-blob-rnum" data-line-number="192"></td>
          <td id="file-olmo_trace-py-LC192" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L193" class="blob-num js-line-number js-blob-rnum" data-line-number="193"></td>
          <td id="file-olmo_trace-py-LC193" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Step Five: observe tracing results</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L194" class="blob-num js-line-number js-blob-rnum" data-line-number="194"></td>
          <td id="file-olmo_trace-py-LC194" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L195" class="blob-num js-line-number js-blob-rnum" data-line-number="195"></td>
          <td id="file-olmo_trace-py-LC195" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">docs_to_print</span> <span class="pl-c1">=</span> <span class="pl-c1">5</span></td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L196" class="blob-num js-line-number js-blob-rnum" data-line-number="196"></td>
          <td id="file-olmo_trace-py-LC196" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s">f'Query Text: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">enc</span>.<span class="pl-c1">decode</span>(<span class="pl-s1">gen_ids</span>)<span class="pl-kos">}</span></span>'</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L197" class="blob-num js-line-number js-blob-rnum" data-line-number="197"></td>
          <td id="file-olmo_trace-py-LC197" class="blob-code blob-code-inner js-file-line"><span class="pl-k">for</span> <span class="pl-s1">i</span>, <span class="pl-s1">sp</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">final_spans</span>):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L198" class="blob-num js-line-number js-blob-rnum" data-line-number="198"></td>
          <td id="file-olmo_trace-py-LC198" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">print</span>(<span class="pl-s">"<span class="pl-cce">\n</span>"</span> <span class="pl-c1">+</span> <span class="pl-s">"="</span><span class="pl-c1">*</span><span class="pl-c1">20</span> <span class="pl-c1">+</span> <span class="pl-s">f" SPAN <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">i</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span><span class="pl-kos">}</span></span> / <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-en">len</span>(<span class="pl-s1">final_spans</span>)<span class="pl-kos">}</span></span> "</span> <span class="pl-c1">+</span> <span class="pl-s">"="</span><span class="pl-c1">*</span><span class="pl-c1">20</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L199" class="blob-num js-line-number js-blob-rnum" data-line-number="199"></td>
          <td id="file-olmo_trace-py-LC199" class="blob-code blob-code-inner js-file-line">    <span class="pl-en">print</span>(<span class="pl-s">f"Span Text: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">sp</span>[<span class="pl-s">'text'</span>]<span class="pl-kos">}</span></span><span class="pl-cce">\n</span>"</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L200" class="blob-num js-line-number js-blob-rnum" data-line-number="200"></td>
          <td id="file-olmo_trace-py-LC200" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">for</span> <span class="pl-s1">j</span>, <span class="pl-s1">doc</span> <span class="pl-c1">in</span> <span class="pl-en">enumerate</span>(<span class="pl-s1">sp</span>[<span class="pl-s">'docs'</span>]):</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L201" class="blob-num js-line-number js-blob-rnum" data-line-number="201"></td>
          <td id="file-olmo_trace-py-LC201" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">print</span>(<span class="pl-s">"-"</span><span class="pl-c1">*</span><span class="pl-c1">10</span> <span class="pl-c1">+</span> <span class="pl-s">f" Document <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">j</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span><span class="pl-kos">}</span></span> / <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-en">len</span>(<span class="pl-s1">sp</span>[<span class="pl-s">'docs'</span>])<span class="pl-kos">}</span></span> "</span> <span class="pl-c1">+</span> <span class="pl-s">"-"</span><span class="pl-c1">*</span><span class="pl-c1">10</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L202" class="blob-num js-line-number js-blob-rnum" data-line-number="202"></td>
          <td id="file-olmo_trace-py-LC202" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">k</span> <span class="pl-c1">in</span> [<span class="pl-s">'text'</span>, <span class="pl-s">'movie_id'</span>, <span class="pl-s">'src_lang'</span>, <span class="pl-s">'start_frame'</span>, <span class="pl-s">'end_frame'</span>]:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L203" class="blob-num js-line-number js-blob-rnum" data-line-number="203"></td>
          <td id="file-olmo_trace-py-LC203" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">if</span> <span class="pl-s1">k</span> <span class="pl-c1">==</span> <span class="pl-s">'text'</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L204" class="blob-num js-line-number js-blob-rnum" data-line-number="204"></td>
          <td id="file-olmo_trace-py-LC204" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">v</span> <span class="pl-c1">=</span> <span class="pl-s1">doc</span>[<span class="pl-s1">k</span>].<span class="pl-c1">replace</span>(<span class="pl-s">'<span class="pl-cce">\n</span>'</span>, <span class="pl-s">' '</span>)</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L205" class="blob-num js-line-number js-blob-rnum" data-line-number="205"></td>
          <td id="file-olmo_trace-py-LC205" class="blob-code blob-code-inner js-file-line">            <span class="pl-k">else</span>:</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L206" class="blob-num js-line-number js-blob-rnum" data-line-number="206"></td>
          <td id="file-olmo_trace-py-LC206" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">v</span> <span class="pl-c1">=</span> <span class="pl-s1">doc</span>[<span class="pl-s1">k</span>]</td>
        </tr>
        <tr>
          <td id="file-olmo_trace-py-L207" class="blob-num js-line-number js-blob-rnum" data-line-number="207"></td>
          <td id="file-olmo_trace-py-LC207" class="blob-code blob-code-inner js-file-line">            <span class="pl-en">print</span>(<span class="pl-s">f"- <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">k</span><span class="pl-kos">}</span></span> --&gt; <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">v</span><span class="pl-kos">}</span></span>"</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/306aa72a0c5095db460e2ccea9b06777/raw/e1040a0e8198f9d82bbe20bcc7246416ed80bb0f/olmo_trace.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/306aa72a0c5095db460e2ccea9b06777#file-olmo_trace-py" class="Link--inTextBlock">
          olmo_trace.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>As we can see, the core functionality of OLMoTrace is not that complicated&#8212;<em>most of the complex code is already abstracted away by the infini-gram package</em>! For those who are interested, I would highly recommend testing out this code on your own model and data to get a feel for the types of results it can return!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pLsI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pLsI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 424w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 848w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 1272w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pLsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png" width="1456" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:551834,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pLsI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 424w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 848w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 1272w, https://substackcdn.com/image/fetch/$s_!pLsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332e82aa-8b1d-4c48-8baf-13d820ba8e81_1840x432.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">OLMoTrace use cases (from [2])</figcaption></figure></div><p><strong>Applications of OLMoTrace. </strong>OLMoTrace specializes in finding long and unique spans that exactly match between an LLM&#8217;s output and its training data. Exact matches are a useful proxy for finding training data that may contribute to a certain output from our LLM. In [2], a variety of different use cases are considered:</p><ul><li><p><em>Fact checking</em>: compare factual statements made by the LLM to similar factual statements within its training data. </p></li><li><p><em>Creative expressions</em>: check if &#8220;creative&#8221; outputs from the LLM are actually creative, or just directly copied from training data. </p></li><li><p><em>Reasoning capabilities</em>: check if the LLM copies the reasoning process used to solve verifiable problems (e.g., math) from its training data. </p></li></ul><p>In each of these cases, we can learn something new about our LLM by tracing its output to find regions of the training data with a notable, verbatim match.</p><h4>Reasoning Models and Future Research</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Brs9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Brs9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 424w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 848w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 1272w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Brs9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png" width="1456" height="309" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:309,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Brs9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 424w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 848w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 1272w, https://substackcdn.com/image/fetch/$s_!Brs9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3ea8fb-4672-4580-b9e5-6f9520114cf0_2344x498.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Stages of LLM training (from [4, 5, 6])</figcaption></figure></div><p><strong>Extension to reasoning models.</strong> As shown above, LLMs are usually trained in several phases, each of which have unique styles of data:</p><ul><li><p><em><a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">Supervised Finetuning (SFT)</a></em>: trains the LLM using concrete examples of prompt-response pairs that the LLM should replicate.</p></li><li><p><em><a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">Reinforcement Learning from Human Feedback (RLHF)</a></em>: trains the model using preference pairs (i.e., a single prompt with two responses, where one of the two responses is identified as better than the other). </p></li><li><p><em><a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">Reinforcement Learning from Verifiable Rewards (RLVR)</a></em>: uses pure RL to reward the model for correctly solving verifiable problems as determined by a rule-based (usually deterministic) verification function. </p></li></ul><p>Despite these unique data formats, we can apply OLMoTrace to each stage of training with minimal changes! We can easily build an infini-gram index over supervised examples and preference pairs (though we may want to treat the positive and negative completions in the preference pair differently). For RLVR, however, <em>we may need to think more deeply about how the data should be traced</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="1456" height="499" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When training an LLM with RLVR, we have a dataset of problems with verifiable solutions; e.g., a math problem with a known solution or a coding problem with test cases. We can easily check whether the LLM solves such problems correctly (e.g., by string matching or something slightly more robust); see above. Then, the model learns how to solve these problems on its own via a self-evolution process powered by large-scale RL training, as demonstrated by <a href="https://cameronrwolfe.substack.com/i/153722335/open-reasoning-deepseek-r-and-more">DeepSeek-R1</a> [7].</p><blockquote><p><em>&#8220;We explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process.&#8221; </em>- from [7]</p></blockquote><p>During RL training, we see in [7] that LLMs learn to output complex chains of thought&#8212;<em>sometimes</em> <em>thousands of tokens in length!</em>&#8212;to improve their reasoning capabilities. If we want to index these reasoning traces, however, we run into an interesting problem. Namely, the reasoning traces are not actually part of our training data&#8212;<em>they are generated by the LLM during the RL training process</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" width="568" height="316.7692307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:568,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [7])</figcaption></figure></div><p>Similarly, the LLM generates completions that are ranked by a reward model and used for policy updates during RLHF; see <a href="https://huggingface.co/blog/rlhf">here</a> for further explanation. If we want to capture patterns learned during RL training&#8212;<em>including both RLHF and RLVR</em>&#8212;we have to keep track of the completions generated by our LLM during training. Given access to these completions, we can index them like any other training data, add them to an infini-gram index, and trace them using OLMoTrace. </p><p><strong>Related (and future) research.</strong> Despite the utility of OLMoTrace, exact matches do NOT guarantee causality&#8212;<em>there are many reasons an LLM may have generated an output</em>. Just because we find training data that is similar to an output from our LLM does not mean that this data is guaranteed to have caused this output. </p><p>Attempting to provide deeper insight into the outputs of an LLM, several parallel veins of research are investigating alternative strategies for explainability. For example, many papers have been recently published on the topic of teaching LLMs how to cite sources when generating output [8, 9, 10]; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ss5p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ss5p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 424w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 848w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 1272w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ss5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png" width="608" height="398.67292644757435" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1278,&quot;resizeWidth&quot;:608,&quot;bytes&quot;:300740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/162722014?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ss5p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 424w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 848w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 1272w, https://substackcdn.com/image/fetch/$s_!Ss5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a885a-b083-4dc6-b86d-33001a12fd90_1278x838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>Such an ability to cite sources can be incorporated into the LLM&#8217;s standard training process&#8212;<em>e.g., pretraining [8] or RLHF [9]</em>&#8212;such that the model learns when and how to provide evidence for its answers. However, there is still no guarantee that these citations truly explain how an output was generated.</p><p>The field of <a href="https://distill.pub/2020/circuits/zoom-in/">mechanistic interpretability</a> seeks to study the internals of neural networks to gain an understanding of why they produce the outputs that they do. Although deep neural networks are typically portrayed as block boxes, we can discover many repeated circuits and features in these networks when studied at a microscopic level (i.e., small sets of weights). For example, vision networks tend to have dedicated units for detecting curves, edges and much more.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EN03!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EN03!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 424w, https://substackcdn.com/image/fetch/$s_!EN03!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 848w, https://substackcdn.com/image/fetch/$s_!EN03!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 1272w, https://substackcdn.com/image/fetch/$s_!EN03!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EN03!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp" width="1456" height="1099" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1099,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Abstract Feature Examples&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Abstract Feature Examples" title="Abstract Feature Examples" srcset="https://substackcdn.com/image/fetch/$s_!EN03!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 424w, https://substackcdn.com/image/fetch/$s_!EN03!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 848w, https://substackcdn.com/image/fetch/$s_!EN03!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 1272w, https://substackcdn.com/image/fetch/$s_!EN03!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16d01ddd-a81c-442b-bccd-0f8af4d3c5ca_2200x1660.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The topic of mechanistic interpretability was largely popularized by <a href="https://www.anthropic.com/">Anthropic</a>. In a <a href="https://www.anthropic.com/research/mapping-mind-language-model">recent report</a>, researchers performed a large-scale study of features in Claude Sonnet using <a href="https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning">dictionary learning</a>. As shown above, this study discovered millions of features for advanced concepts, such as people, places, bugs in code and more. </p><blockquote><p><em>&#8220;We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model.&#8221;</em> - from [11]</p></blockquote><p>Additionally, authors analyze the &#8220;distance&#8221; between features and find some interesting properties; e.g., the Golden Gate Bridge feature is close to that of Alcatraz. Such research, though nascent, is arguably the most promising avenue for truly understanding why and how LLMs produce certain outputs.</p><h2>Conclusions</h2><p>As we have learned, optimizing our training dataset is one of the most impactful and important aspects of the LLM training process. To effectively curate and debug our data, we should begin by looking at the data itself&#8212;<em>not by training models</em>! First, we should manually inspect our data and develop an understanding of its various properties, patterns and quirks. To scale the manual inspection process, we can rely upon both heuristics (when possible) and machine learning models; e.g., fastText or LLM judges. This data-focused curation process focuses upon fixing issues and improving data quality before training any LLMs!</p><blockquote><p><em>&#8220;One pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the problem.&#8221;</em> - <a href="https://x.com/_jasonwei/status/1708921475829481683?s=20">Jason Wei</a></p></blockquote><p>Once we start training LLMs, we can use the LLM&#8217;s outputs to find issues in our data. More specifically, we can:</p><ol><li><p>Identify problematic LLM outputs via our evaluation framework.</p></li><li><p>Trace these outputs to corresponding regions of the training data.</p></li></ol><p>Although we can use standard search techniques&#8212;<em>like lexical or vector search</em>&#8212;for tracing data, there are specialized tracing techniques that have been specifically developed for LLMs like OLMoTrace [2]. These techniques are easy (and quick) to setup, highly informative and can be scaled to arbitrarily large datasets, <em>making them a very practical choice for debugging LLM training datasets</em>.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. If you like the newsletter, please subscribe, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Liu, Jiacheng, et al. "Infini-gram: Scaling unbounded n-gram language models to a trillion tokens." <em>arXiv preprint arXiv:2401.17377</em> (2024).</p><p>[2] Liu, Jiacheng, et al. "OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens." <em>arXiv preprint arXiv:2504.07096</em> (2025).</p><p>[3] Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." <em>arXiv preprint arXiv:2307.09288</em> (2023).</p><p>[4] Kaplan, Jared, et al. "Scaling laws for neural language models." <em>arXiv preprint arXiv:2001.08361</em> (2020).</p><p>[5] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." <em>Advances in neural information processing systems</em> 35 (2022): 27730-27744.</p><p>[6] Lambert, Nathan, et al. "T\" ulu 3: Pushing frontiers in open language model post-training." <em>arXiv preprint arXiv:2411.15124</em> (2024).</p><p>[7] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[8] Khalifa, Muhammad, et al. "Source-aware training enables knowledge attribution in language models." <em>arXiv preprint arXiv:2404.01019</em> (2024).</p><p>[9] Glaese, Amelia, et al. "Improving alignment of dialogue agents via targeted human judgements, 2022." <em>URL https://storage. googleapis. com/deepmind-media/DeepMind. com/Authors-Notes/sparrow/sparrow-final. pdf</em> (2022).</p><p>[10] Huang, Chengyu, et al. "Training language models to generate text with citations via fine-grained rewards." <em>arXiv preprint arXiv:2402.04315</em> (2024).</p><p>[11] Anthropic. &#8220;Mapping the Mind of a Large Language Model&#8221; <a href="https://www.anthropic.com/research/mapping-mind-language-model">https://www.anthropic.com/research/mapping-mind-language-model</a> (2025).</p><p>[12] Liu, Yang, et al. "G-eval: NLG evaluation using gpt-4 with better human alignment." <em>arXiv preprint arXiv:2303.16634</em> (2023).<br>[13] Meta. &#8220;The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation&#8221;<em><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">https://ai.meta.com/blog/llama-4-multimodal-intelligence/</a></em>(2025).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The papers that generate the largest interest tend to fall into this category; e.g., recent examples include <a href="https://arxiv.org/abs/2402.03300">GRPO</a>, <a href="https://arxiv.org/abs/2502.09992">diffusion LLMs</a>, and <a href="https://arxiv.org/abs/2411.15124">RLVR</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Specifically, Llama 3 was post-trained using only SFT and DPO, while Llama 4 uses a more sophisticated pipeline of SFT, online RL, and lightweight DPO; see <a href="https://cameronrwolfe.substack.com/i/161016210/post-training">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The rule of thumb for what constitutes &#8220;enough&#8221; manual data inspection is that it&#8217;s more than you want it to be. Seriously, spend more time manually inspecting your data. You won&#8217;t regret it!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For example, Llama 3 has a multi-stage pretraining process where select data sources (e.g., reasoning datasets) are emphasized more heavily in later stages to improve the model&#8217;s capabilities in certain domains; see <a href="https://magazine.sebastianraschka.com/i/147749119/pre-training-iii-annealing-on-high-quality-data">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Lexicographical ordering is a generalization of alphabetical ordering to support characters that go beyond the alphabet (e.g., numbers and symbols).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In [1], authors use the <code>\xff\xff</code> token as a separator between documents.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Assume that our dataset contains <code>T</code> tokens and that the vocabulary size of our tokenizer is ~64K, <em>meaning that each token ID can be represented with two bytes</em>. The list of token IDs for this dataset consumes <code>2T</code> bytes. The suffix array is a list of <code>T</code> indices that point to positions in the token array, where each index is represented with <code>log(2T)/8</code> bytes. If <code>2B &lt; T &lt; 500B</code>, indices can be stored using 5 bytes, meaning that the combined size of the token and suffix arrays is just <code>7T</code> bytes!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>These segments are just integers corresponding to the position of a matching span within the full token array. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Llama 4: The Challenges of Creating a Frontier-Level LLM]]></title><description><![CDATA[The full story behind Llama 4 and Meta's huge pivot in research strategy...]]></description><link>https://cameronrwolfe.substack.com/p/llama-4</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/llama-4</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 28 Apr 2025 09:33:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4bd4b93b-9169-433e-bfa2-0613e8816420_2376x1332.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qg3x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qg3x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 424w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 848w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qg3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png" width="1456" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:902679,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qg3x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 424w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 848w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!qg3x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc96b19e7-5f56-4869-8328-6bad04c093b2_2376x1302.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1, 2, 4, 6, 12])</figcaption></figure></div><p>The recent release of Llama 4 [1] was far from perfect, but there is a lot that can be learned from this new generation of models. Put simply, <em>Llama 4 is a massive pivot in Meta&#8217;s research direction</em>. In response to increasing competition, Meta is reinventing the Llama series and clearly pushing to create a frontier-level LLM.  Given that LLM development is an iterative process, such significant changes incur a lot of risk&#8212;<em>there&#8217;s a huge chance that these models will perform poorly at first</em>. For now, Llama 4 is perceived as a loss, but the long term success of Llama will be determined by Meta&#8217;s ability to quickly iterate and improve upon these models.</p><p>The most beautiful&#8212;<em>or frightening for model developers</em>&#8212;aspect of open LLM research is the fact that these learnings are happening in public. We have the ability to study key changes being made by Meta to reach parity with top models in the space. By studying these changes, we gain a better understanding of how modern, frontier-level LLMs are developed. In this overview, we will do exactly this by gaining a deep understanding of LLama 4 and related models. Then, we will use this understanding to analyze key trends in LLM research, the future of Llama, and the changes that Meta must make to succeed after Llama 4.</p><h2>Llama 4 Model Architecture</h2><p>We will first overview Llama 4 model architectures, emphasizing key changes relative to prior generations of Llama models. As we will see, the new Llama models use a drastically different model architecture, signaling a clear pivot in research direction and strategy. Whereas prior Llama variants emphasized simplicity and usability, Llama 4 makes an obvious push towards parity with frontier-level LLM labs&#8212;<em>both closed and open</em>&#8212;by adopting techniques that improve performance and efficiency at the cost of higher complexity and scale.</p><h4>Mixture-of-Experts (MoE)</h4><blockquote><p><em>&#8220;We make design choices that seek to maximize our ability to scale the model development process. For example, we opt for a standard dense Transformer model architecture with minor adaptations, rather than for a mixture-of-experts model to maximize training stability.&#8221;</em> - from Llama 3 paper [2]</p></blockquote><p>Instead of using a dense <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformer</a> (depicted below), Llama 4 are the first of the Llama models to use a <a href="https://cameronrwolfe.substack.com/p/nano-moe">Mixture-of-Experts (MoE)</a> architecture. Llama 3 avoided using an MoE for the purpose of stability and simplicity&#8212;<em>larger MoE models introduce extra complexity to training and inference</em>. With Llama 4, Meta falls in line with leading open (e.g., <a href="https://cameronrwolfe.substack.com/i/154340424/deepseek-v-and-deepseek-v">DeepSeek-v3</a> [4]) and proprietary models (e.g., <a href="https://semianalysis.com/2023/07/10/gpt-4-architecture-infrastructure/">GPT-4</a>) that have successfully adopted the MoE architecture. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5BkT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5BkT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 424w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 848w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 1272w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5BkT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png" width="1456" height="716" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:716,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5BkT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 424w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 848w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 1272w, https://substackcdn.com/image/fetch/$s_!5BkT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F379c5d72-9aca-4b50-bd9c-d5ad9454f477_1622x798.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The decoder-only transformer architecture</figcaption></figure></div><p>Put simply, dense models&#8212;<em>though simple and effective</em>&#8212;are difficult to scale. By using an MoE architecture, we can drastically improve the training (and inference) efficiency of very large models, thus enabling greater scale.</p><p><strong>What is an MoE?</strong> Most readers will be familiar with the motivation of using an MoE&#8212;<em>it is a modified version of the decoder-only transformer architecture that makes large models more compute efficient</em>. Most of the key ideas behind MoEs were proposed in the three papers below, and we will overview these ideas here.</p><ul><li><p><a href="https://arxiv.org/abs/1701.06538">The Sparsely-Gated Mixture-of-Experts Layer</a></p></li><li><p><a href="http://switch%20transformers/">Switch Transformers</a></p></li><li><p><a href="https://arxiv.org/abs/2202.08906">Stable and Transferable Mixture-of-Experts (ST-MoE)</a></p></li></ul><p>Compared to the decoder-only transformer, MoEs modify the feed-forward component of the transformer block. Instead of having a single feed-forward network in each block, we have several feed-forward networks, <em>each with their own independent weights</em>. We refer to each of these networks as an &#8220;expert&#8221;; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tPDR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Adding experts to a transformer block (<a href="https://arxiv.org/abs/2101.03961">source</a>)</figcaption></figure></div><p>To create an MoE architecture, we convert the transformer&#8217;s feed-forward layers into MoE&#8212;<em>or expert</em>&#8212;layers. Each expert in the MoE is identical in structure to the original, feed-forward network from that layer, and we usually convert only a subset of transformer layers into MoE layers; e.g., Llama 4 uses interleaved MoE layers where every other layer of the transformer becomes an expert layer.</p><blockquote><p><em>&#8220;Our new Llama 4 models are our first models that use a MoE architecture&#8230; MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><p><strong>Routing mechanism.</strong> Obviously, making multiple copies of each feed-forward network in the transformer does not improve compute efficiency. To get an efficiency gain, <em>we need to add sparsity</em>. In other words, we don&#8217;t use every expert in each MoE layer. Instead, we select a subset of experts (e.g., one or two experts)&#8212;<em>referred to as the &#8220;active&#8221; experts or parameters</em>&#8212;to use for each token. This selection is done by passing each token vector through a linear layer that outputs a probability distribution over the set of experts; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FZCc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FZCc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 424w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 848w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1272w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png" width="410" height="303.598615916955" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1156,&quot;resizeWidth&quot;:410,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FZCc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 424w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 848w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1272w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Selecting experts with a routing mechanism</figcaption></figure></div><p>From here, we can process each token using only the experts that receive the highest probability. By doing this, we only use a portion of the model&#8217;s total parameters for each token&#8212;<em>the number of active parameters is much smaller than the model&#8217;s total parameters</em>. For this reason, we can train models with a large number of total parameters while incurring only a fraction of their total compute cost.</p><blockquote><p><em>&#8220;The gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network.&#8221;</em> - <a href="https://arxiv.org/abs/1701.06538">source</a></p></blockquote><p><strong>Load balancing and training stability.</strong> If we train an MoE similarly to a standard dense model, several issues are likely to occur. First, the model will quickly learn to route all tokens to a single expert&#8212;<em>a phenomenon known as &#8220;routing collapse&#8221;</em>. Additionally, MoEs are more likely to experience numerical instabilities during training, potentially leading to a divergence in the training loss; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!efMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" width="459" height="269.36934306569344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1370,&quot;resizeWidth&quot;:459,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of a training divergence (<a href="https://cameronrwolfe.substack.com/p/nano-moe">source</a>)</figcaption></figure></div><p>To avoid these issues and ensure that training is stable, most MoEs employ a load-balancing loss during training, which rewards the MoE for assigning equal probability to experts and routing tokens uniformly. Load-balancing losses modify the underlying training objective of the LLM by adding an extra loss term to the standard, next-token prediction loss; see below. As such, <em>these auxiliary losses can impact the performance of the model</em>, which has led some popular MoE-based LLMs (e.g., DeepSeek-v3) to avoid them altogether.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yYzN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yYzN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 424w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 848w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 1272w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yYzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png" width="1456" height="381" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:381,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yYzN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 424w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 848w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 1272w, https://substackcdn.com/image/fetch/$s_!yYzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa69f7cc-41ac-4b4f-9a13-c7b791a31430_1836x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The auxiliary-loss-free load balancing strategy used by DeepSeek-v3 [4]</figcaption></figure></div><p>No statement is made in [1] as to the exact auxiliary losses used to train Llama 4 models (if any). To avoid training instability, we can use an auxiliary-loss-free load-balancing strategy similarly to DeepSeek-v3 and adopt a variety of <a href="https://cameronrwolfe.substack.com/i/155023686/best-practices-for-training-moes">extra tricks</a>; e.g., better weight initialization or selective precision. </p><p>The primary takeaway we should glean from this information is the simple fact that MoEs&#8212;<em>despite their many benefits</em>&#8212;are much harder to train compared to standard dense models. This is a classic tradeoff between simplicity and performance! These architectures are more complex. Therefore, there are more factors to consider and many more issues that can occur during training. For more details on MoE architectures and training, check out the links below.</p><ul><li><p><a href="https://cameronrwolfe.substack.com/p/moe-llms">Understanding MoE-based LLMs</a></p></li><li><p><a href="https://cameronrwolfe.substack.com/p/nano-moe">nanoMoE: Implementing an MoE-based LLM in PyTorch</a></p></li></ul><p><strong>Llama 4 architecture.</strong> Three flavors of Llama 4 models are presented in [1]:</p><ul><li><p><em>Scout</em>: 109B total parameters, 17B active parameters, 16 experts per layer.</p></li><li><p><em>Maverick</em>: 400B total parameters, 17B active parameters, 128 experts per layer.</p></li><li><p><em>Behemoth</em>: 2T total parameters, 288B active parameters, 128 experts per layer.</p></li></ul><p>Both the Llama 4 Scout and Maverick models are released openly&#8212;<em>under the <a href="https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE">Llama 4 community license agreement</a></em>&#8212;in [1], while the Behemoth model was just previewed (i.e., not yet released). Similarly to DeepSeek-v3, Llama 4 models use both shared and routed experts. For example, Llama 4 Maverick has one shared expert&#8212;<em>meaning that all tokens are passed to this expert with 100% probability</em>&#8212;and selects one active routed expert per token using a routing mechanism; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UlyU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UlyU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 424w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 848w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 1272w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UlyU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png" width="512" height="265.2937293729373" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1212,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UlyU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 424w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 848w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 1272w, https://substackcdn.com/image/fetch/$s_!UlyU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Depiction of shared and routed experts (from [3])</figcaption></figure></div><p>Relative to other popular MoEs, Llama 4 models have a very small number of active parameters. However, these architectural settings are not uncommon when compared to top industry labs:</p><ul><li><p>Scout optimizes for inference efficiency and is reminiscent of models like Gemini Flash or GPT-4o-mini.</p></li><li><p>Maverick has an architecture that is relatively similar to DeepSeek-v3 (i.e., sparse model with a very large number of experts). </p></li><li><p>Behemoth&#8212;<em>the most powerful model in the suite</em>&#8212;is a GPT-4-esque, multi-trillion parameter foundation model. </p></li></ul><p>However, there are still differences between Llama 4 models and other popular LLMs. Only a single routed expert is selected per layer in Llama 4, whereas DeepSeek has multiple shared experts and eight active routed experts per layer (i.e., 37B active parameters and 671B total parameters). This smaller number of active parameters improves both the training and inference efficiency of Llama 4. In fact, Llama 4 models were <a href="https://x.com/scaling01/status/1908657167869100482">reported to have used less compute during training</a> relative to Llama 3 despite a drastic increase in data and model scale. </p><p><strong>Fine-grained experts.</strong> One popular design choice made by several modern MoE-based LLMs (e.g., DeepSeek-v3 and <a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm">DBRX</a>) is the use of fine-grained experts. To use fine-grained experts, we just:</p><ol><li><p>Increase the number of experts in each MoE layer.</p></li><li><p>Decrease the size (number of parameters) for each individual expert.</p></li></ol><p>Usually, we also select a larger number of active experts in each layer to keep the number of active parameters (relatively) fixed in a fine-grained MoE model. We see both fine and coarse-grained experts used in the Llama 4 suite&#8212;<em>the Scout model has 16 total experts, while Maverick has 128 total experts</em>. Given that Maverick has 16&#215; the number of experts but only 4&#215; the number of total parameters compared to the smaller Scout model, it must be using fine-grained experts. </p><p>In contrast, both the Scout and Behemoth models use standard (coarse-grained) experts. There are a few different reasons that Meta may be making this choice. Generally, using fine-grained experts allows for more specialization among experts and can improve both performance and efficiency. However, <em>fine-grained experts also introduce added complexity into the distributed training process</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jK92!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jK92!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 424w, https://substackcdn.com/image/fetch/$s_!jK92!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 848w, https://substackcdn.com/image/fetch/$s_!jK92!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 1272w, https://substackcdn.com/image/fetch/$s_!jK92!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jK92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png" width="444" height="264.2112676056338" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1136,&quot;resizeWidth&quot;:444,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;tensor parallel vs expert parallel&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="tensor parallel vs expert parallel" title="tensor parallel vs expert parallel" srcset="https://substackcdn.com/image/fetch/$s_!jK92!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 424w, https://substackcdn.com/image/fetch/$s_!jK92!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 848w, https://substackcdn.com/image/fetch/$s_!jK92!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 1272w, https://substackcdn.com/image/fetch/$s_!jK92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f3c4895-4eb4-4f7a-974d-e53b1423af84_1136x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html">source</a>)</figcaption></figure></div><p>Experts are typically distributed across multiple GPUs during training (i.e., <a href="https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html">expert parallelism</a>); see above. When using coarse-grained experts, it is common for each GPU to store a single expert<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. However, we can usually fit multiple fine-grained experts into the memory of a single GPU. Additionally, because we usually select a larger number of experts when using fine-grained experts, we could run into an issue where each token has to be routed to multiple different GPUs in the cluster, thus creating a drastic increase in communication costs between GPUs. </p><blockquote><p><em>&#8220;We ensure that each token will be sent to at most &#119872; nodes, which are selected according to the sum of the highest &#119870; / &#119872; affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.&#8221;</em> - from DeepSeek-v3 paper [4]</p></blockquote><p>As a result, we must adopt some strategy to limit communication costs and improve training efficiency. For example, DeepSeek-v3 uses the node-limited routing scheme described above, which restricts the number of devices to which a single token can be routed. We can avoid this extra complexity by not using fine-grained experts. However, training both fine-grained and coarse-grained expert models also provides more configurability and choices to model users.  </p><p><strong>Impact to open LLMs. </strong>MoEs do not use all of their parameters during inference, but we still have to fit the model&#8217;s parameters into GPU memory. As a result, MoE-based LLMs have a much higher memory footprint&#8212;<em>and therefore require access to more and better GPUs</em>&#8212;relative to dense models<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Llama 4 Scout <em>&#8220;fits on a single H100 GPU (with <a href="https://arxiv.org/abs/2301.12017">Int4 quantization</a>)&#8221;</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, while Maverick needs <em>&#8220;a single H100 host&#8221;</em>. In other words, we cannot perform inference of the larger Maverick model using a single GPU&#8212;<em>we have to perform <a href="https://docs.vllm.ai/en/latest/serving/distributed_serving.html">distributed inference</a> on a multi-GPU host. </em></p><p>With all of these considerations in mind, we may start to realize that the migration of Llama to an MoE architecture is a double-edged sword:</p><ul><li><p>The Llama project takes a step towards parity with the most powerful (proprietary) LLMs and unlocks potential for creating better models.</p></li><li><p>The barrier to entry for using Llama models is increased.</p></li></ul><p>This dilemma has significant implications for open LLM research. Increasing the barrier to entry for open LLMs has significant side effects and will hinder the ability of those without significant GPU resources to conduct meaningful research. The open LLM community cannot continue to thrive its contributors are slowly priced out of doing research as models continue to advance.</p><blockquote><p><em>&#8220;The model that becomes the open standard doesn&#8217;t need to be the best overall model, but rather a family of models in many shapes and sizes that is solid in many different deployment settings&#8230; memory-intensive models like sparse MoEs price out more participants in the open community.&#8221;</em> - <a href="https://www.interconnects.ai/p/llama-4">Nathan Lambert</a></p></blockquote><p>To avoid this negative aspect of MoE architectures, we can distill larger MoE models into smaller dense models, <em>providing a suite of more user-friendly LLMs that still perform well</em>. This approach was adopted and popularized by DeepSeek-R1 [5]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, a 671B parameter MoE-based reasoning model that was distilled into several dense LLMs with sizes ranging from 1.5B to 70B parameters. One of the key findings from [5] is the fact that distillation is most effective when a very large and powerful model is used as a teacher. As we will see later in the overview, distillation from Llama 4 models is already being heavily explored.</p><h4>Native Multi-Modality and Early Fusion</h4><p>Multi-modal Llama models have been released in the past. The original Llama 3 publication [2] included <a href="https://cameronrwolfe.substack.com/i/158954054/extending-llama-to-images-and-video">preliminary experiments</a> with multi-modality, which were later productionized with the release of <a href="https://cameronrwolfe.substack.com/i/158954054/llama-medium-sized-vision-llms">Llama 3.2 Vision</a>. Key details of multi-modal Llama 3 models are outlined within the overview linked below. Similarly to prior model generations, Llama 4 models support visual inputs&#8212;<em>both images and videos</em>. However, as we will see in this section, Llama 4 takes a drastically different approach to multi-modality.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1caef8a0-3a80-48bd-bfe6-f8ce1cbb80e9&quot;,&quot;caption&quot;:&quot;After the popularization of text-based large language models (LLMs), one of the most important questions within the research community was how we could extend such powerful models to understand other modalities of data (e.g., images, video or speech). Research on multi-modal LLMs is promising for several reasons:&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Vision Large Language Models (vLLMs)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-31T09:34:01.673Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12372b06-0850-4b33-b8a8-dd01dd5662fb_2208x1218.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/vision-llms&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158954054,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:97,&quot;comment_count&quot;:8,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>Multi-modal architectures. </strong>Multi-modal LLMs have two primary components: an <em>LLM backbone</em> and a <em>vision encoder</em>. The LLM backbone is just a standard decoder-only transformer, while the vision encoder is usually a <a href="https://cameronrwolfe.substack.com/i/158954054/contrastive-language-image-pre-training-clip">CLIP</a> or <a href="https://cameronrwolfe.substack.com/i/158954054/vision-transformers-vit">ViT</a> model that converts an image into a set of corresponding embeddings; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rNP6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rNP6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 424w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 848w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png" width="360" height="297.9503105590062" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1066,&quot;width&quot;:1288,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rNP6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 424w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 848w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Using a vision encoder to produce image embeddings</figcaption></figure></div><p>Given these two components, a vision LLM (or vLLM for short) must learn how to properly fuse both visual and textual information. In other words, the LLM must somehow <em>i)</em> ingest the image embeddings and <em>ii)</em> use these embeddings as added context for generating text. There are two primary model architectures that can be used for this purpose (depicted below):</p><ol><li><p><em>Unified embedding</em>: concatenates both image and text tokens at the input layer to form a single input sequence that is processed by the LLM<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p></li><li><p><em>Cross-modality attention:</em> passes only text tokens as input to the LLM and fuses visual information into the model via additional cross-attention layers.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vc17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vc17!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 424w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 848w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 1272w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vc17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png" width="1456" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa676e40-5e09-4315-9fd1-90275964685e_2372x938.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:766152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vc17!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 424w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 848w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 1272w, https://substackcdn.com/image/fetch/$s_!Vc17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa676e40-5e09-4315-9fd1-90275964685e_2372x938.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Multi-modal architecture variants</figcaption></figure></div><p>These architectures both have their benefits. For example, cross-modality attention tends to be more efficient because we do not pass image embeddings through the entire LLM backbone. However, the unified embedding approach has the potential to yield better performance for the same exact reason!</p><p><strong>Multi-modal training.</strong> Given that vLLMs generate text as output, we still train them using <a href="https://cameronrwolfe.substack.com/p/language-model-training-and-inference?open=false#%C2%A7understanding-next-token-prediction">next token prediction</a>. Beyond the training objective, however, there are a few different choices of training strategies for these types of models:</p><ol><li><p><em>Native multi-modality</em>: train the vLLM from scratch using multi-modal data from the beginning.</p></li><li><p><em>Compositional multi-modality</em>: begin by training a separate LLM backbone and vision encoder, then perform extra training to fuse them together.</p></li></ol><p>Objectively speaking, native multi-modality introduces extra complexity into the training process (e.g., imbalances between modalities). Assuming that we can avoid these pitfalls, however, natively multi-modal training has massive potential&#8212;<em>it expands the scope and volume of data to which the model can be exposed</em>. For this reason, many top labs&#8212;<em>most</em> <em>notably <a href="https://blog.google/technology/ai/google-gemini-ai/">Google</a> and <a href="https://openai.com/index/image-generation-api/">OpenAI</a></em>&#8212;have adopted this approach, which was likely a motivating factor for the design of Llama 4.</p><blockquote><p><em>&#8220;Llama 4 models are designed with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone. Early fusion is a major step forward, since it enables us to jointly pre-train the model with large amounts of unlabeled text, image, and video data.&#8221; </em>- from Llama 4 blog [1]</p></blockquote><p>Prior Llama variants (e.g., Llama 3.2 Vision) use a cross-modality attention architecture and are trained with a compositional approach. In contrast, Llama 4 models are natively multi-modal and are pretrained from scratch using text, image and video data. Migrating to native multi-modality allows Llama 4 models to draw upon multiple modalities of data when constructing their massive 30T token pretraining dataset&#8212;<em>more than 2&#215; larger than that of Llama 3</em>.</p><p><strong>Early fusion.</strong> As indicated in the above quote, Llama 4 also adopts a unified embedding architecture instead of the cross-modality attention architecture that is used by Llama 3. In [1], the term <em>&#8220;early fusion&#8221;, </em>meaning that images and text are combined at the input-level of the LLM, is used to describe the architecture of Llama 4 models. Alternatively, &#8220;late fusion&#8221; architectures (e.g., cross-modality attention) combine image and text data in a later layer of the LLM. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XUZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XUZf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 424w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 848w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 1272w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XUZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png" width="1408" height="758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315453,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XUZf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 424w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 848w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 1272w, https://substackcdn.com/image/fetch/$s_!XUZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6347a27e-6a17-484b-aaae-69278a3dda75_1408x758.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Chameleon architecture (from [6])</figcaption></figure></div><p>Although authors do not provide many details on the architecture of Llama 4 in [1], we can look at Chameleon [6]&#8212;<em>a recent publication from Meta on the topic of native multi-modality and early fusion</em>&#8212;for hints on what might be happening in Llama 4. As shown above, the Chameleon architecture passes interleaved image and text tokens as a single sequence to a unified LLM backbone. This model is trained using a natively multi-modal approach and is even capable of generating images as output. Although no image generation capabilities are presented for Llama 4 in [1], we might expect such a capability in the near future based on Llama 4&#8217;s use of a Chameleon-style early fusion architecture and the <a href="https://openai.com/index/image-generation-api/">recent success of OpenAI</a> in image generation with natively multi-modal models. </p><blockquote><p><em>&#8220;This early-fusion approach, where all modalities are projected into a shared representational space from the start, allows for seamless reasoning and generation across modalities. However, it also presents significant technical challenges, particularly in terms of optimization stability and scaling.&#8221;</em> - from [6]</p></blockquote><p>In [6], authors mention that they experience a variety of unique difficulties when training Chameleon largely due to the model&#8217;s native multi-modality. Namely, Chameleon experiences more frequent training instabilities and is harder to scale compared to a standard text-based LLM. To get around these issues, a few notable modifications are made to the underlying transformer architecture:</p><ul><li><p>Layer norm is applied to the query and key vectors during attention<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p></li><li><p>An additional <a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html">dropout</a> module is added after each attention and feed-forward layer in the transformer.</p></li><li><p>The position of layer norm in the transformer block is modified (i.e., a post-norm structure is adopted instead of the more standard pre-norm [8]).</p></li></ul><p>The difficulties outlined in [6] clearly demonstrate the technical complexity of natively multi-modal training. Although Llama 4 is not confirmed to use any of the architectural tricks from Chameleon, these lessons are universally useful for any model trained using a natively multi-modal approach.</p><p><strong>The vision encoder.</strong> Although the Chameleon architecture largely matches the structure of the unified embedding model described above, the attentive reader may notice that Chameleon has no image encoder! Instead, we directly quantize images into discrete token embeddings, as described in <a href="https://arxiv.org/abs/2203.13131">this paper</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!amc9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!amc9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 424w, https://substackcdn.com/image/fetch/$s_!amc9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 848w, https://substackcdn.com/image/fetch/$s_!amc9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 1272w, https://substackcdn.com/image/fetch/$s_!amc9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!amc9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png" width="1456" height="530" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:510931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!amc9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 424w, https://substackcdn.com/image/fetch/$s_!amc9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 848w, https://substackcdn.com/image/fetch/$s_!amc9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 1272w, https://substackcdn.com/image/fetch/$s_!amc9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5291bb46-b63b-4217-9dba-c21fbca3ed57_2584x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Fuyu model architecture (from [7])</figcaption></figure></div><p>Chameleon is not the first model to forgo image encoders and directly pass image info as input to an LLM. Fuyu [7] breaks images into patches&#8212;<em>like a standard ViT</em>&#8212;and linearly projects these patches to make them the same size as a text token vector. Then, the LLM can directly ingest these image patch embeddings as input. The main motivation for this approach is the fact that relevant information from the image may be lost when we pass that image through a vision encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uiK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uiK2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 424w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 848w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 1272w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uiK2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png" width="1456" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:385418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uiK2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 424w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 848w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 1272w, https://substackcdn.com/image/fetch/$s_!uiK2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563cde53-5b66-4b9b-822d-cdaf4d34336c_1952x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MetaCLIP performance relative to the original CLIP model (from [9])</figcaption></figure></div><p>Unlike Chameleon, authors confirm in [1] that Llama 4 uses a vision encoder that is based upon MetaCLIP [9]&#8212;<em>an open replication of CLIP that emphasizes training data transparency</em>. Llama 3 uses the same architecture for its vision encoder. However, the Llama 4 vision encoder is trained in conjunction with an LLM to both <em>i)</em> improve the quality of its embeddings and <em>ii)</em> better align the visual embeddings with textual embeddings from the LLM. </p><blockquote><p><em>&#8220;We also improved the vision encoder in Llama 4. This is based on MetaCLIP but trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><h4>10M Token Context Window</h4><p>Long context understanding is important, both for solving tasks that naturally require long context (e.g., multi-document summarization) and <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning-based use cases</a>. Many top labs have released models with massive context windows to enable more long context applications. The release of Llama 4 follows the trend towards longer context and tries to set a new state-of-the-art in this area. As we will learn, however, enabling long context is highly complex and typically requires the (correct) integration of numerous interrelated techniques into the LLM. </p><p><strong>10M token context.</strong> Extending upon Llama 3&#8217;s context length of 128K tokens, Llama 4 Scout has an industry-leading context length of 10M tokens. The model is pretrained with a context length of 256K tokens, but the 10M token context is made possible via a variety of tricks involving modified position embeddings, scaled softmax, and long-context focused training procedures. Let&#8217;s dive deeper into the details of these techniques to understand exactly how they work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T5kl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T5kl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 424w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 848w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1272w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png" width="1456" height="695" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:695,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T5kl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 424w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 848w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1272w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard scaled dot-product self-attention operation</figcaption></figure></div><p><strong>Position embeddings</strong> help the transformer to understand the order of tokens in a sequence; e.g., which token comes first, second, third and so on. Explicit position information is necessary because <a href="https://cameronrwolfe.substack.com/i/142044446/the-self-attention-operation">self-attention</a> does not naturally consider the ordering of a sequence. Rather, all tokens in the sequence are considered simultaneously&#8212;<em>agnostic of position</em>&#8212;as we compute attention scores between them; see above. By using position embeddings, we can directly inject position information into the embedding of each token, allowing self-attention to use this information and learn patterns in the ordering of tokens. Many position encoding schemes exist, such as standard <a href="https://arxiv.org/abs/1706.03762">Absolute Position Embeddings (APE)</a>, <a href="https://arxiv.org/abs/2104.09864">Rotary Position Embeddings (RopE)</a> [11], <a href="https://arxiv.org/abs/2108.12409">Attention with Linear Biases (ALiBi)</a>, and more. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s0ac!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0ac!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 424w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 848w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1272w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png" width="498" height="295.5164835164835" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:203429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s0ac!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 424w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 848w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1272w, https://substackcdn.com/image/fetch/$s_!s0ac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02efc279-6e0a-43ab-b166-5d8c08d5cca9_1644x976.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Absolute position embeddings</figcaption></figure></div><p><strong>RoPE explained.</strong> The <a href="https://www.google.com/search?q=attention+is+all+you+need&amp;rlz=1C5GCCM_en&amp;oq=attention+is+all+you+need&amp;gs_lcrp=EgZjaHJvbWUqDAgAEEUYOxixAxiABDIMCAAQRRg7GLEDGIAEMgcIARAAGIAEMgcIAhAAGIAEMgYIAxBFGEAyBggEEEUYPTIGCAUQRRhAMgYIBhBFGEAyBggHEEUYQNIBCDMwMzdqMGo0qAIAsAIA&amp;sourceid=chrome&amp;ie=UTF-8">original transformer architecture</a> uses an absolute position embedding scheme that adds a fixed position embedding to each token vector at the model&#8217;s input layer based upon the token&#8217;s absolute position in the sequence; see above. Today, LLMs more frequently use relative position embeddings that consider distances between tokens instead of absolute position. By using relative position embeddings, we can achieve better performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> and make the attention mechanism more generalizable to sequences of different lengths. The most commonly-used position encoding scheme for LLMs is RoPE [11] (depicted below), which is used by both the Llama 3 [2] and Llama 4 [1].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FT7A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FT7A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 424w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 848w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1272w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png" width="474" height="297.87774725274727" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:217820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FT7A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 424w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 848w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1272w, https://substackcdn.com/image/fetch/$s_!FT7A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d51b4b1-deb9-4a2f-8b5f-9f7b683c9866_1566x984.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [11])</figcaption></figure></div><p>RoPE is a hybrid of absolute and relative position embeddings that operates by modifying the query and key vectors in self-attention. Unlike absolute position embeddings, RoPE acts upon every transformer layer&#8212;<em>not just the input layer</em>. In the standard transformer architecture, we produce key and query vectors by linearly projecting the sequence of token vectors for a given layer. For a single token in the input sequence, we can formulate this operation as shown below, where we linearly project a single token embedding. The figure below displays the creation of a key vector, but we follow the same exact approach&#8212;<em>with a different weight matrix</em>&#8212;to produce query and value vectors too. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fsp7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fsp7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 424w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 848w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1272w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png" width="446" height="257.3429158110883" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:974,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:48307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fsp7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 424w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 848w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1272w, https://substackcdn.com/image/fetch/$s_!fsp7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66db2a1d-b210-4464-a0aa-278c522601fe_974x562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Projecting a token vector to form a key in self-attention</figcaption></figure></div><p>RoPE incorporates position information into the creation of key and query vectors by multiplying the weight matrix used in the above operation by a unique <a href="https://en.wikipedia.org/wiki/Rotation_matrix">rotation matrix</a>. Here, this rotation matrix is computed based upon the absolute position of a token in the sequence&#8212;<em>the amount that a given vector is rotated depends upon its position in the sequence.</em> This modified operation is shown below, where we again depict the creation of key vectors. The same strategy is applied to the creation of query vectors, but we do <em>not</em> modify the creation of value vectors. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IEiI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IEiI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 424w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 848w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1272w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png" width="550" height="234.2032967032967" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:88995,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IEiI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 424w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 848w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1272w, https://substackcdn.com/image/fetch/$s_!IEiI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660d68d7-f108-493e-b010-1e1c7205a1a6_1466x624.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Incorporating position information via a rotation matrix</figcaption></figure></div><p>Here, &#952; is a vector called the rotational (or frequency) basis. We have a function <code>R</code> that takes the rotational basis &#952; and the position of the token in the sequence <code>i</code> as input and produces a rotation matrix as output. The rotation matrix is a <a href="https://mathworld.wolfram.com/BlockDiagonalMatrix.html">block-diagonal matrix</a> that is constructed as shown in the equation below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!63HZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!63HZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 424w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 848w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1272w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png" width="498" height="192.56456043956044" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1456,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:176011,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!63HZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 424w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 848w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1272w, https://substackcdn.com/image/fetch/$s_!63HZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9963ed9d-67e5-4587-ac67-f1cdea075570_1670x646.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Structure of the rotation matrix in RoPE (from [14])</figcaption></figure></div><p>This matrix is block diagonal and each block in the matrix is a <code>2 &#215; 2</code> rotation matrix. Each of these blocks rotates a pair of two dimensions within the output key (or query) embedding. As a result, each pair of dimensions in the resulting embedding is rotated based upon both the absolute position of the token in the sequence <code>i</code> and the entry of the rotational basis &#952; corresponding to that pair of dimensions. We apply this rotation matrix when producing both the key and query vectors for self-attention in every transformer layer. These modifications yield the attention operation shown below, where every key and query vector is rotated according to their absolute position in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dwRu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dwRu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 424w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 848w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1272w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png" width="526" height="317.9120879120879" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1456,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:160171,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dwRu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 424w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 848w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1272w, https://substackcdn.com/image/fetch/$s_!dwRu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab69c09a-f632-4e18-9cc3-16cd92cd8fb2_1548x936.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Rotated keys and queries for self-attention in RoPE</figcaption></figure></div><p>When we take this standard outer product between the rotated keys and queries, however, something interesting happens. The two rotation matrices&#8212;<em>used to rotate the keys and queries, respectively</em>&#8212;combine to form a single rotation matrix <code>R(&#952;, n - m)</code>. In other words, <em>the combination of rotating both the key and query vectors in self-attention captures the relative distance between tokens in the sequence</em>. This is the crux of RoPE! Although we might struggle to understand the purpose of these rotation matrices at first, we now see that they inject the relative position of each token pair directly into the self-attention mechanism!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bZNb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bZNb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 424w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 848w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bZNb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png" width="1456" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230744,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bZNb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 424w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 848w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 1272w, https://substackcdn.com/image/fetch/$s_!bZNb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1c2937-a1c7-4cda-b7a9-c078731694c9_2186x1006.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p><strong>Length generalization.</strong> If we provide a sequence to an LLM that is much longer than the sequences upon which the model was trained, the performance of the model will drastically deteriorate. Position embeddings play a key role in an LLM&#8217;s ability to generalize to longer context lengths. Ideally, we want to use a position encoding scheme that allows the model to generalize more easily to context lengths that go beyond what is seen during training!</p><blockquote><p><em>&#8220;Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding has been identified as a major factor influencing length generalization.&#8221;</em> - from [12]</p></blockquote><p>Recently, researchers showed that the most common position encoding schemes for LLM&#8217;s&#8212;<em>including RoPE</em>&#8212;fail to generalize well to long context lengths [12]; see below. Even though RoPE is generally considered to be a relative position encoding scheme, it performs similarly to absolute position encodings when generalizing to long context lengths. However, the No Positional Embedding (NoPE) scheme proposed in [12], which simply removes position embeddings from the model, is surprisingly capable of generalizing to longer contexts. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wdXa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wdXa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 424w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 848w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1272w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:321442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wdXa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 424w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 848w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1272w, https://substackcdn.com/image/fetch/$s_!wdXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d1db51-b8a6-43c7-a998-50fa94d5f05e_1674x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [12])</figcaption></figure></div><p>The fact that NoPE works well is surprising, but empirical (and theoretical) analysis in [12] reveals that transformers can represent both relative and absolute position encodings without using explicit position embeddings. Practically, the attention patterns learned by NoPE are shown to resemble relative position encodings in [12]; see above. Drawing upon these results, Llama 4 models interleave standard transformer layers that use RoPE with layers using NoPE. This approach, called interleaved RoPE (iRoPE), improves long context abilities.</p><blockquote><p><em>&#8220;A key innovation in the Llama 4 architecture is the use of interleaved attention layers <a href="https://arxiv.org/abs/2305.19466">without positional embeddings</a>. Additionally, we employ <a href="https://arxiv.org/pdf/2501.19399">inference time temperature scaling</a> of attention to enhance length generalization.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><p><strong>Temperature scaling.</strong> Every transformer layer has a softmax transformation within its attention mechanism. Softmax is computed for element <code>i</code> of an <code>N</code>-dimensional vector as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!07Vl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!07Vl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 424w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 848w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 1272w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!07Vl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png" width="258" height="101.86167146974063" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/838ca808-966d-4bce-847b-64003d7525e3_694x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:694,&quot;resizeWidth&quot;:258,&quot;bytes&quot;:31909,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!07Vl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 424w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 848w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 1272w, https://substackcdn.com/image/fetch/$s_!07Vl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F838ca808-966d-4bce-847b-64003d7525e3_694x274.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The denominator of this expression&#8212;<em>the sum of raw attention scores for all pairs of tokens in the sequence</em>&#8212;will become larger with increasing context length, but the numerator is decoupled from the context length and fixed in magnitude. These two facts create an interesting phenomenon in attention scores for long contexts: <em>attention scores get smaller as the context length grows larger</em>. To mitigate this issue, authors in [13] propose Scalable-Softmax, which is formulated as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VKik!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VKik!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 424w, https://substackcdn.com/image/fetch/$s_!VKik!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 848w, https://substackcdn.com/image/fetch/$s_!VKik!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 1272w, https://substackcdn.com/image/fetch/$s_!VKik!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VKik!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png" width="406" height="101.68555758683729" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:1094,&quot;resizeWidth&quot;:406,&quot;bytes&quot;:44947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VKik!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 424w, https://substackcdn.com/image/fetch/$s_!VKik!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 848w, https://substackcdn.com/image/fetch/$s_!VKik!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 1272w, https://substackcdn.com/image/fetch/$s_!VKik!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faedb1c15-c15a-4d11-8481-3ec9fca19bfc_1094x274.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Similarly to standard softmax, Scalable-Softmax is just a function that converts a vector of values into a valid probability distribution. However, this variant of the softmax introduces two new and important factors:</p><ul><li><p><code>s</code>: a scaling parameter that can be tuned to change the function&#8217;s shape.</p></li><li><p><code>N</code>: the length of the input vector. </p></li></ul><p>By including the length of the input vector in Scalable-Softmax, we can balance the scale of the numerator and denominator, prevent long context attention scores from decaying and improve long context capabilities; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nNkn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nNkn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 424w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 848w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nNkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png" width="366" height="749.3368421052631" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1556,&quot;width&quot;:760,&quot;resizeWidth&quot;:366,&quot;bytes&quot;:352798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nNkn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 424w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 848w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 1272w, https://substackcdn.com/image/fetch/$s_!nNkn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3f99960-52bf-4e47-a39b-d41d92cd2d29_760x1556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p>As mentioned in [1], Llama 4 models adopt a similar approach that scales the <a href="https://stats.stackexchange.com/questions/527080/what-is-the-role-of-temperature-in-softmax">temperature</a> of the softmax function at inference time to avoid attention scores from decaying at very large context lengths. Lowering the softmax temperature makes the resulting distribution more pointed, while increasing the temperature makes the distribution more uniform. We can simply lower the temperature of softmax at long context lengths to balance attention scores. Such inference-time tricks are useful, but they also complicate the inference process of Llama 4, <em>thus increasing the likelihood of <a href="https://x.com/Ahmad_Al_Dahle/status/1909302532306092107">detrimental bugs and implementation differences</a></em>.</p><p><strong>Context extension.</strong> Finally, in addition to the strategies outlined so far, we need to train the LLM to support long context. Usually, we do not just pretrain the LLM with long context. Such an approach is sub-optimal because the memory requirements of training on long sequences are very high. Instead, we can train the model in two stages:</p><ol><li><p>Standard pretraining with lower context length. </p></li><li><p>Finetuning on a long context dataset, <em>also known as &#8220;context extension&#8221;</em>. </p></li></ol><p>For example, Llama 4  Scout is pretrained with a 256K context length prior to having its context extended during a later stage of training.</p><blockquote><p><em>&#8220;We continued training the model in [mid-training] to improve core capabilities with new training recipes including long context extension using specialized datasets. This enabled us to enhance model quality while also unlocking best-in-class 10M input context length for Llama 4 Scout.&#8221;</em> - from Llama 4 blog [1] </p></blockquote><p>By dedicating a specific finetuning stage to context extension, we can limit the amount of training performed with ultra-long sequences. In most cases, the training data used for context extension is synthetic&#8212;<em>either created with heuristics or an LLM</em>&#8212;due to the difficulty of collecting real long-context data. As we will see, the quality of the synthetic data used for context extension can drastically impact the model&#8217;s capabilities. <em>This data must accurately resemble and capture the types of tasks that the model will solve in practice</em>. As we will see, the long context abilities of Llama 4 models break down in practice, possibly due to this issue.</p><div id="youtube2-dc4chADushM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;dc4chADushM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/dc4chADushM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In [1], authors do not mention the exact methods used for extending the context of Llama 4. However, we can overview some commonly-used techniques in the literature to provide inspiration for the context extension techniques that Llama 4 likely used. As described perfectly in the above video, there are two main categories of approaches used for extending the context of an LLM:</p><ul><li><p><em>Position Interpolation</em>: these techniques adjust the frequency basis of RoPE<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> such that larger positions still fit within the model&#8217;s &#8220;known&#8221; context length; e.g., <a href="https://arxiv.org/abs/2306.15595">Position Interpolation</a>, <a href="https://arxiv.org/abs/2306.15595">NTK-RoPE</a>, <a href="https://arxiv.org/abs/2309.00071">YaRN</a>, and <a href="https://arxiv.org/abs/2310.16450">CLEX</a>. </p></li><li><p><em>Approximate Attention</em>: these techniques modify the structure of attention to only consider certain groups of tokens (e.g., based on a <a href="https://arxiv.org/abs/2309.12307">blocks</a>, <a href="https://arxiv.org/abs/2305.16300">landmarks</a> or a <a href="https://arxiv.org/abs/2308.16137">sliding window</a>) when computing attention scores.</p></li></ul><p>An extensive analysis of these approaches is provided in [14], where we see that position interpolation-style methods tend to perform the best. In particular, NTK-RoPE achieves very impressive performance due to its ability to dynamically adjust frequencies in RoPE so that the frequency of nearby tokens is not changed too much. These techniques are very commonly used for training LLMs. As a concrete example, see page four of <a href="https://arxiv.org/abs/2412.15115">the Qwen-2.5 report</a> where authors describe increasing the base frequency of RoPE before performing long context training. </p><h2>Training Llama 4</h2><p>In addition to its completely revised architecture, Llama 4 uses a new training pipeline that makes significant modifications to both pre and post-training. Again, many of these changes introduce extra complexity for the purpose of better performance and are inspired by techniques that have been successfully adopted within frontier-level research labs. Interestingly, the training process for the smaller Llama 4 Maverick and Scout models also heavily leverages knowledge distillation from the much larger Behemoth model. </p><h4><strong>Pretraining</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rfvy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rfvy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 424w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 848w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rfvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png" width="1456" height="992" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:992,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rfvy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 424w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 848w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!rfvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15fbad1d-acda-4c67-96c3-d48444e083ae_1920x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MoE architecture for Llama 4 (from [1])</figcaption></figure></div><p>Significantly tweaking the pretraining process for an LLM is both risky and rare given that <em>i)</em> pretraining is very expensive and <em>ii)</em> <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">techniques for pretraining and scaling</a> are heavily studied and (relatively) solidified. However, the native multi-modality and MoE-based architecture of Llama 4 warrant some changes to the pretraining process that we will quickly overview in this section. </p><p><strong>Native multi-modality.</strong> As mentioned previously, Llama 4 models are pretrained over a massive 30T token dataset comprised of text, images and videos. However, this dataset is not just multi-modal, it&#8217;s also highly multilingual and contains data from 200 languages. Over 100 of these languages have at least 1B training tokens associated with them, <em>providing a 10&#215; increase in multilingual data relative to Llama 3</em>. This multilingual emphasis is not surprising given Meta&#8217;s prior investments into machine translation research, most notably their <a href="https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/">No Language Left Behind (NLLB) model</a> that also supports 200 languages. </p><blockquote><p><em>&#8220;In the final stages of pre-training, we train on long sequences to support context windows of up to 128K tokens. We do not train on long sequences earlier because the compute in self-attention layers grows quadratically in the sequence length.&#8221;</em> - from Llama 3 paper [2]</p></blockquote><p>Llama 4 models are pretrained using a context length of 256K tokens, which is quite large compared to prior models. For example, Llama 3 is originally pretrained with a context length of 8K, which is later increased to 128K via a six-stage context extension process. This extended context length speaks to the efficiency of the pretraining process with Llama 4&#8217;s new MoE architecture and is needed for multi-modal pretraining. Namely, Llama 4 receives up to 48 images&#8212;<em>either standalone images or still frames from a video</em>&#8212;in its input sequence during pretraining and provides good results with up to eight images during testing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_iiJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_iiJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 424w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 848w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_iiJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png" width="1456" height="1102" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1102,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:800962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_iiJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 424w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 848w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!_iiJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ba82b9e-e187-48ec-ab5d-7edb999fcdb1_1488x1126.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Interleaved images and text (from [6])</figcaption></figure></div><p>Given that Llama 4 uses a (Chameleon-style) unified embedding architecture, images and video stills can be arbitrarily interleaved within the model&#8217;s input sequence; see above. Here, visual tokens are just another token in the model&#8217;s input sequence and are treated similarly to a standard text token. Unlike Llama 3, the Llama 4 blog [1] does not explicitly mention the use of a <a href="https://cameronrwolfe.substack.com/i/158954054/from-images-to-videos">Perceiver Resampler</a> for ingesting video data. Instead, it seems&#8212;<em>based on wording in the blog post</em>&#8212;that the model might just ingest still video frames and learn temporal patterns from the position of each token within the input.</p><blockquote><p><em>&#8220;Compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.&#8221;</em> - from [4] </p></blockquote><p><strong>Low precision training.</strong> Authors in [1] mention that Llama 4 models are trained using FP8 precision. DeepSeek-v3 [4] was the first open model to successfully use FP8 precision for large-scale pretraining. Mixed precision training is common, but FP8 is an aggressive precision setting&#8212;<em>most training is performed with higher precision like </em><code>bfloat16</code>. Plus, MoEs are <a href="https://cameronrwolfe.substack.com/i/155023686/best-practices-for-training-moes">especially sensitive to mixed precision training</a> due to their increased likelihood of training instability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DVzR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DVzR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 424w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 848w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 1272w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DVzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png" width="1234" height="347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d62f43a-68b9-46fb-8817-bbd8101edcfa_1234x460.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DVzR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 424w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 848w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 1272w, https://substackcdn.com/image/fetch/$s_!DVzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26c333a4-8f03-4704-b8d3-83f9f2b73cc8_1234x347.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">FP8 training framework used by DeepSeek-v3 (from [4])</figcaption></figure></div><p> Few details are provided on the FP8 scheme used for training Llama 4, but the implementation is likely to resemble that of DeepSeek-v3; see above. The main issue with FP8 training is the presence of outliers within the activations, weights and gradients of an LLM&#8212;<em>truncating the precision of large numbers leads to round-off errors that create instabilities during training</em>. To avoid this issue, DeepSeek-v3 proposes a novel FP8 quantization scheme that performs fine-grained quantization of 1D tiles or 2D blocks of values within the model. By performing quantization over finer-grained groups, we minimize round-off errors.</p><p><strong>Curriculum learning.</strong> Finally, Llama 4 is also pretrained in multiple stages, including both the standard pretraining phase and an additional training phase&#8212;<em>referred to as &#8220;mid-training&#8221; in [1]</em>&#8212;with a different data mixture that emphasizes key domains and specific model capabilities (e.g., long context understanding). This strategy of annealing the mixture of data being used toward the end of pretraining is common for LLMs. For example, Llama 3 uses a similar strategy with a high-quality annealing dataset (see page 56 in <a href="https://arxiv.org/abs/2407.21783">the paper</a>) and entire papers have even been published on exactly this topic [10]!</p><h4>Post-Training</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fbt4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fbt4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 424w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 848w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fbt4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png" width="1456" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231345,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fbt4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 424w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 848w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!Fbt4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18dc7ee3-961e-4fd6-b1dc-5fe7d91826c5_2660x1064.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Post-training for Llama 3 (from [2])</figcaption></figure></div><p>One of the most fascinating aspects of Llama 3 is the simplicity of its post-training pipeline, which includes several rounds of <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> and direct preference optimization (DPO) [18]; see above. Given that DPO does not require the training of a separate reward model like <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">PPO-based RLHF</a>, this strategy is more user friendly in terms of the required GPU resources. However, we see with Llama 4 that such a basic alignment strategy comes at the cost of model performance. Post-training is one of the fastest-moving domains of LLM research, and a more sophisticated approach is needed to match top models. For a more general overview of LLM post-training, see the video below.</p><div id="youtube2-6yIMb0K-aS4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;6yIMb0K-aS4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/6yIMb0K-aS4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Post-training for Llama 4.</strong> Post-training for Llama 4 has three key stages:</p><ol><li><p><em>Lightweight SFT</em>: supervised training over a small (and highly-curated) set of completions for difficult prompts.</p></li><li><p><em>Online RL</em>: large-scale RL training focused on improving model capabilities in several areas (e.g., multi-modality, reasoning, conversation and more).</p></li><li><p><em>Lightweight DPO</em>: a short additional training phase used to fix minor issues and corner cases in model response quality.</p></li></ol><p>Put simply, Llama 4 makes a heavier investment into RL training, adopting a more sophisticated post-training strategy that relies upon large-scale RL to develop key model capabilities like reasoning and conversation. However, most details on the exact RL settings used for Llama 4 are excluded from [1]. Again, we will have to rely on recent research to provide hints on Llama 4&#8217;s approach.</p><blockquote><p><em>&#8220;We found that doing lightweight SFT followed by large-scale reinforcement learning (RL) produced significant improvements in reasoning and coding abilities.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nKHY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nKHY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 424w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 848w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nKHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif" width="494" height="370.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:640,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:481395,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nKHY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 424w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 848w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 1272w, https://substackcdn.com/image/fetch/$s_!nKHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e3b5bc-a219-4219-83e0-80075efa1f01_640x480.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://huggingface.co/learn/deep-rl-course/en/unitbonus3/offline-online">source</a>)</figcaption></figure></div><p><strong>Online vs. offline RL.</strong> In [1], authors emphasize the use of online RL for training Llama 4, <em>but what does this mean?</em> As detailed in <a href="https://huggingface.co/learn/deep-rl-course/en/unitbonus3/offline-online">this blog</a>, we can either adopt an online or offline approach when training an LLM (or any other model) with RL. The difference between these strategies lies in how we collect training data:</p><ul><li><p>Online RL trains the LLM on data collected from the current model&#8212;<em>the training data is coming from the LLM itself</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p></li><li><p>Offline RL trains the LLM on historical data; e.g., from prior versions of the LLM or another LLM.</p></li></ul><p>The key distinguishing feature of online RL is the presence of on-policy sampling (i.e., sampling training data directly from the current LLM). Generally, offline RL is considered to be both cheaper and easier to implement. However, recent papers have shown that online RL offers a clear performance benefit [18]. </p><p><strong>Relation to reasoning research.</strong> Interestingly, authors in [1] find that using only SFT and DPO can &#8220;over-constrain&#8221; the LLM&#8217;s performance&#8212;<em>especially in domains that require complex reasoning like math and code</em>&#8212;by allowing for less exploration during the RL training phase. Recent reasoning research (e.g., <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> and <a href="https://arxiv.org/abs/2501.12599">Kimi-1.5</a>) comes to conclusions that are very similar. The impressive reasoning capabilities of recent models are enabled by large-scale training with RL and less emphasis is placed upon supervised training; e.g., the initial <a href="https://cameronrwolfe.substack.com/i/153722335/deepseek-r-zero">DeepSeek-R1-Zero</a> model is actually post-trained using pure RL with no SFT!</p><blockquote><p><em>&#8220;The self-evolution of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously.&#8221;</em> - from [1]</p></blockquote><p>Recent reasoning models make heavy use of <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">RL from verifiable rewards (RLVR)</a>; see below. Unlike standard RLHF that derives a reward signal from an LLM-based reward model that is trained on human preferences, RLVR uses reward signals that are deterministic. For example, the reward on a math question could simply check whether the LLM&#8217;s answer matches the ground truth answer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mzxO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" width="1456" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arxiv.org/abs/2411.15124">source</a>)</figcaption></figure></div><p>In [1], authors openly categorize Llama 4 as a &#8220;non-reasoning model&#8221;, indicating that Llama 4 is (most likely) post-training using a more standard RLHF setup&#8212;<em>though it is still very likely that RLVR is used at least in part</em>&#8212;and is not trained to leverage <a href="https://cameronrwolfe.substack.com/i/153722335/initial-reasoning-models-o-and-o-mini">long chains of thought</a> when solving problems. Based recent trends in LLM research, however, we should not be surprised is a reasoning-variant of Llama 4 is released in the near future. For example, DeepSeek-R1 was an extension of the previously-released DeepSeek-v3 (non-reasoning) model.</p><p><strong>Data mixing and curation.</strong> Beyond using new algorithms, authors emphasize the importance of data curation and curriculum learning in the post-training process for Llama 4. Over 50% of the data available for SFT is removed from the training process by using an <a href="https://cameronrwolfe.substack.com/p/llm-as-a-judge">LLM judge</a> (i.e., this is just a prior Llama model) to identify and remove easy examples, <em>thus focusing post-training on more difficult data</em>. For the Behemoth model, an even larger portion (95%) of this data is removed. </p><blockquote><p><em>&#8220;We also found that dynamically filtering out prompts with zero advantage during training and constructing training batches with mixed prompts from multiple capabilities were instrumental in providing a performance boost on math, reasoning, and coding.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><p>A similar strategy is used during online RL by alternating between training the model and using it to identify hard training prompts. In particular, prompt difficulty is assessed using <a href="https://www.philschmid.de/agents-pass-at-k-pass-power-k">pass@k analysis</a>, which generates <code>k</code> completions with the LLM and checks how many of them are correct. Notably, a nearly identical technique is adopted by Kimi-1.5 (see Section Two of <a href="https://arxiv.org/abs/2501.12599">this paper</a>) to assess prompt difficulty and develop a curriculum learning strategy. As detailed in the above quote, Llama 4 also adopts some additional tricks for identifying hard prompts and mixes data from multiple domains in each training batch to achieve a good balance in model capabilities (e.g., conversation, reasoning, coding and more). </p><h4>Model Distillation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!16qD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!16qD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 424w, https://substackcdn.com/image/fetch/$s_!16qD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 848w, https://substackcdn.com/image/fetch/$s_!16qD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 1272w, https://substackcdn.com/image/fetch/$s_!16qD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!16qD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png" width="1920" height="729" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:729,&quot;width&quot;:1920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!16qD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 424w, https://substackcdn.com/image/fetch/$s_!16qD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 848w, https://substackcdn.com/image/fetch/$s_!16qD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 1272w, https://substackcdn.com/image/fetch/$s_!16qD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc65edf3f-2618-4712-8339-3e37ded9e142_1920x729.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Beyond releasing the Llama 4 Scout and Maverick models in [1], authors also preview<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Llama 4 Behemoth&#8212;<em>a much larger natively multi-modal MoE with 288B active parameters, 16 experts and 2T total parameters</em>. The key performance metrics of the Llama 4 Behemoth model are presented in the table above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xySY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xySY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 424w, https://substackcdn.com/image/fetch/$s_!xySY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 848w, https://substackcdn.com/image/fetch/$s_!xySY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!xySY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xySY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png" width="1456" height="614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:568468,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xySY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 424w, https://substackcdn.com/image/fetch/$s_!xySY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 848w, https://substackcdn.com/image/fetch/$s_!xySY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!xySY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c601ca0-a535-4d7f-874d-6b94c8e7763e_2488x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arxiv.org/abs/2006.05525">source</a>)</figcaption></figure></div><p>Despite the impressive performance of Llama 4 Behemoth, this model is primarily used for the purpose of knowledge distillation [15]. In other words, we use Llama 4 Behemoth as a teacher when training other Llama 4 models.</p><blockquote><p><em>&#8220;These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world&#8217;s smartest LLMs.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><p><strong>What is distillation?</strong> Given a input sequence of token vectors, an LLM outputs an equally-sized set of (transformed) token vectors. We can pass each of these output vectors through the LLM&#8217;s classification-based <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> head&#8212;<em>this is usually just implemented as an additional <a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">linear layer</a></em>&#8212;and apply softmax to obtain a probability distribution over the set of potential next tokens. Therefore, the LLM&#8217;s final output is a list of vectors representing next token probability distributions at each position in the input sequence; see below.</p><pre><code>import torch
import torch.nn.functional as F

seq_len = 128
d = 768  # size of token embeddings
vocab_size = 32678

# classification head for next token prediction
ntp_head = torch.nn.Linear(in_features=d, out_features=vocab_size)

# construct LLM output and next token probabilities
llm_output = torch.rand((seq_len, d))
logits = ntp_head(llm_output)
ntp_probs = F.softmax(logits, dim=-1)</code></pre><p>During training, <em>we know what the actual next token is within the sequence</em>. So, we can train our model using a <a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html">cross-entropy loss</a> applied to the probability of the correct next token in the sequence. This training loss is implemented below, where the ground truth next tokens at each position are stored in the <code>target</code> vector. Here, we provide logits as input because PyTorch already applies softmax internally within its implementation of cross-entropy.</p><pre><code># next token prediction (cross-entropy) loss
targets = torch.randint(0, vocab_size, (seq_len,))
loss = F.cross_entropy(logits, targets)</code></pre><p>The key idea behind knowledge distillation is deriving our target from another LLM instead of ground truth. Keeping everything else fixed, we can generate output with two LLMs&#8212;<em>a student and a teacher</em>&#8212;and use the teacher&#8217;s output as the target&#8212;<em>instead of the ground truth</em>&#8212;for training the student.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MCm4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MCm4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 424w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 848w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MCm4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png" width="551" height="396.5989010989011" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:551,&quot;bytes&quot;:749255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MCm4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 424w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 848w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!MCm4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F977c55cd-7f51-46be-bf3d-221b5fb7915f_1620x1166.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arxiv.org/abs/2207.10666">source</a>)</figcaption></figure></div><p><strong>Practical examples of distillation.</strong> Knowledge distillation was proposed in the context of deep learning in [15] and has been heavily used ever since; e.g., pre-ChatGPT examples of distillation include <a href="https://arxiv.org/abs/1910.01108">DistilBERT</a> and <a href="https://arxiv.org/abs/2207.10666">TinyViT</a>. Distillation is also heavily used in the context of LLMs. For example, DeepSeek-v3 [16] uses the DeepSeek-R1 reasoning model [17] as a teacher during pretraining. Additionally, knowledge distillation is used to create a suite of dense reasoning models of various sizes using the massive, MoE-based DeepSeek-R1 model as a teacher. Beyond these open examples, similar strategies for distillation and <a href="https://www.interconnects.ai/p/llm-synthetic-data\">synthetic data</a> are almost certainly used for training the top closed LLMs as well. Such trends likely encouraged Meta to adopt similar approaches for Llama 4 training. </p><p><strong>Hard vs. soft distillation.</strong> There are two main variants of knowledge distillation: <em>hard and soft distillation</em>. Hard distillation is very similar to our original training objective. We simply <em>i)</em> derive a one-hot label from the teacher LLM&#8217;s output by selecting the highest-probability token, <em>ii)</em> treat this one-hot label as the ground truth target and <em>iii)</em> apply the same cross-entropy loss; see below.</p><pre><code>temperature = 1.0  # softmax temperature
scaling_factor = 1.0

# student forward pass
llm_output = torch.rand((seq_len, d))
logits = ntp_head(llm_output)

# teacher forward pass
teacher_output = torch.rand((seq_len, d))
teacher_logits = ntp_head(teacher_output)
teacher_ntp_probs = F.softmax(teacher_logits / temperature, dim=1)

# different distillation losses
teacher_one_hot = torch.argmax(teacher_logits, dim=1)
hard_loss = F.cross_entropy(logits, teacher_one_hot)
soft_loss = F.cross_entropy(logits, teacher_ntp_probs)
hybrid_loss = hard_loss + scaling_factor * soft_loss</code></pre><p>However, there is a lot of potentially useful information contained within the full probability distribution predicted by the teacher model that we lose by creating the hard distillation target. Instead, we could use the entire distribution from the teacher as a training signal&#8212;<em>this is known as soft (or dense) distillation</em>. Such a soft distillation loss can be implemented as shown above<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. Within soft distillation, we can also tweak the <a href="https://arxiv.org/abs/2502.20604">softmax temperature</a> used to create the teacher&#8217;s predicted distribution of token probabilities as a training hyperparameter.</p><p>Whether to use hard or soft distillation depends on a variety of factors. For example, if we are using a closed LLM as our teacher, we may not have access to to the teacher&#8217;s logprobs, which prevents soft distillation. Assuming a powerful teacher, however, soft distillation usually provides a more dense or rich signal to the student, which speeds up training and can make the student more robust [15]. <em>We can also use both approaches at the same time by combining them into a single loss</em>.</p><p><strong>Distilling Llama 4.</strong> Llama 4 models use a codistillation approach. The term &#8220;codistillation&#8221; here refers to the fact that both Llama 4 Maverick and Scout are trained using the Behemoth model as a teacher. By distilling multiple models from the larger Behemoth model, we can amortize the cost of forward passes to compute distillation targets during training, which is large&#8212;<em>this is a big model</em>! Authors mention in [1] that this codistillation strategy&#8212;<em>that uses a combination of hard and soft targets</em>&#8212;boosts the performance of both models.</p><blockquote><p><em>&#8220;We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics. We developed a novel distillation loss function that dynamically weights the soft and hard targets through training.&#8221;</em> - from Llama 4 blog [1]</p></blockquote><p>As stated above, the distillation strategy used by Llama 4 is dynamic&#8212;<em>the balance between hard and soft targets changes throughout training.</em> Practically, we can implement this by modifying the <code>scaling_factor</code> in the above code. Although the exact strategy is not revealed in [1], it is likely that the training process begins by using hard targets and emphasizes soft targets later in training, <em>thus slowly increasing the density of information to which the LLM is exposed</em>. This is a common form of <a href="https://en.wikipedia.org/wiki/Curriculum_learning">curriculum learning</a>, where the LLM first learns from easier data and is gradually exposed to harder data over time; e.g., see <a href="https://arxiv.org/abs/2405.07490">here</a>. </p><h2>Llama 4 Performance and Capabilities</h2><p>LLM development is an empirically-driven and iterative process. To develop a powerful LLM, we tweak the model and build robust evaluation systems so that meaningful changes can be detected. Applying enough positive changes over time leads to a better model<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. In contrast, Llama 4 makes many significant changes to the model at once&#8212;<em>this was a complete (and risky) pivot in research direction</em>. As we will see in this section, Llama 4 models are not state-of-the-art, and their performance was heavily criticized. However, this does not mean that the changes made by Llama 4 were a mistake. In fact, the approach taken by Llama 4 is inspired by many successful and popular LLMs. The long term success of Llama will be determined by the team&#8217;s ability to iterate and improve upon current state.</p><h4>Reported Performance</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H_KL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H_KL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 424w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 848w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H_KL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png" width="1456" height="792" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:348006,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H_KL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 424w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 848w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!H_KL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1a524f-fcf9-4a8b-b1c3-cef397b0e9c8_2242x1220.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YdsU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YdsU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 424w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 848w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YdsU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png" width="1456" height="1109" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1109,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:547951,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YdsU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 424w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 848w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!YdsU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2e4ca8a-6416-4b31-b93f-0c2076c7e8e3_1704x1298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The details of benchmarks reported for Llama 4 models in [1] are summarized in the tables above, where both Llama 4 Maverick and Scout are compared to other similar models&#8212;<em>both open and closed</em>&#8212;on various tasks of interest. From these metrics, we see that Llama 4 models:</p><ul><li><p>Perform well on image-based document understanding tasks, likely due to the inclusion of <a href="https://cameronrwolfe.substack.com/i/158954054/extending-llama-to-images-and-video">synthetic structured images</a> (e.g., charts, graphs and documents) in their training process. </p></li><li><p>Have strong image understanding capabilities due to their natively multi-modal training process and early fusion architecture. </p></li><li><p>Are more multi-lingual&#8212;<em>meaning that more languages are supported and performance on supported languages is better</em>&#8212;than prior Llama model iterations, as well as some closed models like GPT-4o. </p></li><li><p>Have promising long-context capabilities, either matching or exceeding those of industry leading models like Gemini 2.0 Flash (1M token context length). </p></li></ul><p>The Llama 4 Maverick model also achieves an impressive <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo score</a> of 1417 on <a href="https://blog.lmarena.ai/about/">LMArena</a>, which places it among the <a href="https://lmarena.ai/?leaderboard">top models on the leaderboard</a> at the time of writing. However, these results were measured with an <em>&#8220;experimental chat version&#8221;</em> of the model that differs from the model actually used for evaluation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aq4I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aq4I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 424w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 848w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 1272w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aq4I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png" width="1456" height="733" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100167,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:&quot;Image&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!Aq4I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 424w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 848w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 1272w, https://substackcdn.com/image/fetch/$s_!Aq4I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87bf9d8b-945e-40c8-8989-d8dcfa5b8ec0_1830x921.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://x.com/AIatMeta/status/1908618302676697317">source</a>)</figcaption></figure></div><p>This change caused tons of <a href="https://x.com/TheXeophon/status/1908900306580074741">confusion</a> and <a href="https://x.com/natolambert/status/1908895656535871936">discussion</a> online. The LMArena result was a key part of the Llama 4 release, so using a specialized model for this single evaluation was perceived as misleading (and even a bit duplicitous). </p><blockquote><p><em>&#8220;Meta&#8217;s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference. We are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn&#8217;t occur in the future.&#8221;</em> - <a href="https://x.com/lmarena_ai/status/1909397817434816562">LMArena statement</a></p></blockquote><p>To further analyze the long context capabilities of Llama 4 models, authors in [1] also present the results of <a href="https://github.com/gkamradt/LLMTest_NeedleInAHaystack">needle in a haystack</a> tests for each model, finding that Llama 4 models are able to retrieve information from contexts up to 1M tokens (for Maverick) and 10M tokens (for Scout); see below. However, this style of long context testing only measures retrieval abilities, which do not guarantee that the model is capable of leveraging its entire context for problem solving.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w1I7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w1I7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 424w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 848w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 1272w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w1I7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png" width="1456" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w1I7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 424w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 848w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 1272w, https://substackcdn.com/image/fetch/$s_!w1I7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdc27709-6f00-4b08-a382-c06ad21d29ba_2096x929.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>No modern long context benchmarks (e.g., <a href="https://arxiv.org/abs/2406.10149">BABILong</a>, <a href="https://arxiv.org/abs/2404.06654">RULER</a> or <a href="https://arxiv.org/abs/2502.05167">NoLiMa</a>) are used for evaluating Llama 4, making the long context abilities of these models&#8212;<em>one of their key distinguishing features</em>&#8212;somewhat questionable. We also see from these metrics that Llama 4 models are not especially strong on coding tasks and&#8212;<em>despite being strong &#8220;non-reasoning&#8221; models</em>&#8212;are not compared to <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">reasoning models like DeepSeek-R1</a> or the <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o-series of OpenAI models</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. As we will see, the negatives do not stop here. Llama 4 models were harshly criticized after their release and public evaluations revealed many gaps in their performance.</p><h4>Public Reaction and Criticism</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zqu8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zqu8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 424w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 848w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zqu8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png" width="598" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1158,&quot;width&quot;:1158,&quot;resizeWidth&quot;:598,&quot;bytes&quot;:677288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zqu8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 424w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 848w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!Zqu8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c212390-6df2-4b3d-b210-cf75d80d25af_1158x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Llama -4 performance on public coding benchmarks (<a href="https://x.com/terryyuezhuo/status/1909247540379148439">source</a>, <a href="https://x.com/paulgauthier/status/1908976568879476843">source</a>)</figcaption></figure></div><p><strong>Public evaluation.</strong> Immediately after the release of Llama 4, researchers began independently evaluating the models, and the findings were highly variable. For coding tasks, Llama 4 models definitely left something to be desired:</p><ul><li><p>Neither of the Llama 4 models place within the top-40 models on the BigCodeBench leaderboard [<a href="https://x.com/terryyuezhuo/status/1909247540379148439">link</a>]</p></li><li><p>Llama 4 Maverick achieves a completion accuracy of only 16% on the <a href="https://aider.chat/docs/leaderboards/">Aider Polyglot benchmark</a> (state of the art is ~80%) [<a href="https://x.com/paulgauthier/status/1908976568879476843">link</a>]</p></li><li><p>Some users anecdotally published very harsh takes on the coding abilities of Llama 4 models, seeming to indicate that coding abilities were almost completely neglected in this model release [<a href="https://www.reddit.com/r/LocalLLaMA/comments/1jsl37d/im_incredibly_disappointed_with_llama4/">link</a>]</p></li></ul><p>These results are especially difficult to parse given that Llama 4 models do not perform poorly on all coding benchmarks; e.g., the metrics on <a href="https://arxiv.org/abs/2403.07974">LiveCodeBench</a> in [1] seem to indicate reasonable coding performance. </p><p>Additionally, the long context abilities of Llama 4 models were less impressive during public evaluation; e.g., performance on the long context portion of <a href="https://livebench.ai/#/">LiveBench</a>&#8212;<em>a dataset with minimal data contamination</em>&#8212;was <a href="https://www.reddit.com/r/LocalLLaMA/comments/1jsx7m2/fictionlivebench_for_long_context_deep/">poor</a>. These results highlight a deeper issue with retrieval-based long context evaluations (e.g., needle in a haystack). Just because the model can retrieve information in its context does not mean it can actually leverage its entire context for problem solving.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tIJR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tIJR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 424w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 848w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tIJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png" width="1108" height="1560" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1560,&quot;width&quot;:1108,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:205372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/161016210?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tIJR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 424w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 848w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!tIJR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feff7f0f4-1134-4d4b-a039-8bcc29faca67_1108x1560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://deepmind.google/technologies/gemini/pro/">source</a>)</figcaption></figure></div><p>Researchers also noted that <a href="https://deepmind.google/technologies/gemini/pro/">Gemini-2.5 Pro</a> decisively outperforms even the largest Llama 4 Behemoth model on most key benchmarks; see above. </p><p><strong>Public perception of Llama 4.</strong> The disconnect between Llama 4&#8217;s reported metrics and public evaluation results created a lot of speculation and frustration within the AI research community, even leading to <a href="https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/">false claims</a> that testing data was purposefully included in Llama 4&#8217;s training dataset to inflate benchmark scores. These claims were quickly <a href="https://x.com/Ahmad_Al_Dahle/status/1909302532306092107">denied by Meta executives</a>, who emphasized that fluctuations in model performance are due to implementation differences within the model itself, quantization strategies for inference, and more. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4uBW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4uBW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4uBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg" width="1456" height="569" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:168635,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4uBW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4uBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc80fa15-7c41-484b-bbbc-4a11a01ce436_1971x770.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md">source</a>)</figcaption></figure></div><p>Nonetheless, the confused aura around the release of Llama 4 remains. Many aspects of this release were <a href="https://www.interconnects.ai/p/llama-4">seemingly rushed</a>, beginning with the random decision to release the model on a Saturday instead of the following Monday. </p><h2>The Future of Llama</h2><p>The release of Llama 4 was received poorly by the AI research community. Now that we have a deep understanding of these models, however, we see that the story behind Llama 4 is nuanced. Relative to Llama 3, these new Llama models modify&#8212;<em>or completely reinvent</em>&#8212;nearly every component of the model:</p><ul><li><p>MoE-based model architecture.</p></li><li><p>Different approach to multi-modality (early fusion).</p></li><li><p>Natively multi-modal pretraining.</p></li><li><p>Emphasis on model distillation during pretraining.</p></li><li><p>Completely different post-training pipeline.</p></li><li><p>Focus on long context capabilities. </p></li></ul><p>The open LLM landscape is becoming more competitive with the success of <a href="https://api-docs.deepseek.com/news/news1226">DeepSeek-v3</a>, <a href="https://arxiv.org/abs/2412.15115">Qwen-2.5</a> and more. With the release of Llama 4, Meta both responded to this competition and made clear their goal of creating a frontier-level Llama model. Llama 4 does not achieve this goal, but this should not come as a surprise. Meta took an (obvious) risk&#8212;<em>which may still prove to be the correct choice in the long run</em>&#8212;by pivoting in their research strategy.</p><p><strong>Frontier-Level Llama Models.</strong> Given the staggering pace of LLM research, the success of Llama is far from guaranteed, and Meta has a lot of work to do after falling short with Llama 4. To create a frontier-level LLM, Meta needs to iterate and improve upon their models more quickly. Those who work closely with Llama models might have noticed that the amount of time between major Llama releases has been slowly increasing:</p><ul><li><p><a href="https://arxiv.org/abs/2302.13971">Llama</a> was released in February 2023.</p></li><li><p><a href="https://arxiv.org/abs/2307.09288">Llama 2</a> was released in July 2023. </p></li><li><p><a href="https://arxiv.org/abs/2407.21783">Llama 3</a> was released in April 2024.</p></li><li><p><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Llama 4</a> was released in April 2025.</p></li></ul><p>This expanding gap is worrying and lags behind top labs; e.g., since January 2024 DeepSeek has released <a href="https://arxiv.org/abs/2401.02954">DeepSeek-v1</a>, <a href="https://arxiv.org/abs/2405.04434">v2</a>, <a href="https://arxiv.org/abs/2412.19437">v3</a> and <a href="https://arxiv.org/abs/2501.12948">R1</a>. Even if the next Llama model is state-of-the-art, new models will be released shortly after. <em>Models will continue to evolve and improve at an uncomfortable pace</em>. The only way forward is to iterate quickly and fix the gaps in evaluation capabilities that led to the huge disconnect between internal and external evaluations of Llama 4.</p><p><strong>The Open LLM Landscape.</strong> Even if Llama models are not state-of-the-art, they can still be successful in the open LLM landscape, where many other factors&#8212;<em>like barrier to entry and ease of use</em>&#8212;are important. To maximize success, Meta must do everything they can to avoid restricting use cases for open LLMs. Most notably, Llama 4 models need to be distilled into a variety of smaller, dense models&#8212;<em>in a similar fashion to DeepSeek-R1 and Qwen-2.5</em>&#8212;to avoid the hardware requirements of massive MoEs. Creating a frontier-level Llama model is an important goal, but it should not come at the cost of deteriorating Meta&#8217;s position in the open LLM landscape. After all, Llama has never been the top-performing LLM. <em>The emphasis upon openness is what made Llama successful in the first place</em>.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Senior Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. If you like the newsletter, please subscribe, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Meta. &#8220;The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation&#8221; <em><a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">https://ai.meta.com/blog/llama-4-multimodal-intelligence/</a> </em>(2025).</p><p>[2] Grattafiori, Aaron, et al. "The llama 3 herd of models." <em>arXiv preprint arXiv:2407.21783</em> (2024).</p><p>[3] Liu, Aixin, et al. "Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model." arXiv preprint arXiv:2405.04434 (2024).</p><p>[4] Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint arXiv:2412.19437 (2024).</p><p>[5] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[6] Team, Chameleon. "Chameleon: Mixed-modal early-fusion foundation models." <em>arXiv preprint arXiv:2405.09818</em> (2024).</p><p>[7] Bavishi, Rohan, et al. "Fuyu-8b: A multimodal architecture for ai agents." <em>URL: https://www. adept. ai/blog/fuyu-8b</em> (2023).</p><p>[8] Xiong, Ruibin, et al. "On layer normalization in the transformer architecture." <em>International conference on machine learning</em>. PMLR, 2020.</p><p>[9] Xu, Hu, et al. "Demystifying clip data." <em>arXiv preprint arXiv:2309.16671</em> (2023).</p><p>[10] Blakeney, Cody, et al. "Does your data spark joy? Performance gains from domain upsampling at the end of training." <em>arXiv preprint arXiv:2406.03476</em> (2024).</p><p>[11] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." <em>Neurocomputing</em> 568 (2024): 127063.</p><p>[12] Kazemnejad, Amirhossein, et al. "The impact of positional encoding on length generalization in transformers." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 24892-24928.</p><p>[13] Nakanishi, Ken M. "Scalable-Softmax Is Superior for Attention." <em>arXiv preprint arXiv:2501.19399</em> (2025).</p><p>[14] Lu, Yi, et al. "A controlled study on long context extension and generalization in llms." <em>arXiv preprint arXiv:2409.12181</em> (2024).</p><p>[15] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." <em>arXiv preprint arXiv:1503.02531</em> (2015).</p><p>[16] Liu, Aixin, et al. "Deepseek-v3 technical report." <em>arXiv preprint arXiv:2412.19437</em> (2024).</p><p>[17] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[18] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." <em>Advances in Neural Information Processing Systems</em> 36 (2023): 53728-53741.</p><p>[19] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." <em>arXiv preprint arXiv:2405.08448</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Given that coarse-grained experts are the same size as the original feed-forward layer from the transformer, the full model with a single expert usually matches the size of a standard dense LLM. As such, this model with a single expert is typically the perfect size to fit into a single GPU!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>As stated in the Llama 4 blog [1]: <em>&#8220;While all parameters are stored in memory, only a subset of the total parameters are activated while serving these models.&#8221;</em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>70B models such as Llama 3 70B fit in a single H100 GPU (80Gb memory) with int8 quantization. To fit the larger Scout model (with 109B total parameters) into the same GPU, we must adopt a more aggressive int4 quanitzation scheme. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Qwen models take a similar approach as well. For example, <a href="https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e">Qwen-2.5</a> has seven different models ranging from 0.5B to 72B parameters. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>In the figure, we concatenate image and token embeddings left-to-right. However, image and token embeddings can be arbitrarily interleaved in the input sequence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This modification is made to avoid a specific kind of training instability observed in [6] where inputs to the softmax in the attention mechanism (i.e., the query and key vectors) slowly grow in magnitude throughout the later stage of training, eventually leading to numerical instabilities that cause training to diverge.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The relative position between tokens can be more meaningful than absolute position, as many tasks (e.g., translation or summarization) require developing an understanding of the relationships between tokens. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>We can adjust each entry of the frequency basis non-uniformly! For example, <a href="https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/">NTK-RoPE</a> maintains the frequency of tokens that are close together but applies a larger adjustment to tokens that are further apart. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>In traditional RL research, this means that we are generating data on-the-fly with the exact model that we are currently training. For LLMs, this requirement is slightly relaxed to encompass data collected using the model for the current phase of post-training; see <a href="https://rlhfbook.com/c/03-setup.html">here</a> for more details. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The model is not formally released in [1]. Authors claim that the model was still training at the time of writing. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>There are many alternative ways of implementing soft distillation as well. For example, we could use the <a href="https://pytorch.org/docs/stable/generated/torch.nn.functional.kl_div.html">KL divergence</a> or <a href="https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html">mean-squared error</a> as a loss function. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>For example, various versions of Llama 3 (e.g., <a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/">Llama 3.2</a> and <a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/">Llama 3.3</a>) were released shortly after Llama 3 that all make relatively minor modifications to the model to optimize its performance. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>This change in model caused tons of <a href="https://x.com/TheXeophon/status/1908900306580074741">confusion</a> and <a href="https://x.com/natolambert/status/1908895656535871936">discussion</a> online. The LMArena result was a key part of the Llama 4 release, so many perceived using a specialized model for this evaluation as misleading (possibly even duplicitous). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Separating reasoning and non-reasoning models is difficult because LLMs tend to lie on a continuous spectrum of reasoning capabilities. Many researchers are advocating to separate the evaluation of reasoning and non-reasoning tasks, instead of trying to distinguish between reasoning and non-reasoning models; see <a href="https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance">here</a>. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Vision Large Language Models (vLLMs)]]></title><description><![CDATA[Teaching LLMs to understand images and videos in addition to text...]]></description><link>https://cameronrwolfe.substack.com/p/vision-llms</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/vision-llms</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 31 Mar 2025 09:34:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/12372b06-0850-4b33-b8a8-dd01dd5662fb_2208x1218.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3rTz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3rTz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 424w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 848w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 1272w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3rTz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png" width="1456" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:799306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3rTz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 424w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 848w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 1272w, https://substackcdn.com/image/fetch/$s_!3rTz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18346de8-f6b0-447e-a759-3bf2488f55e0_2270x1218.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After the popularization of text-based large language models (LLMs), one of the most important questions within the research community was how we could extend such powerful models to understand other modalities of data (e.g., images, video or speech). Research on multi-modal LLMs is promising for several reasons:</p><ul><li><p>Improving model capabilities. </p></li><li><p>Uncovering new sources of training data.</p></li><li><p>Expanding the scope of problems that LLMs can solve.</p></li></ul><p>Recently, vision-based LLMs&#8212;<em>or vLLMs for short, these are LLMs that can ingest images and videos as input in addition to text</em>&#8212;have become more popular. For example, most recent OpenAI models support visual inputs, and Meta has released a vision-based variant of LLAMA-3, called LLaMA-3.2 Vision. In this overview, we will aim to understand how vLLMs work from first principles, starting with basic concepts and eventually studying how LLaMA-3.2 Vision is practically implemented. As we will learn, vLLMs&#8212;<em>despite their impressive capabilities</em>&#8212;are not actually much different than text-based LLMs. </p><h2>The Building Blocks of vLLMs</h2><p>To fully understand vLLMs, we need to start from the beginning. In this section, we will cover some of the fundamental concepts used to build these models, including ideas like cross-attention and encoders for images and video. We will (mostly) assume knowledge of the basic concepts behind text-based LLMs, such as a high-level understanding of the transformer architecture. However, readers who are unfamiliar with these concepts can find more details at the link below.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;896acff1-a91d-4f65-b0d4-d40a8fbc0644&quot;,&quot;caption&quot;:&quot;The current pace of AI research is staggering. Keeping up with the most recent publications is a difficult feat, leaving even experts in the field feeling as if they are failing to grasp the finer details of this evolving frontier. In the domain of large language models (LLMs) especially, impactful research is being released constantly, inc&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Decoder-Only Transformers: The Workhorse of Generative LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-04T09:33:07.426Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e3c9db5-400a-49de-a235-e09bc3aa3689_2392x1342.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142044446,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:110,&quot;comment_count&quot;:14,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Cross-Attention (and Transformers)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qc6a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qc6a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 424w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 848w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png" width="364" height="443.82456140350877" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1112,&quot;width&quot;:912,&quot;resizeWidth&quot;:364,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qc6a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 424w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 848w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p>The <a href="https://jalammar.github.io/illustrated-transformer/">transformer architecture</a> [8] is used universally within language modeling research. In its original form, the transformer architecture has two components: <em>an encoder and a decoder</em>. As shown above, the encoder and decoder contain repeated blocks of:</p><ol><li><p><em>Self-attention</em>: transforms each token vector based on the other tokens that are present in the sequence.</p></li><li><p><em>Feed-forward transformation</em>: transforms each token vector individually.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LE12!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LE12!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 424w, https://substackcdn.com/image/fetch/$s_!LE12!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 848w, https://substackcdn.com/image/fetch/$s_!LE12!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 1272w, https://substackcdn.com/image/fetch/$s_!LE12!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LE12!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png" width="1456" height="718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166571,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LE12!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 424w, https://substackcdn.com/image/fetch/$s_!LE12!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 848w, https://substackcdn.com/image/fetch/$s_!LE12!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 1272w, https://substackcdn.com/image/fetch/$s_!LE12!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a4634e6-3d72-4053-84bf-84bef43101d5_1630x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Decoder-only transformer architecture</figcaption></figure></div><p>The <strong>decoder-only transformer</strong> is the variant of the transformer architecture that is most commonly used by <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT-style</a> (generative) LLMs. Most vLLMs also utilize a decoder-only architecture, but additional modules are added to the architecture to handle vision-based inputs. Put simply, this architecture is the same as the transformer, but it has no encoder component&#8212;<em>hence the name &#8220;decoder-only&#8221;</em>. </p><p><strong>Original decoder.</strong> The decoder-only transformer only has masked self-attention and a feed-forward transformation in each of its blocks. However, the decoder from the original transformer architecture has an extra cross-attention module in each of its blocks. Self-attention computes attention over the tokens in a single sequence. In contrast, cross-attention considers two sequences of tokens&#8212;<em>the tokens from the encoder and the tokens from the decoder&#8212;</em>and computes attention between these two sequences<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. By doing this, we allow the decoder to consider the representations produced by the encoder when generating its output! Let&#8217;s first try to understand self-attention, then we will cover cross-attention. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T5kl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T5kl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 424w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 848w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1272w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png" width="1949" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1949,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c29bd26-64ca-498a-af30-f9dce60201c1_1966x980.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T5kl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 424w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 848w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1272w, https://substackcdn.com/image/fetch/$s_!T5kl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8fec4d1-3b72-4e17-8a01-2e7c4f3b7a5c_1949x930.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Self-attention at a glance.</strong> The input to a self-attention mechanism is a sequence of token vectors. Self-attention forms an output representation for each token by considering all other tokens in the sequence. To do this, the self-attention operation creates three separate linear projections&#8212;<em>called the keys, queries, and values</em>&#8212;of the token vectors. As shown above, we can then use the keys and queries to compute an attention score between every pair of tokens in the sequence. This attention score captures how important each token is to every other token in the sequence&#8212;<em>or how much some token should &#8220;pay attention to&#8221; another token</em>. We can multiply these attention scores by the values to obtain our final output. A basic implementation of self-attention is provided below<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist137030626\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-bidir_self_attn-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;bidir_self_attn.py content, created by wolfecameron on 09:54PM on March 24.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;bidir_self_attn.py\&quot;>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>math</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>SelfAttention</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>d</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>=</span> <span class=pl-s1>d</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># key, query, value projections for all heads, but in a batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># output is 3X the dimension because it includes key, query and value</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_attn</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-c1>3</span><span class=pl-c1>*</span><span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute query, key, and value vectors in batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># split the output into separate query, key, and value tensors</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>q</span>, <span class=pl-s1>k</span>, <span class=pl-s1>v</span>  <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>c_attn</span>(<span class=pl-s1>x</span>).<span class=pl-c1>split</span>(<span class=pl-s1>self</span>.<span class=pl-c1>d</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>2</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute the attention matrix and apply dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> (<span class=pl-s1>q</span> @ <span class=pl-s1>k</span>.<span class=pl-c1>transpose</span>(<span class=pl-c1>-</span><span class=pl-c1>2</span>, <span class=pl-c1>-</span><span class=pl-c1>1</span>)) <span class=pl-c1>*</span> (<span class=pl-c1>1.0</span> <span class=pl-c1>/</span> <span class=pl-s1>math</span>.<span class=pl-c1>sqrt</span>(<span class=pl-s1>k</span>.<span class=pl-c1>size</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>)))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softmax</span>(<span class=pl-s1>att</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute output vectors for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>y</span> <span class=pl-c1>=</span> <span class=pl-s1>att</span> @ <span class=pl-s1>v</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-bidir_self_attn-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-bidir_self_attn-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>y</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/a809ef1e9bd176344ab59303c3e00389/raw/efb77715beacf745fa2f72e3fb10a1ccc21c8757/bidir_self_attn.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/a809ef1e9bd176344ab59303c3e00389#file-bidir_self_attn-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          bidir_self_attn.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-04c27bb90e5b.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-04c27bb90e5b.css"><div id="gist137030626" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-bidir_self_attn-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="bidir_self_attn.py">
        <tbody><tr>
          <td id="file-bidir_self_attn-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-bidir_self_attn-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">math</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-bidir_self_attn-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-bidir_self_attn-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-bidir_self_attn-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-bidir_self_attn-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-bidir_self_attn-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">SelfAttention</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-bidir_self_attn-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-bidir_self_attn-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">d</span>):</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-bidir_self_attn-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-bidir_self_attn-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-bidir_self_attn-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-bidir_self_attn-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-bidir_self_attn-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-bidir_self_attn-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">=</span> <span class="pl-s1">d</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-bidir_self_attn-py-LC15" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-bidir_self_attn-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># key, query, value projections for all heads, but in a batch</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-bidir_self_attn-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># output is 3X the dimension because it includes key, query and value</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-bidir_self_attn-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_attn</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-c1">3</span><span class="pl-c1">*</span><span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-bidir_self_attn-py-LC19" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-bidir_self_attn-py-LC20" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-bidir_self_attn-py-LC21" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute query, key, and value vectors in batch</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-bidir_self_attn-py-LC22" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># split the output into separate query, key, and value tensors</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-bidir_self_attn-py-LC23" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">q</span>, <span class="pl-s1">k</span>, <span class="pl-s1">v</span>  <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">c_attn</span>(<span class="pl-s1">x</span>).<span class="pl-c1">split</span>(<span class="pl-s1">self</span>.<span class="pl-c1">d</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">2</span>)</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-bidir_self_attn-py-LC24" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-bidir_self_attn-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute the attention matrix and apply dropout</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-bidir_self_attn-py-LC26" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> (<span class="pl-s1">q</span> @ <span class="pl-s1">k</span>.<span class="pl-c1">transpose</span>(<span class="pl-c1">-</span><span class="pl-c1">2</span>, <span class="pl-c1">-</span><span class="pl-c1">1</span>)) <span class="pl-c1">*</span> (<span class="pl-c1">1.0</span> <span class="pl-c1">/</span> <span class="pl-s1">math</span>.<span class="pl-c1">sqrt</span>(<span class="pl-s1">k</span>.<span class="pl-c1">size</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>)))</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-bidir_self_attn-py-LC27" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softmax</span>(<span class="pl-s1">att</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>)</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-bidir_self_attn-py-LC28" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-bidir_self_attn-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute output vectors for each token</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-bidir_self_attn-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">y</span> <span class="pl-c1">=</span> <span class="pl-s1">att</span> @ <span class="pl-s1">v</span></td>
        </tr>
        <tr>
          <td id="file-bidir_self_attn-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-bidir_self_attn-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">y</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/a809ef1e9bd176344ab59303c3e00389/raw/efb77715beacf745fa2f72e3fb10a1ccc21c8757/bidir_self_attn.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/a809ef1e9bd176344ab59303c3e00389#file-bidir_self_attn-py" class="Link--inTextBlock">
          bidir_self_attn.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>How does cross-attention work?</strong> A schematic depiction of cross-attention is provided below. As we can see, this module is not much different than self-attention. The key difference here is in the initial linear projections used to compute the key, query and value matrices. Instead of computing all three of these matrices by linearly projecting a single sequence of token vectors, we linearly project two different sequences of vectors; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vOQ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vOQ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 424w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 848w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vOQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vOQ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 424w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 848w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!vOQ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc806734f-037f-4f6a-8863-b7383964d8ec_2098x1058.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The query matrix is produced by linearly projecting the first sequence, while both key and value matrices are produced by linearly projecting the second sequence. As a result, our attention matrix contains all pairwise attention scores between tokens in the first and second sequence. The length of the sequences need not be equal, and the length of the output will match that of the first sequence.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist137033469\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-cross_attention-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;cross_attention.py content, created by wolfecameron on 02:31AM on March 25.\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;cross_attention.py\&quot;>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>math</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>CrossAttention</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, <span class=pl-s1>d</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>=</span> <span class=pl-s1>d</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># linear projection for producing query matrix</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_q</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># linear projection for producing key / value matrices</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_kv</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-c1>2</span><span class=pl-c1>*</span><span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x_1</span>, <span class=pl-s1>x_2</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute query, key, and value matrices</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>q</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>w_q</span>(<span class=pl-s1>x_1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>k</span>, <span class=pl-s1>v</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>w_kv</span>(<span class=pl-s1>x_2</span>).<span class=pl-c1>split</span>(<span class=pl-s1>self</span>.<span class=pl-c1>d</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>2</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute the attention matrix and apply dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> (<span class=pl-s1>q</span> @ <span class=pl-s1>k</span>.<span class=pl-c1>transpose</span>(<span class=pl-c1>-</span><span class=pl-c1>2</span>, <span class=pl-c1>-</span><span class=pl-c1>1</span>)) <span class=pl-c1>*</span> (<span class=pl-c1>1.0</span> <span class=pl-c1>/</span> <span class=pl-s1>math</span>.<span class=pl-c1>sqrt</span>(<span class=pl-s1>k</span>.<span class=pl-c1>size</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>)))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softmax</span>(<span class=pl-s1>att</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute output vectors for each token in x_1</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>y</span> <span class=pl-c1>=</span> <span class=pl-s1>att</span> @ <span class=pl-s1>v</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-cross_attention-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-cross_attention-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>y</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/5646b2092d41d6d31ec1abb28b3b930a/raw/761cf359329f08286e4f8ae24c31447e79c4259d/cross_attention.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/5646b2092d41d6d31ec1abb28b3b930a#file-cross_attention-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          cross_attention.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-04c27bb90e5b.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-04c27bb90e5b.css"><div id="gist137033469" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-cross_attention-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="cross_attention.py">
        <tbody><tr>
          <td id="file-cross_attention-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-cross_attention-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">math</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-cross_attention-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-cross_attention-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-cross_attention-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-cross_attention-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-cross_attention-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">CrossAttention</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-cross_attention-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-cross_attention-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, <span class="pl-s1">d</span>):</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-cross_attention-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-cross_attention-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-cross_attention-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-cross_attention-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-cross_attention-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-cross_attention-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">=</span> <span class="pl-s1">d</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-cross_attention-py-LC15" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-cross_attention-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># linear projection for producing query matrix</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-cross_attention-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_q</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-cross_attention-py-LC18" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-cross_attention-py-LC19" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># linear projection for producing key / value matrices</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-cross_attention-py-LC20" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_kv</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-c1">2</span><span class="pl-c1">*</span><span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-cross_attention-py-LC21" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-cross_attention-py-LC22" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x_1</span>, <span class="pl-s1">x_2</span>):</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-cross_attention-py-LC23" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute query, key, and value matrices</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-cross_attention-py-LC24" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">q</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">w_q</span>(<span class="pl-s1">x_1</span>)</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-cross_attention-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">k</span>, <span class="pl-s1">v</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">w_kv</span>(<span class="pl-s1">x_2</span>).<span class="pl-c1">split</span>(<span class="pl-s1">self</span>.<span class="pl-c1">d</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">2</span>)</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-cross_attention-py-LC26" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-cross_attention-py-LC27" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute the attention matrix and apply dropout</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-cross_attention-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> (<span class="pl-s1">q</span> @ <span class="pl-s1">k</span>.<span class="pl-c1">transpose</span>(<span class="pl-c1">-</span><span class="pl-c1">2</span>, <span class="pl-c1">-</span><span class="pl-c1">1</span>)) <span class="pl-c1">*</span> (<span class="pl-c1">1.0</span> <span class="pl-c1">/</span> <span class="pl-s1">math</span>.<span class="pl-c1">sqrt</span>(<span class="pl-s1">k</span>.<span class="pl-c1">size</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>)))</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-cross_attention-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softmax</span>(<span class="pl-s1">att</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>)</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-cross_attention-py-LC30" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-cross_attention-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute output vectors for each token in x_1</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-cross_attention-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">y</span> <span class="pl-c1">=</span> <span class="pl-s1">att</span> @ <span class="pl-s1">v</span></td>
        </tr>
        <tr>
          <td id="file-cross_attention-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-cross_attention-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">y</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/5646b2092d41d6d31ec1abb28b3b930a/raw/761cf359329f08286e4f8ae24c31447e79c4259d/cross_attention.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/5646b2092d41d6d31ec1abb28b3b930a#file-cross_attention-py" class="Link--inTextBlock">
          cross_attention.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>An implementation of cross-attention is provided above. As outlined in this implementation, we are no longer computing attention scores between tokens within a single sequence. Rather, we are computing inter-sequence attention scores, <em>thus forming a fused representation of the two input sequences</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SgkQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SgkQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 424w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 848w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SgkQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png" width="1456" height="858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:858,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:425875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SgkQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 424w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 848w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 1272w, https://substackcdn.com/image/fetch/$s_!SgkQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F528bb4b8-06a4-4e44-81a5-c49d7c285f42_2232x1316.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Integrating image encoder features into an LLM with cross-attention</figcaption></figure></div><p><strong>Application to vLLMs.</strong> Our explanation of cross-attention might seem random at this point in the overview. As we will see, however, cross-attention is used constantly in multi-modal LLM research. We can use cross-attention to fuse image representations produced by a vision model into a text-based LLM; see above. In other words, we can incorporate visual information into an LLM as it generates its output, allowing the model to ingest and interpret images (or other modalities of data) as input in addition to just text!</p><h4><a href="https://arxiv.org/abs/2010.11929">Vision Transformers (ViT)</a> [3]</h4><blockquote><p><em>&#8220;We apply a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer.&#8221;</em> - from [3]</p></blockquote><p>Although the transformer (and its many variants like <a href="https://arxiv.org/abs/1810.04805">BERT</a> and <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>) were first proposed for natural language processing applications, this influential model architecture has since been expanded to applications in the computer vision domain. The Vision Transformer [3] (or ViT for short) is the most commonly used architecture today. As shown in the figure below, this architecture looks very similar to an <a href="https://cameronrwolfe.substack.com/p/language-understanding-with-bert">encoder-only (BERT-style) transformer architecture</a>. We simply take a sequence of vectors as input and apply a sequence of transformer blocks that contain both <em>i)</em> bidirectional self-attention and <em>ii)</em> a feed-forward transformation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yuok!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yuok!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yuok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png" width="1456" height="994" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yuok!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 424w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 848w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!Yuok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e2d7b7-26a1-4e68-939d-caaab3f133d9_1758x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard Vision Transformer (ViT) architecture</figcaption></figure></div><p><strong>Handling input images.</strong> The input for a vision transformer is an image. In order to pass this image as input to our transformer, however, we need to convert the image into a list of vectors&#8212;<em>resembling a sequence of textual token vectors</em>. For ViTs, we do this by segmenting an image into a set of patches and flattening each patch into a vector. From here, these vectors may not be of the same size expected by the transformer, so we just linearly project them into the correct dimension.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5BTM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5BTM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 424w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 848w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 1272w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5BTM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png" width="1078" height="662" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1078,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5BTM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 424w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 848w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 1272w, https://substackcdn.com/image/fetch/$s_!5BTM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b21b27b-e8a1-4542-b323-c3c17abbe379_1078x662.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p>Similarly to a normal transformer, we add positional embeddings to the vector for each patch. Here, the positional embedding captures the 2D position of each patch within an image. The output of this transformer architecture is a sequence of vectors for each patch that is of the same size as the input. To solve tasks like image classification, we can just add an additional classification module (e.g., a linear layer) to the end of this model, as shown in the figure above. </p><p><strong>Why the encoder? </strong>We use an encoder-only transformer architecture for the ViT, instead of the decoder-only transformer architecture that is used by most LLMs. <em>The reason for this is that the ViT is not generative</em>. For LLMs, we train the model via <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next token prediction</a> to generate sequences of text. As a result, we need to use masked self-attention in each transformer layer so that the model cannot look forward in the sequence at future tokens. Otherwise, the model would be able to cheat when predicting the next token! In contrast, the ViT should be able to look at the entire sequence of image patches to form a good representation of the image&#8212;<em>we do not need to predict the next patch in this input sequence</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pkaz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pkaz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 424w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 848w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 1272w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pkaz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png" width="464" height="126.1978021978022" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1456,&quot;resizeWidth&quot;:464,&quot;bytes&quot;:78314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pkaz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 424w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 848w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 1272w, https://substackcdn.com/image/fetch/$s_!Pkaz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6013a7d8-f5e8-4b43-a15f-8690e0bbe93c_1516x412.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><p><strong>Training ViT.</strong> The original ViT model in [3] shares the same architecture as BERT. As shown above, multiple sizes of ViT are trained, the largest of which is ViT-H (or ViT-Huge)&#8212;<em>we will see this model again later in the overview</em>. All ViT models are trained using supervised image classification on datasets of varying sizes. When ViTs are trained over small or mid-sized datasets (e.g., ImageNet), they perform comparably to&#8212;<em>or slightly worse than</em>&#8212;ResNets<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> of comparable size. However, ViTs begin to shine when pretrained over much larger datasets (e.g., <a href="https://paperswithcode.com/dataset/jft-300m">JFT-300M</a>) and finetuned afterwards on downstream tasks; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iQUR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iQUR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 424w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 848w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iQUR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png" width="1456" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:735797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iQUR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 424w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 848w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!iQUR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a07f5db-8f99-4901-a703-5f64d5dac7c1_2078x1142.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [3])</figcaption></figure></div><h4><a href="https://arxiv.org/abs/2103.00020">Contrastive Language-Image Pre-Training (CLIP)</a> [4]</h4><p>The standard ViT is trained over a large dataset of supervised image classification examples. These models perform best when pretrained over a massive volume of annotated (usually by humans) data, which is difficult and expensive to obtain. In [4], authors explore an alternative approach that uses image-caption pairs, which are more readily available online, to train a powerful image representation model. This approach is called Contrastive Language-Image Pre-Training (CLIP). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QfFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QfFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 424w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 848w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 1272w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QfFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png" width="1456" height="686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:661036,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QfFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 424w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 848w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 1272w, https://substackcdn.com/image/fetch/$s_!QfFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e019441-7531-4d38-8a45-7e523dabebac_3284x1548.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p><strong>CLIP architecture.</strong> The CLIP model is made up of two independent components: an image encoder and a text encoder. Given an image-text pair as input, we pass these inputs separately to their corresponding encoder to get an associated vector representation. The image encoder is a standard ViT model [3], whereas the text encoder is a <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformer</a> (i.e., a typical GPT-style LLM). CLIP&#8217;s text encoder is not used to generate text (at least in [4]), but the authors use a decoder-only architecture to simplify the extension of CLIP to generative applications in the future. A depiction of CLIP&#8217;s architecture is provided above. </p><blockquote><p><em>&#8220;The simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.&#8221;</em> - from [4]</p></blockquote><p><strong>Contrastive learning.</strong> There are many ways that we could approach training the CLIP model described above. For example, we could classify the images based on the words in the caption [5] or use the LLM component of the architecture to generate captions based on the image [6]. However, these objectives were found in prior work to either perform poorly or cause the model to learn slowly. The key contribution of [4] is the idea of using a simple and efficient training objective&#8212;  <em>based upon ideas from <a href="https://lilianweng.github.io/posts/2021-05-31-contrastive/">contrastive learning</a></em>&#8212;to learn from image-text pairs. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!67ZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!67ZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 424w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 848w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 1272w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!67ZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png" width="368" height="183.14814814814815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:430,&quot;width&quot;:864,&quot;resizeWidth&quot;:368,&quot;bytes&quot;:173069,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!67ZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 424w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 848w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 1272w, https://substackcdn.com/image/fetch/$s_!67ZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d7c264c-a19a-43f4-b867-353f68581cbe_864x430.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A schematic depiction of CLIP training objective</figcaption></figure></div><p>More specifically, CLIP is trained using the simple task of classifying the correct caption for an image among a group of candidate captions (i.e., all other captions within a training batch). Practically, this objective is implemented by:</p><ol><li><p>Passing a group of images and textual captions through their respective encoders (i.e., the ViT for images and the LLM for text). </p></li><li><p>Maximizing the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> between image and text embeddings (obtained from the encoders) of the true image-caption pairs.</p></li><li><p>Minimizing the cosine similarity between all other image-caption pairs. </p></li></ol><p>This objective is referred to as a <a href="https://github.com/RElbers/info-nce-pytorch">multi-class N-pair (or InfoNCE) loss</a> [7] and is commonly used in the contrastive and metric learning literature.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gWI7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gWI7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gWI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png" width="1456" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505048,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gWI7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 424w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 848w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!gWI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4e6f45-c169-4d06-887e-9a23b5bf1a55_2380x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Zero-shot classification (left [4]) and CLIP training efficiency (right [4])</figcaption></figure></div><p><strong>Using CLIP.</strong> Although the CLIP model is trained with both an image and text encoder, <em>most of the work we will see in this overview only uses the image encoder from CLIP</em>. The key contribution of CLIP is not the model architecture, but rather the training objective. Using both an image and text encoder allows us to train the image encoder using the contrastive objective described above, which is very efficient (see above) and does not rely on large amounts of supervised data. The CLIP model architecture can be useful as a whole; e.g., we can use it to perform zero-shot image classification as shown above. However, <em>we can also train a CLIP model solely for the purpose of obtaining a high-quality image encoder</em>!</p><h4>From Images to Videos</h4><p>To process an image with an LLM, we can simply pass this image to an image encoder (e.g., CLIP) to produce a set of vectors&#8212;<em>or embeddings</em>&#8212;to represent this image. Then, the LLM can take these embeddings as an additional input (we will cover more details on this later in the overview). However, <em>what if we have access to a video instead of an image?</em> Interestingly, processing video inputs with an LLM is not that much different than processing image inputs&#8212;<em>we just need some strategy for converting this video into a set of vectors, similarly to an image</em>! </p><p><strong>What is a video?</strong> At the simplest level, a video is just an ordered list of images, commonly referred to as &#8220;frames&#8221;. Usually, images are stored in <a href="https://en.wikipedia.org/wiki/RGB_color_model">RGB format</a>. For example, the image in the figure below has three color channels&#8212;<em>red, blue and green</em>&#8212;as well as a height and width of five. The size of this image is <code>3 (color channels) &#215; 5 (height) &#215; 5 (width)</code>. We can also stack several images into a mini-batch of images, forming a tensor of size <code>batch &#215; 3 &#215; 5 &#215; 5</code>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Acuw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Acuw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 424w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 848w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 1272w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Acuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png" width="532" height="268.9230769230769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:532,&quot;bytes&quot;:94760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Acuw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 424w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 848w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 1272w, https://substackcdn.com/image/fetch/$s_!Acuw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5357ad54-32e7-44d4-9fcc-0236114a062c_1630x824.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparing the data structure of images and videos</figcaption></figure></div><p>The structure of video is not much different&#8212;<em>a video is just a collection of ordered frames</em>. When viewed in the correct temporal order, these frames reveal the movement of a scene through time, forming a video. Similar to images, each of these frames are usually represented in RGB-format, and all frames in a video have the same spatial resolution. For example, the video in the figure above has three frames, each with three color channels and a height and width of five, forming a tensor of size <code>3 (frames) &#215; 3 (color channels) &#215; 5 (height) &#215; 5 (width)</code>. We can also create a mini-batch of videos, but we must make sure that each video has the same number of frames&#8212;<em>this is usually done by extracting fixed-length &#8220;clips&#8221; from the video (e.g., with 64 frames)</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xmqc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xmqc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 424w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 848w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xmqc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png" width="1456" height="867" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:867,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xmqc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 424w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 848w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!xmqc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb253f0e6-f5e7-475f-adee-b5b3dd94430a_1884x1122.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sub-sampling frames in a video</figcaption></figure></div><p><strong>Frame rate.</strong> Videos usually have a fixed number of <a href="https://en.wikipedia.org/wiki/Frame_rate">frames per second (FPS)</a> at which they are recorded. For example, 24 FPS is a common frame rate, which means that each second of the video will contain 24 frames. For watching movies or playing video games, having a granular frame rate is important&#8212;<em>we do not want to have any visually perceptible gaps between the frames of the video</em>. However, neural networks do not need to process videos at this level of granularity. As shown above, we can save computational costs by sub-sampling the frames within a video; e.g., sampling every eighth frame of a 24 FPS video to simulate 3 FPS.</p><p><strong>Encoding a video.</strong> Once we have sub-sampled video frames, we can simply treat a video as a set of images! Usually, we pass each video frame independently through an image encoder like CLIP, yielding a corresponding set of vectors to represent each video frame. Then, an LLM can ingest the vectors for these video frames as an additional input, similarly to an image. But, there is still a problem here: <em>the number of vectors produced for the video is large and sometimes unpredictable because the video can be of any length</em>. We need an additional module to aggregate the frame representations for a video into a single, fixed-sized set of vectors!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5WlV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5WlV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 424w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 848w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 1272w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5WlV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png" width="1456" height="661" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259902,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5WlV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 424w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 848w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 1272w, https://substackcdn.com/image/fetch/$s_!5WlV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0a19ba8-3763-4bc1-baf5-026d74507645_2172x986.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [9])</figcaption></figure></div><p>This is where the <strong>Perceiver</strong> [9] and <strong>Perceiver Resampler</strong> [10] come in handy. The perceiver (shown above) is an attention-based neural network architecture that can ingest high-dimensional input&#8212;<em>such as large set of vectors of variable size produced from the frames of a video</em>&#8212;and output a fixed-size representation based upon this input. Put simply, this means that we can pass all of our video vectors to the Perceiver and it will give us a fixed-size set of vectors in return. Then, we can easily integrate this additional input into an LLM, just like an image! </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eLJC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eLJC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 424w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 848w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 1272w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eLJC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png" width="1456" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:402558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eLJC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 424w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 848w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 1272w, https://substackcdn.com/image/fetch/$s_!eLJC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4277d6a0-7606-4ad8-a8f2-e70308ddd1de_1792x926.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [10])</figcaption></figure></div><p>The Perceiver was originally applied to multi-modal LLMs by Flamingo [10], which proposed the Perceiver Resampler; see above. Flamingo samples video at one FPS (i.e., a single frame from every second of video). Each sub-sampled frame of a video is passed independently through an image encoder<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, producing a corresponding image embedding. Before passing these image embeddings to the text-based LLM, however, we pass them through a Perceiver architecture that produces a fixed (64) number of visual token vectors for the video. Then, <em>we integrate these vectors into the LLM using cross-attention as described before</em>. </p><h2>vLLM Architectures and Training Strategies</h2><p>We now understand most background concepts relevant to vLLMs. Next, we will use these concepts to build an understanding of vLLMs from the ground up. In this section, we will focus on the architectures and training strategies that are commonly used to create vLLMs. We will keep this discussion conceptual for now, then apply these ideas to implementing a real vLLM in the next section.</p><h4>vLLM Architecture Variants</h4><p>The architecture of a vLLM always has two primary components: the LLM backbone and the vision encoder. The LLM backbone is just a standard decoder-only transformer, while the vision encoder is usually a CLIP / ViT model (with an optional Perceiver Resampler if we want to handle video-based inputs). There are two common vLLM architecture variants that fuse these components together: <em>the unified embedding and cross-modality attention architecture</em>. We use the naming scheme for these architectures proposed by <a href="https://sebastianraschka.com/">Sebastian Raschka</a> in his<a href="https://magazine.sebastianraschka.com/p/understanding-multimodal-llms"> great overview of vLLMs</a>. Now, let&#8217;s learn about how these architectures work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9ZHQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 424w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 848w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 1272w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png" width="588" height="300.46153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42d39309-0469-4908-9b37-5204415f85c1_1648x842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1456,&quot;resizeWidth&quot;:588,&quot;bytes&quot;:148883,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 424w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 848w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 1272w, https://substackcdn.com/image/fetch/$s_!9ZHQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42d39309-0469-4908-9b37-5204415f85c1_1648x842.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Generating token vectors from raw text</figcaption></figure></div><p><strong>Token vectors.</strong> The LLM backbone takes raw text as input, but this text is first tokenized into a set of discrete tokens and converted into token vectors by retrieving the corresponding embedding for each token from an embedding layer; see above. This set of token vectors can be directly passed as input to the decoder-only transformer architecture. Similarly for images (or videos), we produce a set of token vectors from the vision encoder by passing an image or video through the vision encoder, which returns a set of visual token vectors as output; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rNP6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rNP6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 424w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 848w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png" width="448" height="370.7826086956522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1066,&quot;width&quot;:1288,&quot;resizeWidth&quot;:448,&quot;bytes&quot;:272753,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rNP6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 424w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 848w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1272w, https://substackcdn.com/image/fetch/$s_!rNP6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd022205-f5bc-4580-a4b3-3a03648d37d1_1288x1066.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Generating image token vectors with a vision encoder</figcaption></figure></div><p><strong>Unified embedding.</strong> Now, we have a set of text and image (or video) token vectors as input. The first common vLLM architecture simply:</p><ol><li><p>Concatenates these two modalities of vectors together, forming a single sequence of token vectors.</p></li><li><p>Passes these concatenated vectors directly as input to a decoder-only transformer architecture. </p></li></ol><p>This architecture, referred to as a unified embedding architecture, is depicted in the figure below. Notably, the size of the visual token vectors may not match that of the text token vectors. So, we usually linearly project the token vectors from the vision encoder into the correct dimension prior to concatenation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e_fX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e_fX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 424w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 848w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e_fX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png" width="630" height="487.21153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1126,&quot;width&quot;:1456,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:258501,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!e_fX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 424w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 848w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!e_fX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f88da96-bb2a-49c7-a3db-171fa92bb2fa_1498x1158.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Unified Embedding Architecture</figcaption></figure></div><p>The unified embedding architecture is conceptually simple, but it increases the length of input passed to the LLM, which can cause a significant corresponding increase in computational cost during both training and inference. <em>These visual tokens are passed through every layer of our powerful LLM backbone</em>! Luckily, we can get around this issue by using a slightly different kind of vLLM architecture.</p><p><strong>Cross-modality attention.</strong> Instead of concatenating text and vision token vectors, we can just pass the text token vectors as input to the LLM. To incorporate vision info, we can add extra cross-attention modules that perform cross-attention between the text and vision token vector into select layers of the LLM&#8212;<em>usually every second or fourth layer</em>. This architectural variant is usually referred to as a cross-modality attention architecture; see below for a depiction. Notably, this architecture looks very similar to the original transformer decoder&#8212;<em>we just perform cross attention with the image encoder instead of the transformer encoder</em>!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ln4p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ln4p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 424w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 848w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ln4p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png" width="1456" height="944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:306483,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Ln4p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 424w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 848w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 1272w, https://substackcdn.com/image/fetch/$s_!Ln4p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc46c381b-da70-4f32-9067-570ca1fbb56b_1660x1076.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Cross-Modality Attention Architecture</figcaption></figure></div><p>The benefit of this architecture is that we do not increase the length of the input passed to the LLM. Rather, we merge visual information into the LLM by using cross-attention, which is much more computationally efficient. Additionally, the cross-modality attention architecture adds new layers into the model architecture for fusing visual and textual information, rather than relying on the existing layers of the LLM to perform this fusion. For this reason, <em>we can actually leave the LLM backbone fixed during training and only train the added layers</em>, thus ensuring that the LLM&#8217;s performance on text-only tasks is not changed at all.</p><h4>How do we train vLLMs?</h4><p>In this overview, we will only consider LLMs that can ingest visual inputs&#8212;<em>these models still only generate text as output</em>. So, we can train these models similarly to any other LLM: using <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next-token prediction</a>. Even for the unified embedding architecture, we primarily train the model by predicting textual tokens&#8212;<em>we do not typically try to predict visual tokens (i.e., perform next-image prediction)</em>. </p><blockquote><p><em>&#8220;The visual encoding of Gemini models is inspired by our own foundational work on Flamingo, CoCa, and PaLI, with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens.&#8221;</em> - from [11]</p></blockquote><p>Going beyond the training objective, however, there are several strategies that we can follow for training a vLLM. For example, we could perform <strong>native multi-modal training</strong>, meaning that we initialize all components of the architecture from scratch and train the model using multi-modal data (i.e., text, images, videos and more) from the beginning; e.g., this approach is used to train Gemini [11].</p><p>In practice, however, native multi-modality is complex and difficult. There are many issues that we may encounter when using such approach:</p><ul><li><p>Getting access to a large volume of paired image-and-text data is hard.</p></li><li><p>Efficient tokenization of visual data at pretraining scale is hard.</p></li><li><p>Imbalances between modalities can arise; e.g., the model may learn to ignore images because text usually provides enough info for next token prediction. </p></li></ul><p>For these reasons, vLLMs are more frequently trained using a <strong>compositional approach</strong>. Specifically, this means that we start by pretraining the LLM backbone and the vision encoder independently. Then, we have an additional training phase&#8212;<em>we will call this the fusion stage</em>&#8212;that combines the text and vision models together into a single vLLM. This approach has several benefits:</p><ul><li><p>The development of text and image models can be parallelized.</p></li><li><p>Existing text-based LLMs&#8212;<em>which are very powerful and advanced</em>&#8212;can be used as a starting point for training vLLMs.</p></li><li><p>A much larger volume of data is available because we can use text-only, vision-only, and paired text-and-vision data for training.</p></li></ul><p>During the fusion phase, we may or may not train the full vLLM architecture. For example, when using a cross-modality attention architecture, we can freeze the LLM backbone during fusion and only train the cross-attention and vision encoder layers. Such an approach is common in the literature because it allows us to start with an existing, text-based LLM and create a corresponding vLLM without making any modifications to the underlying LLM backbone. As we will see, this was the exact approach used to train the LLaMA-3.2 Vision models.</p><h2>LLaMA-3.2 Vision: Powerful, Open vLLMs</h2><p>Now that we understand the concepts underlying vLLMs, let&#8217;s take a look at a practical case study. The LLaMA-3 [1] LLMs were originally text-only but have since been extended to handle image (and video) inputs. These models are also (mostly) open source<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, so we can gain a deep understanding of them by <em>i)</em> studying the details provided in their corresponding technical reports and <em>ii)</em> looking at their code. In this section, we will study in detail how the LLaMA-3 suite of LLMs has been extended to create a corresponding suite of vLLMs. </p><h4><a href="https://arxiv.org/abs/2407.21783">Extending LLaMA-3 to Images and Video</a> [1]</h4><p>Proposed in [1], LLaMA-3 is one of the most popular and powerful suites of open-source LLMs. LLaMA-3 models are all dense&#8212;<em>meaning they do not use an <a href="https://cameronrwolfe.substack.com/p/moe-llms">MoE architecture</a></em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>&#8212;and come in three different sizes: 8B, 70B, and 405B. These models improve upon prior <a href="https://cameronrwolfe.substack.com/p/llama-2-from-the-ground-up">LLaMA-2 models</a> by an order of magnitude&#8212;<em>they have a 30&#215; larger context window (128k vs. 4k), use a 30&#215; (15.6T tokens vs. 1.8T tokens) larger dataset, and are trained using 50&#215; the amount of compute.</em> </p><blockquote><p><em>&#8220;We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks&#8230; The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach.&#8221;</em> - from [2]</p></blockquote><p>The initial LLaMA-3 models only accept text as input. However, authors include experiments in [1] that incorporate both vision (i.e., image and video) and speech features. <em>We will learn how LLaMA-3 is trained on visual inputs in this section</em>. </p><p><strong>Compositional vLLMs.</strong> LLaMA-3 follows a compositional approach to creating a multi-modal model. We begin by independently pretraining both a vision encoder and a text-only LLM. Here, the text-only LLM is the text-based LLaMA-3 model, while the vision encoder is a pretrained CLIP model. Adopting a cross-modality attention architecture, we then insert cross-attention layers between these two models and focus on training these extra layers. We will refer to these cross-attention layers as an &#8220;image adapter&#8221; for convenience. By doing this, the LLM is taught to incorporate additional visual features when generating output. </p><p>The <strong>vision encoder</strong> for LLaMA-3 is based upon the ViT [3] architecture&#8212;<em>the 630M parameter ViT-H</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><em> model in particular</em>&#8212;and is pretrained via a contrastive objective on 2.5B image-text pairs. In other words, <em>this model is nearly identical to the image encoder component of the CLIP [4] architecture!</em> We create visual features with this model by passing an image through the model and extracting the corresponding embeddings; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uNLo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uNLo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 424w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 848w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 1272w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uNLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png" width="490" height="490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1178,&quot;width&quot;:1178,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:228172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uNLo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 424w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 848w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 1272w, https://substackcdn.com/image/fetch/$s_!uNLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54ca3ac2-9461-4c6b-b22c-ec7535f726cc_1178x1178.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Concatenating embeddings from multiple ViT layers</figcaption></figure></div><p>Notably, we know from <a href="https://arxiv.org/abs/2312.00784">prior research</a> that image encoders trained with contrastive (CLIP-style) objectives capture semantic information but fail to capture the fine-grained perceptual details of an image. For this reason, any LLM relying upon such visual features may fail to answer questions that require exact localization within an image; see below from an example with <a href="https://cdn.openai.com/papers/GPTV_System_Card.pdf">GPT-4V</a>. As shown above, this issue is addressed in LLaMA-3 by extracting visual features from several different layers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> of the vision encoder and concatenating them together.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dee5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dee5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 424w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 848w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dee5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png" width="330" height="353.3955223880597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1148,&quot;width&quot;:1072,&quot;resizeWidth&quot;:330,&quot;bytes&quot;:407235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dee5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 424w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 848w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 1272w, https://substackcdn.com/image/fetch/$s_!Dee5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F266ca8c9-2059-43b2-aa4e-ff22f37e7608_1072x1148.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://arxiv.org/abs/2312.00784">source</a>)</figcaption></figure></div><p>LLaMA-3 also adds several additional self-attention layers after the image encoder and prior to fusion with the LLM&#8212;<em>the final image encoder has a total of 850M parameters</em>. This encoder produces 7680-dimensional embeddings for each patch in the input image, and each image has <code>16 &#215; 16 = 256</code> patches in total. </p><blockquote><p><em>&#8220;We introduce cross-attention layers between the visual token representations produced by the image encoder and the token representations produced by the language model.&#8221;</em> - from [1]</p></blockquote><p><strong>Image adapter.</strong> To incorporate features from the image encoder into LLaMA-3, we use a cross-attention-based image adapter. More specifically, cross-attention layers, which compute attention between the textual tokens of the LLM and the image embeddings of the image encoder, are added to every fourth transformer block of the LLM. These cross-attention layers significantly increase the size of the model; e.g., LLaMA-3-405B has ~500B parameters with the image adapter. However, the image adapter allows the LLM to incorporate information from the image encoder into its token representations when generating text. </p><p><strong>Video adapter.</strong> In addition to images, authors in [1] extend LLaMA-3 to support video inputs. Given that videos are just a sequence of images (or frames), we do not have to significantly modify the existing architecture. The model takes 64 frames as input, each of which is passed through the existing image encoder; see below. To capture the temporal relationship between frames, we use a Perceiver Resampler, which aggregates the representation of 32 consecutive frames into one. Finally, additional video cross-attention layers are added into the LLM.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S8ko!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S8ko!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 424w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 848w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 1272w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S8ko!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png" width="1456" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:263970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S8ko!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 424w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 848w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 1272w, https://substackcdn.com/image/fetch/$s_!S8ko!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe96d2d1d-97f2-4d42-a94a-08219ec4deee_1604x998.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The full architecture of the multi-modal LLaMA-3, including both video and image components, is shown above. Here, we can see that both image and video inputs are first processed via the image encoder, then incorporated into the LLM via cross-attention layers. For videos, we add an extra aggregation module&#8212;<em>the Perceiver Resampler</em>&#8212;to capture the sequential relationship between video frames.</p><p><strong>Pretraining dataset.</strong> Both the image encoder and cross-attention layers are trained on a large dataset of image-text pairs. This dataset is filtered to <em>i)</em> remove non-English captions, <em>ii)</em> remove duplicates, iii) remove low-quality data, and <em>iv)</em> maximize diversity (i.e., based on n-gram <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> scores). A very similar process is followed to collect video-text pairs for training the video adapter.</p><p>To improve the document understanding capabilities of LLaMA-3, authors in [1] also concatenate <a href="https://en.wikipedia.org/wiki/Optical_character_recognition">OCR</a> output to the end of each textual caption and collect a large number of documents&#8212;<em>represented as images</em>&#8212;with associated text. Other notable sources of multi-modal training data for LLaMA-3 include:</p><ul><li><p><em>Visual grounding</em>: noun phrases in the text are linked to bounding boxes / masks in the image that are either overlaid in the image or specified via (normalized) coordinates in the text.</p></li><li><p><em>Screenshot parsing</em>: screenshots from HTML code are rendered and the model is asked to predict the code that produced an element&#8212;<em>indicated by an overlaid bounding box</em>&#8212;in the screenshot.</p></li><li><p><em>Question-answer pairs</em>: a large volume of QA data from several sources.</p></li><li><p><em>Synthetic captions</em>: images with synthetic captions generated by an early version of LLaMA-3. Authors in [1] observe that synthetic captions tend to be more comprehensive than the original human-written captions. </p></li><li><p><em>Synthetic structured images</em>: charts, tables, flowcharts, math equations, and more accompanied by a structured representation (e.g., markdown or LaTeX). </p></li></ul><p><strong>Image adapter training.</strong> Prior to training the image adapter, the image encoder is pretrained for several epochs over the image-text pairs in the dataset described above. When training the adapter, the weights of the image encoder are not fixed&#8212;<em>they continue to be updated</em>. However, the LLM weights are frozen during this training process. As a result, the LLM backbone of the multi-modal LLaMA-3 model is identical to text-only LLaMA-3, <em>ensuring parity on text-only tasks</em>.</p><p>The image adapter is trained in two phases, both of which use a <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">standard language modeling objective</a> applied on the textual caption. In the first phase, all images are resized to a lower resolution to make the training process as efficient as possible. This initial training phase is followed by a second, shorter phase in which we increase the resolution of images and use a smaller (sampled) version of the original dataset that emphasizes the highest quality data. After both training phases are complete, we train the video adapter&#8212;<em>beginning with the fully-trained image encoder and adapter</em>&#8212;over the video-text dataset using a similar process.</p><blockquote><p><em>&#8220;After pre-training, we fine-tune the model on highly curated multi-modal conversational data to enable chat capabilities. We further implement direct preference optimization (DPO) to boost human evaluation performance and rejection sampling to improve multi-modal reasoning capabilities.&#8221; - from [1]</em></p></blockquote><p><strong>Post training.</strong> Similar to the text-based LLaMA-3 model, multi-modal models undergo an entire post training procedure that aligns the model to human preferences, teaches it how to follow instructions, improves its ability to handle conversational inputs and more. Similarly to the text-only LLaMA-3 model, multi-modal models are post trained using a combination of <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a>, <a href="https://arxiv.org/abs/2110.14168">rejection sampling (RS)</a> and <a href="https://arxiv.org/abs/2305.18290">direct preference optimization (DPO)</a> applied multiple times sequentially (i.e., in &#8220;rounds&#8221;). This process is depicted below, and a full overview of post training for LLaMA-3 can be found <a href="https://www.interconnects.ai/p/frontier-model-post-training">here</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yqyc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yqyc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 424w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 848w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yqyc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png" width="1276" height="614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1276,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112440,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yqyc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 424w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 848w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Yqyc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcd982d-ef70-4c48-be86-79ae58b6496b_1276x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Unlike when we are training the image encoder and adapter, we do not use the weights of the base LLaMA-3 model for our LLM during post training. Instead, we replace the weights of this base model with those of the <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct">LLaMA-3-Instruct</a> model, which has already undergone extensive post training. The dataset for post training is collected from a variety of sources:</p><ul><li><p>Academic datasets that have been converted into conversational format either via a template or by rewriting with an LLM.</p></li><li><p>Human-annotated datasets that are collected by <em>i)</em> providing a seed image or video and asking the human to write an associated conversation or <em>ii)</em> asking humans to compare model outputs to form preference pairs. </p></li><li><p>Synthetic datasets collected by giving the text representation (i.e., caption) of an image or video to an LLM and prompting the model to generate related question-answer pairs.</p></li><li><p>Existing model outputs that have been subtly (but meaningfully) perturbed by an LLM to produce an error, thus forming a preference pair.</p></li></ul><p>Several unique strategies are adopted to optimize the post trained model&#8217;s performance. For example, authors train several models&#8212;<em>with different hyperparameters</em>&#8212;at each stage of post training and obtain the final model by taking the average of these models&#8217; weights. This <a href="https://cameronrwolfe.substack.com/p/model-merging">model merging approach</a> outperforms the best model obtained via a <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization">hyperparameter grid search</a>.</p><h4><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">LLaMA-3.2: Medium-Sized Vision LLMs</a> [2]</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pfb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pfb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pfb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png" width="620" height="348.75" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0pfb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 424w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 848w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!0pfb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f18040b-5c65-4ffa-96b6-44647e3bea57_3840x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>Preliminary experiments with multi-modal LLaMA-3 models were provided in [1], but these models were not officially released until LLaMA-3.2 [2]. The 11B and 90B parameter LLaMA-3.2 Vision models<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> were the first LLaMA models to support images as input and have strong capabilities on visual understanding tasks like image captioning and document understanding. Other modalities explored in [1]&#8212;<em>such as speech and video</em>&#8212;were not included in LLaMA-3.2, and no multi-modal version of the largest (405B parameter) LLaMA-3.1 model is released. </p><p><strong>LLaMA-3.2 Vision architecture.</strong> The architecture described in [2] for the LLaMA-3.2 Vision models perfectly matches that of the preliminary models outlined in [1]. These models are comprised of:</p><ul><li><p>A pretrained LLM backbone.</p></li><li><p>A pretrained vision encoder.</p></li><li><p>Several cross-attention layers between the LLM and vision encoder.</p></li></ul><p>The LLM backbone for LLaMA-3.2 is simply the text-only LLaMA-3.1-8B and LLaMA-3.1-70B models. The vision LLMs are trained in several stages on image-text pairs, but the LLM backbone is not updated during training&#8212;<em>we only update the image encoder and adapter layers</em>. As a result, the performance of LLaMA-3.2 Vision models on text-only tasks is left in tact relative to LLaMA-3.1. </p><p><strong>Stages of training.</strong> As mentioned previously, the LLaMA-3.2 Vision models are trained in multiple stages. First, we must pretrain the LLM backbone and image encoder independently of each other. We then integrate these models together by adding cross attention layers between the two models and pretrain the combined vision model over a large (and noisy) dataset of image-text pairs. Lastly, we train the model further on a medium-sized dataset of higher-quality, enhanced data and perform post training. The post training strategy for vision models includes several rounds of SFT, rejection sampling and DPO (i.e., same as LLaMA-3.1).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eP5i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eP5i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 424w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 848w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eP5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png" width="1452" height="1116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1116,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:192981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eP5i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 424w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 848w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!eP5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05814b16-71ad-4958-8622-d1e662c48939_1452x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p><strong>Model evaluation.</strong> For text-based tasks, the performance of LLaMA-3.2 models is identical to that of LLaMa-3.1&#8212;<em>the LLM backbone is left unchanged by the multi-modal pretraining process</em>. However, authors also evaluate the LLaMA-3.2 Vision models across a wide range of visual understanding tasks in [2]; see above. Most notably, these models have strong performance on tasks that involve documents, charts or diagrams. Such an ability is not surprising given that the model is trained over a large number of document-text pairs, as well as synthetic images of charts and tables. On other visual understanding tasks, LLaMA-3.2 continues to perform well and is competitive with several leading foundation models. </p><h4>LLaMA-3.2 Vision Implementation</h4><p>Now that we&#8217;ve learned about the LLaMA-3.2 Vision models, let&#8217;s take a deeper look at their implementation. To do this, we will study their code in <a href="https://github.com/pytorch/torchtune">torchtune</a>. For simplicity, we will omit some details from the implementation and instead present pseudocode that outlines the key modeling components. However, those who are interested can always read through the <a href="https://github.com/pytorch/torchtune/tree/main/torchtune/models/llama3_2_vision">full code</a> in torchtune!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p6XN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p6XN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 424w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 848w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p6XN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png" width="1456" height="1113" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1113,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p6XN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 424w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 848w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!p6XN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e163a5c-aa41-48ce-a6b8-1fea597ef0a0_1476x1128.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Top-level structure.</strong> If we look at the primary function for instantiating a LLaMA-3.2 Vision architecture, we will see that the model is&#8212;<em>as we should expect</em>&#8212;made up of two primary components: an image encoder and an LLM backbone (called the vision decoder above). These two models are combined in a <code>FusionModel</code>. As shown above, we can toggle the trainable components of this <code>FusionModel</code>, which handles setting each model component as trainable or not and passes the output of the vision encoder to the vision decoder in a generic fashion. </p><pre><code># compute the output of the vision encoder
encoder_embed = None
if encoder_input is not None:
    encoder_embed = self.encoder(**encoder_input)

# pass the vision encoder output to the vision decoder
output = self.decoder(
   tokens=tokens,
   mask=mask,
   encoder_input=encoder_embed,
   encoder_mask=encoder_mask,
   input_pos=input_pos,
)</code></pre><p>Notably, the input-output structure of the <code>FusionModel</code> is identical to that of a standard transformer decoder in PyTorch<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>&#8212;<em>these two types of models can be used interchangeably</em>. As shown in the code above, we can also supply an encoder mask that allows us to mask any image tokens from chosen textual tokens.</p><blockquote><p><em>&#8220;DeepFusion is a type of fused model architecture where a pretrained encoder is combined with a pretrained decoder (LLM)&#8230; This module makes no assumptions on how the encoder and decoder are fused; it simply passes in the encoder embeddings to the decoder and lets the decoder handle any fusion.&#8221;</em> - <a href="https://github.com/pytorch/torchtune/blob/main/torchtune/modules/model_fusion/_deep_fusion.py">source</a></p></blockquote><p>The <strong>vision encoder</strong> used by LLaMA-3.2 Vision is a standard, <a href="https://pytorch.org/torchtune/0.5/generated/torchtune.models.clip.clip_vision_encoder.html#torchtune.models.clip.clip_vision_encoder">CLIP-based vision encoder</a>. This encoder passes an input image through CLIP to retrieve a set of image embeddings. From here, we do not directly pass the output of CLIP to the vision decoder&#8212;<em>there is an additional </em><code>VisionProjectionHead</code><em> module that sits between CLIP and the vision decoder</em>. The implementation is provided below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FwWD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FwWD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 424w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 848w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 1272w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FwWD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png" width="1456" height="2859" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2859,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:591383,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FwWD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 424w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 848w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 1272w, https://substackcdn.com/image/fetch/$s_!FwWD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9aed6e0-ffe0-490f-b768-e30454fb5ea6_1542x3028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This module passes the CLIP embeddings through several extra self-attention layers prior to being ingested by the vision decoder. Additionally, the projection head pulls features from several hidden layers of the CLIP model&#8212;<em>instead of just taking the final layer&#8217;s output</em>&#8212;to ensure that perceptual information is not lost. All of these embeddings are concatenated together and linearly projected so that they match the size of textual token vectors used by the vision decoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tv5f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tv5f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 424w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 848w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 1272w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tv5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png" width="1456" height="2420" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2420,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:612645,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/158954054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tv5f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 424w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 848w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 1272w, https://substackcdn.com/image/fetch/$s_!tv5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3d9f263-68b8-4df6-9fad-113744c8755b_1664x2766.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <strong>vision decoder</strong> for LLaMA-3.2 Vision is nearly identical to a standard, text-based LLM; see above. We just modify this architecture to add cross-attention layers to a subset of the layers in the decoder. To do this, a <code>FusionLayer</code> is used, which keeps the parameters of the cross-attention layer and decoder block separate. This way, <em>we can toggle whether each of these components should be trained or not</em>. For example, LLAMA-3.2 trains the cross-attention layers and leaves the LLM backbone fixed throughout the multi-modal training process.</p><h2>Closing Remarks</h2><p>The primary takeaway that we should glean from this overview is the fact that vLLMs are not much different than standard, text-based LLMs. We simply add an additional image encoder to this model, as well as some extra layers to fuse the two models together. The fusion between the image encoder and the text-based LLM can be accomplished either via a unified embedding architecture or with cross-modality attention. From here, we can just train this combined model (in multiple phases) over image-text pairs, forming a powerful vLLM. Many variants of vLLMs exist, <em>but the fundamental ideas behind them really are that simple</em>!</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Machine Learning Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. If you like the newsletter, please subscribe, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Grattafiori, Aaron, et al. "The llama 3 herd of models." <em>arXiv preprint arXiv:2407.21783</em> (2024).</p><p>[2] Meta LLaMA Team. &#8220;Llama 3.2: Revolutionizing edge AI and vision with open, customizable models&#8221; https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (2024).</p><p>[3] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." <em>arXiv preprint arXiv:2010.11929</em> (2020).</p><p>[4] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." <em>International conference on machine learning</em>. PmLR, 2021.</p><p>[5] Joulin, Armand, et al. "Learning visual features from large weakly supervised data." <em>Computer Vision&#8211;ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11&#8211;14, 2016, Proceedings, Part VII 14</em>. Springer International Publishing, 2016.</p><p>[6] Desai, Karan, and Justin Johnson. "Virtex: Learning visual representations from textual annotations." <em>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</em>. 2021.</p><p>[7] Sohn, Kihyuk. &#8220;Improved deep metric learning with multi-class n-pair loss objective.&#8221; Advances in neural information processing systems 29 (2016).</p><p>[8] Vaswani, Ashish, et al. "Attention is all you need." <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[9] Jaegle, Andrew, et al. "Perceiver: General perception with iterative attention." <em>International conference on machine learning</em>. PMLR, 2021.</p><p>[10] Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." <em>Advances in neural information processing systems</em> 35 (2022): 23716-23736.</p><p>[11] Team, Gemini, et al. "Gemini: A family of highly capable multimodal models, 2024." <em>arXiv preprint arXiv:2312.11805</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Obviously, the decoder-only transformer has no encoder component, so the cross attention modules are simply removed from this architecture. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>When we compute attention scores, we divide all attention scores by the square root of <code>d</code>, the size of the vectors being used for self-attention. This is called scaled dot product attention, and performing this division helps to improve training stability. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Prior to the proposal of ViTs, the most commonly-used architecture for computer vision tasks was convolutional neural networks (CNNs), or <a href="https://arxiv.org/abs/1512.03385">ResNets</a> in particular. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Specifically, Flamingo uses a <a href="https://arxiv.org/abs/1512.03385">ResNet</a> architecture to produce image embeddings, but we could also use CLIP (the more commonly-used vision encoder for LLMs). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>See <a href="https://www.interconnects.ai/p/an-open-source-llm">this writeup</a> for a deeper overview of the actual definition of open source and different kinds of &#8220;open&#8221; LLMs that exist. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In [1], authors mention that they avoid using an MoE architecture due to their design principle of maximizing simplicity. MoEs are more complex and difficult to train. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The &#8220;H&#8221; here just stands for &#8220;Huge&#8221;. This is the biggest ViT architecture in terms of total parameters explored in [3]. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The motivation for this strategy is that different layers of the ViT will capture different kinds of information. For example, the early layers of the model are likely to capture low-level spatial details, while later layers capture semantic information; see <a href="https://arxiv.org/abs/1311.2901">this paper</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Lightweight models with 1B and 3B parameters were also released as part of LLaMA-3.2, but these models only support textual input. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Remember, the transformer decoder has the option to provide a sequence of token vectors from an encoder as input by default. In the standard transformer, the encoder is the text encoder from the full encoder-decoder architecture. For LLaMA-3.2 Vision, the encoder is a vision encoder!</p></div></div>]]></content:encoded></item><item><title><![CDATA[nanoMoE: Mixture-of-Experts (MoE) LLMs from Scratch in PyTorch]]></title><description><![CDATA[An introductory, simple, and functional implementation of MoE LLM pretraining...]]></description><link>https://cameronrwolfe.substack.com/p/nano-moe</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/nano-moe</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Mon, 10 Mar 2025 09:33:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/868fce62-b8a5-4ae9-8c71-71494ff27787_2394x1342.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_RW0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_RW0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 424w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 848w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_RW0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/667e0409-7c24-4510-bd91-355333224863_2394x1342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1011266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_RW0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 424w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 848w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 1272w, https://substackcdn.com/image/fetch/$s_!_RW0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F667e0409-7c24-4510-bd91-355333224863_2394x1342.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Research on large language models (LLMs) has progressed at a shocking pace over the last several years. However, the architecture upon which most LLMs are based&#8212;<em>the <a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only transformer</a></em>&#8212;has remained fixed despite the chaotic and rapid advancements in this field. More recently, we are starting to see a new<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> architecture, called a Mixture-of-Experts (MoE), being adopted in top research labs. For example, GPT-4 is rumored to be MoE-based, as well as the recently-proposed&#8212;<em>and very popular</em>&#8212;<a href="https://arxiv.org/abs/2412.19437">DeepSeek-v3</a> and <a href="https://arxiv.org/abs/2501.12948">R1</a> models; see below.</p><blockquote><p><em>&#8220;To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.&#8221;</em> - from [8]</p></blockquote><p>MoE-based LLMs use a modified version of the decoder-only transformer that has become popular due to an ability to make the training and usage of large models more efficient. MoE-based LLMs are very large in terms of their number of total parameters. However, only a subset of these parameters&#8212;<em>selected dynamically during inference</em>&#8212;are used when computing the model&#8217;s output. The sparsity of MoEs <a href="https://cameronrwolfe.substack.com/i/154340424/the-pros-and-cons-of-using-moes">drastically reduces</a> the cost of very large and powerful LLMs.</p><p>Given that many frontier LLMs are starting to use MoE-based architectures, developing an in-depth understanding of MoEs is important. In this post, we will take a step in this direction by building (and pretraining) a mid-sized MoE model&#8212;<em>called nanoMoE</em>&#8212;from scratch in PyTorch. All of the code for nanoMoE is available in the repository below, which is a fork of <a href="https://karpathy.ai/">Andrej Karpathy</a>&#8217;s <a href="https://github.com/karpathy/nanoGPT">nanoGPT</a> library that has been expanded to support MoE pretraining. To understand how nanoMoE works, we will start by outlining necessary background information. Then, we will build each component of nanoMoE from the ground up, eventually culminating in a (successful) pretraining run for the model. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/wolfecameron/nanoMoE&quot;,&quot;text&quot;:&quot;nanoMoE Repository&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/wolfecameron/nanoMoE"><span>nanoMoE Repository</span></a></p><h2>Basics of Decoder-Only Transformers</h2><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;fe312098-bc2d-4418-9057-9e61a46db77e&quot;,&quot;caption&quot;:&quot;The current pace of AI research is staggering. Keeping up with the most recent publications is a difficult feat, leaving even experts in the field feeling as if they are failing to grasp the finer details of this evolving frontier. In the domain of large language models (LLMs) especially, impactful research is being released constantly, inc&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Decoder-Only Transformers: The Workhorse of Generative LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-04T09:33:07.426Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e3c9db5-400a-49de-a235-e09bc3aa3689_2392x1342.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142044446,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:96,&quot;comment_count&quot;:14,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In order to understand MoE-based LLMs, we first need to understand the standard architecture upon which most LLMs are based&#8212;<em>the decoder-only transformer architecture</em>. This architecture is a modified version of the encoder-decoder transformer architecture [1] that was popularized by <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>.  Although we have studied this architecture deeply in prior posts (see above), we will go over it again here, as this knowledge is essential to the rest of the post. While explaining the architecture, we will rely on Andrej Karpathy&#8217;s <a href="https://github.com/karpathy/nanoGPT">nanoGPT</a>&#8212;<em>a minimal and functional implementation of decoder-only transformers</em>&#8212;as a reference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qc6a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qc6a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 424w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 848w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png" width="368" height="448.70175438596493" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1112,&quot;width&quot;:912,&quot;resizeWidth&quot;:368,&quot;bytes&quot;:228575,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qc6a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 424w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 848w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!qc6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0c20b8a-b589-4282-8be7-e2509f4e0803_912x1112.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Original architecture. </strong>The transformer, originally proposed for solving machine translation tasks in [1], has both an encoder and a decoder module; see above. We will not focus on the full (encoder-decoder) transformer here. However, a detailed (and widely cited) overview of this architecture can be found <a href="https://jalammar.github.io/illustrated-transformer/">here</a>. </p><p>The decoder-only transformer, which is more commonly-used for modern LLMs, simply removes the encoder from this architecture and uses only the decoder<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, as indicated by the name. Practically, this means that every layer of the decoder-only transformer architecture contains the following:</p><ol><li><p>A masked self-attention layer.</p></li><li><p>A feed-forward layer.</p></li></ol><p>To form the full decoder-only transformer architecture, we just stack <code>L</code> of these layers, which are identical in structure but have independent weights, on top of each other. A depiction of this structure is provided in the figure below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aQxq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aQxq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 424w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 848w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1272w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png" width="1456" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:166892,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aQxq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 424w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 848w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1272w, https://substackcdn.com/image/fetch/$s_!aQxq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414bf0b5-2043-4fb5-bdab-e0153f893861_1634x808.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The decoder-only transformer architecture</figcaption></figure></div><p>Let&#8217;s now discuss each component of the architecture in isolation to gain a better understanding. We will start with the input structure for the model, followed by the components of each layer (i.e., self-attention and feed-forward layers) and how they are combined to form the full model architecture.</p><h4>From Text to Tokens</h4><p>As most of us probably know, the input to an LLM is just a sequence of text (i.e., the prompt). However, the input that we see in the figure above is not a sequence of text! Rather, the model&#8217;s input is a list of token vectors. If we are passing text to the model as input, <em>how do we produce these vectors from our textual input?</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m6ce!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m6ce!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 424w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 848w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 1272w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m6ce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png" width="442" height="110.1095406360424" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:1132,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m6ce!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 424w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 848w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 1272w, https://substackcdn.com/image/fetch/$s_!m6ce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56dd3364-44d1-4587-a0b8-3909f1f02f31_1132x282.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Converting raw text into a sequence of tokens</figcaption></figure></div><p><strong>Tokenization.</strong> The first step of constructing the input for an LLM is breaking the raw textual input&#8212;<em>a sequence of characters</em>&#8212;into discrete tokens. This process, called tokenization, is handled by the model&#8217;s <a href="https://huggingface.co/learn/nlp-course/en/chapter2/4">tokenizer</a>. There are many kinds of tokenizers, but Byte-Pair Encoding (BPE) tokenizers [2] are the most common; see <a href="https://www.youtube.com/watch?v=zduSFxRajkE">here</a> for more details. These tokenizers take a sequence of raw text as input and break this text into a sequence of discrete tokens as shown in the figure above. </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136529091\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-tokenizer_example-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-tokenizer_example-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;tokenizer_example.py\&quot;>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>transformers</span> <span class=pl-k>import</span> <span class=pl-v>AutoTokenizer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># load the llama-3.2 tokenizer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>tokenizer</span> <span class=pl-c1>=</span> <span class=pl-v>AutoTokenizer</span>.<span class=pl-c1>from_pretrained</span>(<span class=pl-s>&amp;#39;meta-llama/Llama-3.1-8B&amp;#39;</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># raw text</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>text</span> <span class=pl-c1>=</span> <span class=pl-s>&amp;quot;This raw text will be tokenized&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># create tokens using tokenizer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>tokens</span> <span class=pl-c1>=</span> <span class=pl-s1>tokenizer</span>.<span class=pl-c1>tokenize</span>(<span class=pl-s1>text</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>token_ids</span> <span class=pl-c1>=</span> <span class=pl-s1>tokenizer</span>.<span class=pl-c1>convert_tokens_to_ids</span>(<span class=pl-s1>tokens</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># token_ids = tokenizer.encode(text)  # directly create token ids</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># view the results</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s>&amp;quot;Original Text:&amp;quot;</span>, <span class=pl-s1>text</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s>&amp;quot;Tokens:&amp;quot;</span>, <span class=pl-s1>tokens</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s>&amp;quot;Token IDs:&amp;quot;</span>, <span class=pl-s1>token_ids</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># create token embedding layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>VOCABULARY_SIZE</span>: <span class=pl-smi>int</span> <span class=pl-c1>=</span> <span class=pl-c1>128000</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>EMBEDDING_DIM</span>: <span class=pl-smi>int</span> <span class=pl-c1>=</span> <span class=pl-c1>768</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>token_embedding_layer</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>nn</span>.<span class=pl-c1>Embedding</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>num_embeddings</span><span class=pl-c1>=</span><span class=pl-c1>VOCABULARY_SIZE</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>embedding_dim</span><span class=pl-c1>=</span><span class=pl-c1>EMBEDDING_DIM</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># get token embeddings (IDs must be passed as a tensor, not a list)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>token_emb</span> <span class=pl-c1>=</span> <span class=pl-en>token_embedding_layer</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>tensor</span>(<span class=pl-s1>token_ids</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-tokenizer_example-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-tokenizer_example-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-en>print</span>(<span class=pl-s>f&amp;#39;Token Embeddings Shape: <span class=pl-s1><span class=pl-kos>{</span><span class=pl-s1>token_emb</span>.<span class=pl-c1>shape</span><span class=pl-kos>}</span></span>&amp;#39;</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/82db74244e4c46206f5d7c1336d7f4cd/raw/d40c26b715758b2c99b000bf7360f1bd3cd59b48/tokenizer_example.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/82db74244e4c46206f5d7c1336d7f4cd#file-tokenizer_example-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          tokenizer_example.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist136529091" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-tokenizer_example-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="tokenizer_example.py">
        <tbody><tr>
          <td id="file-tokenizer_example-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-tokenizer_example-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-tokenizer_example-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">transformers</span> <span class="pl-k">import</span> <span class="pl-v">AutoTokenizer</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-tokenizer_example-py-LC3" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-tokenizer_example-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># load the llama-3.2 tokenizer</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-tokenizer_example-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">tokenizer</span> <span class="pl-c1">=</span> <span class="pl-v">AutoTokenizer</span>.<span class="pl-c1">from_pretrained</span>(<span class="pl-s">'meta-llama/Llama-3.1-8B'</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-tokenizer_example-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-tokenizer_example-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># raw text</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-tokenizer_example-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">text</span> <span class="pl-c1">=</span> <span class="pl-s">"This raw text will be tokenized"</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-tokenizer_example-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-tokenizer_example-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># create tokens using tokenizer</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-tokenizer_example-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">tokens</span> <span class="pl-c1">=</span> <span class="pl-s1">tokenizer</span>.<span class="pl-c1">tokenize</span>(<span class="pl-s1">text</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-tokenizer_example-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">token_ids</span> <span class="pl-c1">=</span> <span class="pl-s1">tokenizer</span>.<span class="pl-c1">convert_tokens_to_ids</span>(<span class="pl-s1">tokens</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-tokenizer_example-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># token_ids = tokenizer.encode(text)  # directly create token ids</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-tokenizer_example-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-tokenizer_example-py-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># view the results</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-tokenizer_example-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s">"Original Text:"</span>, <span class="pl-s1">text</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-tokenizer_example-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s">"Tokens:"</span>, <span class="pl-s1">tokens</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-tokenizer_example-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s">"Token IDs:"</span>, <span class="pl-s1">token_ids</span>)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-tokenizer_example-py-LC19" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-tokenizer_example-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># create token embedding layer</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-tokenizer_example-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">VOCABULARY_SIZE</span>: <span class="pl-smi">int</span> <span class="pl-c1">=</span> <span class="pl-c1">128000</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-tokenizer_example-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">EMBEDDING_DIM</span>: <span class="pl-smi">int</span> <span class="pl-c1">=</span> <span class="pl-c1">768</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-tokenizer_example-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">token_embedding_layer</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">nn</span>.<span class="pl-c1">Embedding</span>(</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-tokenizer_example-py-LC24" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">num_embeddings</span><span class="pl-c1">=</span><span class="pl-c1">VOCABULARY_SIZE</span>,</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-tokenizer_example-py-LC25" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">embedding_dim</span><span class="pl-c1">=</span><span class="pl-c1">EMBEDDING_DIM</span>,</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-tokenizer_example-py-LC26" class="blob-code blob-code-inner js-file-line">)</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-tokenizer_example-py-LC27" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-tokenizer_example-py-LC28" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># get token embeddings (IDs must be passed as a tensor, not a list)</span></td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-tokenizer_example-py-LC29" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">token_emb</span> <span class="pl-c1">=</span> <span class="pl-en">token_embedding_layer</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">tensor</span>(<span class="pl-s1">token_ids</span>))</td>
        </tr>
        <tr>
          <td id="file-tokenizer_example-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-tokenizer_example-py-LC30" class="blob-code blob-code-inner js-file-line"><span class="pl-en">print</span>(<span class="pl-s">f'Token Embeddings Shape: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">token_emb</span>.<span class="pl-c1">shape</span><span class="pl-kos">}</span></span>'</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/82db74244e4c46206f5d7c1336d7f4cd/raw/d40c26b715758b2c99b000bf7360f1bd3cd59b48/tokenizer_example.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/82db74244e4c46206f5d7c1336d7f4cd#file-tokenizer_example-py" class="Link--inTextBlock">
          tokenizer_example.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>Packages for training and interacting with LLMs (e.g., <a href="https://huggingface.co/docs/transformers/en/index">HuggingFace</a> or <a href="https://pytorch.org/torchtune/main/index.html">torchtune</a>) provide interfaces for interacting with tokenizers. Additionally, OpenAI has released the <a href="https://github.com/openai/tiktoken">tiktoken</a> package for interacting with GPT tokenizers. The code snippet above tokenizes a textual sequence as follows:</p><ul><li><p><em>Raw Text</em>: <code>This raw text will be tokenized</code></p></li><li><p><em>Tokenized Text</em>: <code>['This', '&#288;raw', '&#288;text', '&#288;will', '&#288;be', '&#288;token', 'ized']</code></p></li></ul><p>Here, the <code>&#288;</code> character indicates that a token immediately follows a whitespace. Such special characters are tokenizer-dependent. For example, many tokenizers instead use a <code>#</code> character to indicate the continuation of a word, which would yield<code>['token', '#ized']</code> for the final two tokens in the above sequence.</p><p><strong>Vocabulary.</strong> Each LLM is trained with a specific tokenizer, though a single tokenizer may be used for several different LLMs. The set of tokens that can be produced by a given tokenizer is also fixed. As such, an LLM has a fixed set of tokens that it understands (i.e., those produced by the tokenizer) and is trained on. This fixed set of tokens is colloquially referred to as the LLM&#8217;s &#8220;vocabulary&#8221;; see below. Vocabulary sizes change between models and depend on several factors (e.g., multilingual models tend to have larger vocabularies), but vocabulary sizes of 64K to 256K total tokens are relatively common for recent LLMs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_81W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_81W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 424w, https://substackcdn.com/image/fetch/$s_!_81W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 848w, https://substackcdn.com/image/fetch/$s_!_81W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 1272w, https://substackcdn.com/image/fetch/$s_!_81W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_81W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png" width="562" height="272.8942307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:562,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_81W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 424w, https://substackcdn.com/image/fetch/$s_!_81W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 848w, https://substackcdn.com/image/fetch/$s_!_81W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 1272w, https://substackcdn.com/image/fetch/$s_!_81W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8aadf17-3bf6-4b79-9688-b6bfbc5840b1_1830x888.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Token vocabulary (and vectors) for an LLM</figcaption></figure></div><p><strong>Token IDs and Embeddings.</strong> Each token in the LLM&#8217;s vocabulary is associated with a unique integer ID. For example, the prior code yields this sequence of IDs when tokenizing our text: <code>[2028, 7257, 1495, 690, 387, 4037, 1534]</code>. Each of these IDs is associated with a vector, known as a token embedding, in an <a href="https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html">embedding layer</a>. An embedding layer is just a large matrix that stores many rows of vector embeddings. To retrieve the embedding for a token, we just lookup the corresponding row&#8212;<em>given by the token ID</em>&#8212;in the embedding layer; see above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2lb3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2lb3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 424w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 848w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2lb3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png" width="440" height="378.09045226130655" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1026,&quot;width&quot;:1194,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2lb3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 424w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 848w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!2lb3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f723f2-056a-4fc0-a3f7-7aa151fe297e_1194x1026.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Input matrix of token embeddings (or vectors)</figcaption></figure></div><p>We now have a list of token embeddings. We can stack these embeddings into a matrix to form the actual input that is ingested by the transformer architecture; see above. In PyTorch, the creation of this matrix is handled automatically by the tokenizer and embedding layer, as shown in the prior code.</p><p>The token embedding matrix is of size <code>[C, d]</code>, where <code>C</code> is the number of tokens in our input and <code>d</code> is the dimension of token embeddings that is adopted by the LLM. We usually have a batch of <code>B</code> input sequences instead of a single input sequence, forming an input matrix of size <code>[B, C, d]</code>. The dimension <code>d</code> impacts the sizes of all layers or activations within the transformer, which makes <code>d</code> an important hyperparameter choice. Prior to passing this matrix to the transformer as input, we also add a positional embedding to each token in the input<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, which communicates the position of each token within its sequence to the transformer. </p><h4>(Masked and Multi-Headed) Self-Attention</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0TwV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0TwV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 424w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 848w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 1272w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0TwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0TwV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 424w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 848w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 1272w, https://substackcdn.com/image/fetch/$s_!0TwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe97978e4-cc11-41e0-8fb4-0010039c3769_1456x818.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, we are ready to pass our input&#8212;<em>a token embedding matrix</em>&#8212;to the decoder-only transformer to begin processing. As previously outlined, the transformer contains repeated blocks with self-attention and a feed-forward transformation, each followed by normalization operations. Let&#8217;s look at self-attention first. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xR2F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xR2F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 424w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 848w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 1272w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xR2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png" width="1556" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1556,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xR2F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 424w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 848w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 1272w, https://substackcdn.com/image/fetch/$s_!xR2F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd593d54a-21a2-4b60-9b71-73d69f8647a7_1556x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>What is self-attention?</strong> Put simply, self-attention transforms the representation of each token in a sequence based upon its relationship to other tokens in the sequence. Intuitively, self-attention bases the representation of each token on the other tokens in the sequence (including itself) that are most relevant to that token. In other words, <em>we learn which tokens to &#8220;pay attention&#8221; to when trying to understand the meaning of a token in our sequence</em>. For example, we see above that the representation for the word <code>making</code> is heavily influenced by the words <code>more</code> and <code>difficult</code>, which help to convey the overall meaning of the sentence. </p><blockquote><p><em>&#8220;An attention function [maps] a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.&#8221;</em> - from [1]</p></blockquote><p><strong>Scaled Dot Product Attention.</strong> Given our input token matrix of size <code>[C, d]</code> (i.e., we will assume that we are processing a single input sequence instead of a batch for simplicity), we begin by projecting our input using three separate linear projections, forming three separate sets of (transformed) token vectors. These projections are referred to as the key, query and value projections; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9XpX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9XpX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 424w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 848w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 1272w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9XpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png" width="522" height="493.59705882352944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1286,&quot;width&quot;:1360,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:164342,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9XpX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 424w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 848w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 1272w, https://substackcdn.com/image/fetch/$s_!9XpX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd271f9ab-5159-4429-95e4-957db67fe2ec_1360x1286.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Creating key, query and value vectors</figcaption></figure></div><p>This naming convention might seem random, but it comes from prior research in <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a>. The intuitive reasoning for the name of each projection is as follows:</p><ul><li><p>A <strong>query</strong> is what you use to search for information. It represents the current token for which we want to find other relevant tokens in the sequence.</p></li><li><p>The <strong>key</strong> represents each other token in the sequence and acts as an index to match the query with other relevant tokens in the sequence.</p></li><li><p>The <strong>value</strong> is the actual information that is retrieved once a query matches a key. The value is used to compute each token&#8217;s output in self-attention.</p></li></ul><p><strong>Computing attention scores. </strong>After projecting the input, we compute an attention score <code>a[i, j]</code> for each pair of tokens <code>[i, j]</code> in our input sequence. Intuitively, this attention score, which lies in the <code>[0, 1]</code> range, captures how much a given token should &#8220;pay attention&#8221; to another token in the sequence&#8212;<em>higher attention scores indicate that a pair of tokens are very relevant to each other.</em> As hinted at above, attention scores are generated using the key and query vectors. We compute <code>a[i, j]</code> by taking the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> of the query vector for token <code>i</code> with the key vector for token <code>j</code>; see below for a depiction of this process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DgOf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DgOf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 424w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 848w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DgOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png" width="1456" height="881" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:881,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DgOf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 424w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 848w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 1272w, https://substackcdn.com/image/fetch/$s_!DgOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5392813f-9332-4965-83ce-cf75b2ea3cb2_2102x1272.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing an attention score for a token pair</figcaption></figure></div><p>We can efficiently compute all pairwise attention scores in a sequence by:</p><ul><li><p>Stacking the query and key vectors into two matrices.</p></li><li><p>Multiplying the query matrix with the transposed key matrix.</p></li></ul><p>This operation forms a matrix of size <code>[C, C]</code>&#8212;<em>called the attention matrix</em>&#8212;that contains all pairwise attention scores over the entire sequence. From here, we divide each value in the attention matrix by the square root of <code>d</code>&#8212;<em>an approach that has been found to improve training stability [1]</em>&#8212;and apply a <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax operation</a> to each row of the attention matrix; see below. After softmax has been applied, each row of the attention matrix forms a valid probability distribution&#8212;<em>each row contains positive values that sum to one.</em> The <code>i</code>-th row of the attention matrix stores probabilities between the <code>i</code>-th token and each other token in our sequence. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CRTj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CRTj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 424w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 848w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 1272w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CRTj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png" width="614" height="289.71016483516485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1456,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:114998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CRTj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 424w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 848w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 1272w, https://substackcdn.com/image/fetch/$s_!CRTj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39953be3-b209-44aa-ac88-2a9cc8b6026d_1734x818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing attention scores and output for self-attention</figcaption></figure></div><p><strong>Computing output. </strong>Once we have the attention scores, deriving the output of self-attention is easy. The output for each token is simply a weighted combination of value vectors, where the weights are given by the attention scores. To compute this output, we simply multiply the attention matrix by the value matrix as shown above. Notably, self-attention preserves the size of its input&#8212;<em>a transformed, </em><code>d</code><em>-dimensional output vector is produced for each token vector within the input</em>.</p><p><strong>Masked self-attention.</strong> So far, the formulation we have learned is for vanilla (or bidirectional self-attention). As mentioned previously, however, decoder-only transformers use masked self-attention, which modifies the underlying attention pattern by &#8220;masking out&#8221; tokens that come after each token in the sequence. Each token can only consider tokens that come before it&#8212;<em>following tokens are masked</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PY6O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PY6O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 424w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 848w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1272w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png" width="1456" height="436" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:160995,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PY6O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 424w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 848w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1272w, https://substackcdn.com/image/fetch/$s_!PY6O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3d910cc-fd59-45dd-b2b6-9452a6f69bf0_2316x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing masked attention scores</figcaption></figure></div><p>Let&#8217;s consider a token sequence <code>[&#8220;LLM&#8221;, &#8220;#s&#8221;, &#8220;are&#8221;, &#8220;cool&#8221;, &#8220;.&#8221;]</code> and compute masked attention scores for the token <code>&#8220;are&#8221;</code>. So far, we have learned that self-attention will compute an attention score between <code>&#8220;are&#8221;</code> and every other token in the sequence. With masked self-attention, however, we only compute attention scores for <code>&#8220;LLM&#8221;</code>, <code>&#8220;#s&#8221;</code>, and <code>&#8220;are&#8221;</code>. <em>Masked self-attention prohibits us from looking forward in the sequence</em>! Practically, this is achieved by simply setting all attention scores for these tokens to negative infinity, yielding a pairwise probability of zero for masked tokens after the application of softmax.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eei9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eei9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 424w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 848w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 1272w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Eei9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png" width="266" height="354.0147058823529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:544,&quot;resizeWidth&quot;:266,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Eei9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 424w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 848w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 1272w, https://substackcdn.com/image/fetch/$s_!Eei9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65c156ae-5cc5-4f7f-8652-dd5311b19beb_544x724.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Attention heads.</strong> The attention operation we have described so far uses softmax to normalize attention scores that are computed across the sequence. Although this approach forms a valid probability distribution, it also limits the ability of self-attention to focus on multiple positions within the sequence&#8212;<em>the probability distribution can easily be dominated by one (or a few) words</em>. To solve this issue, we typically compute attention across multiple &#8220;heads&#8221; in parallel; see above.</p><p>Within each head, the masked attention operation is identical. However, we:</p><ol><li><p> Use separate key, query, and value projections for each attention head.</p></li><li><p>Reduce the dimensionality of the key, query, and value vectors (i.e., this can be done by modifying the linear projection) to reduce computational costs.</p></li></ol><p>More specifically, we will change the dimensionality of vectors in each attention head from <code>d</code> to <code>d // H</code>, where <code>H</code> is the number of attention heads, to keep the computational costs of multi-headed self-attention (relatively) fixed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6keH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6keH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 424w, https://substackcdn.com/image/fetch/$s_!6keH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 848w, https://substackcdn.com/image/fetch/$s_!6keH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 1272w, https://substackcdn.com/image/fetch/$s_!6keH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6keH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png" width="1456" height="658" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:283300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6keH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 424w, https://substackcdn.com/image/fetch/$s_!6keH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 848w, https://substackcdn.com/image/fetch/$s_!6keH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 1272w, https://substackcdn.com/image/fetch/$s_!6keH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1a2682-07ad-4daa-a3ae-f4d3c59d9fb0_2194x992.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Combining the output of multiple attention heads</figcaption></figure></div><p>Now, we have several attention heads that compute self-attention in parallel. However, we still need to produce a single output representation from the multiple heads of our self-attention module. We have several options for combining the output of each attention head; e.g., concatenation, averaging, projecting, and more. However, the vanilla implementation of multi-headed self-attention does the following (depicted above):</p><ul><li><p>Concatenates the output of each head.</p></li><li><p>Linearly projects the concatenated output.</p></li></ul><p>Because each attention head outputs token vectors of dimension <code>d // H</code>, the concatenated output of all attention heads has dimension <code>d</code> . Thus, the multi-headed self-attention operation still preserves the original size of the input.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist128793495\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-causal_self_attention-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-causal_self_attention-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;causal_self_attention.py\&quot;>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>math</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>CausalSelfAttention</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>H</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>T</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        H: number of attention heads</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        T: maximum length of input sequences (in tokens)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-s1>d</span> <span class=pl-c1>%</span> <span class=pl-c1>H</span> <span class=pl-c1>==</span> <span class=pl-c1>0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># key, query, value projections for all heads, but in a batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># output is 3X the dimension because it includes key, query and value</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_attn</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-c1>3</span><span class=pl-c1>*</span><span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># projection of concatenated attention head outputs</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># dropout modules</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>attn_dropout</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Dropout</span>(<span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>resid_dropout</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Dropout</span>(<span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>H</span> <span class=pl-c1>=</span> <span class=pl-c1>H</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>=</span> <span class=pl-s1>d</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># causal mask to ensure that attention is only applied to</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># the left in the input sequence</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>register_buffer</span>(<span class=pl-s>&amp;quot;mask&amp;quot;</span>, <span class=pl-s1>torch</span>.<span class=pl-c1>tril</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>ones</span>(<span class=pl-c1>T</span>, <span class=pl-c1>T</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                                    .<span class=pl-c1>view</span>(<span class=pl-c1>1</span>, <span class=pl-c1>1</span>, <span class=pl-c1>T</span>, <span class=pl-c1>T</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>_</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>size</span>() <span class=pl-c># batch size, sequence length, embedding dimensionality</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute query, key, and value vectors for all heads in batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># split the output into separate query, key, and value tensors</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>q</span>, <span class=pl-s1>k</span>, <span class=pl-s1>v</span>  <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>c_attn</span>(<span class=pl-s1>x</span>).<span class=pl-c1>split</span>(<span class=pl-s1>self</span>.<span class=pl-c1>d</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>2</span>) <span class=pl-c># [B, T, d]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># reshape tensor into sequences of smaller token vectors for each head</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L57\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>k</span> <span class=pl-c1>=</span> <span class=pl-s1>k</span>.<span class=pl-c1>view</span>(<span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>self</span>.<span class=pl-c1>H</span>, <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>//</span> <span class=pl-s1>self</span>.<span class=pl-c1>H</span>).<span class=pl-c1>transpose</span>(<span class=pl-c1>1</span>, <span class=pl-c1>2</span>) <span class=pl-c># [B, H, T, d // H]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>q</span> <span class=pl-c1>=</span> <span class=pl-s1>q</span>.<span class=pl-c1>view</span>(<span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>self</span>.<span class=pl-c1>H</span>, <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>//</span> <span class=pl-s1>self</span>.<span class=pl-c1>H</span>).<span class=pl-c1>transpose</span>(<span class=pl-c1>1</span>, <span class=pl-c1>2</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>v</span> <span class=pl-c1>=</span> <span class=pl-s1>v</span>.<span class=pl-c1>view</span>(<span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>self</span>.<span class=pl-c1>H</span>, <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>//</span> <span class=pl-s1>self</span>.<span class=pl-c1>H</span>).<span class=pl-c1>transpose</span>(<span class=pl-c1>1</span>, <span class=pl-c1>2</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute the attention matrix, perform masking, and apply dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L62\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;62\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC62\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> (<span class=pl-s1>q</span> @ <span class=pl-s1>k</span>.<span class=pl-c1>transpose</span>(<span class=pl-c1>-</span><span class=pl-c1>2</span>, <span class=pl-c1>-</span><span class=pl-c1>1</span>)) <span class=pl-c1>*</span> (<span class=pl-c1>1.0</span> <span class=pl-c1>/</span> <span class=pl-s1>math</span>.<span class=pl-c1>sqrt</span>(<span class=pl-s1>k</span>.<span class=pl-c1>size</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>))) <span class=pl-c># [B, H, T, T]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L63\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;63\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC63\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> <span class=pl-s1>att</span>.<span class=pl-c1>masked_fill</span>(<span class=pl-s1>self</span>.<span class=pl-c1>mask</span>[:,:,:<span class=pl-c1>T</span>,:<span class=pl-c1>T</span>] <span class=pl-c1>==</span> <span class=pl-c1>0</span>, <span class=pl-en>float</span>(<span class=pl-s>&amp;#39;-inf&amp;#39;</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L64\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;64\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC64\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softmax</span>(<span class=pl-s1>att</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L65\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;65\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC65\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>att</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>attn_dropout</span>(<span class=pl-s1>att</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L66\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;66\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC66\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L67\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;67\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC67\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute output vectors for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L68\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;68\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC68\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>y</span> <span class=pl-c1>=</span> <span class=pl-s1>att</span> @ <span class=pl-s1>v</span> <span class=pl-c># [B, H, T, d // H]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L69\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;69\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC69\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L70\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;70\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC70\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># concatenate outputs from each attention head and linearly project</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L71\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;71\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC71\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>y</span> <span class=pl-c1>=</span> <span class=pl-s1>y</span>.<span class=pl-c1>transpose</span>(<span class=pl-c1>1</span>, <span class=pl-c1>2</span>).<span class=pl-c1>contiguous</span>().<span class=pl-c1>view</span>(<span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>self</span>.<span class=pl-c1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L72\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;72\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC72\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>y</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>resid_dropout</span>(<span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span>(<span class=pl-s1>y</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-causal_self_attention-py-L73\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;73\&quot;></td>\n          <td id=\&quot;file-causal_self_attention-py-LC73\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>y</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256/raw/21a836285584d6437e477f035a26c39efdc5f442/causal_self_attention.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256#file-causal_self_attention-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          causal_self_attention.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist128793495" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-causal_self_attention-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="causal_self_attention.py">
        <tbody><tr>
          <td id="file-causal_self_attention-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-causal_self_attention-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-causal_self_attention-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-causal_self_attention-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-causal_self_attention-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-causal_self_attention-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">math</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-causal_self_attention-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-causal_self_attention-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-causal_self_attention-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-causal_self_attention-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-causal_self_attention-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">CausalSelfAttention</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-causal_self_attention-py-LC11" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-causal_self_attention-py-LC12" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-causal_self_attention-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-causal_self_attention-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-causal_self_attention-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">H</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-causal_self_attention-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">T</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-causal_self_attention-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-causal_self_attention-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span>,</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-causal_self_attention-py-LC19" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-causal_self_attention-py-LC20" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-causal_self_attention-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-causal_self_attention-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-causal_self_attention-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        H: number of attention heads</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-causal_self_attention-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        T: maximum length of input sequences (in tokens)</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-causal_self_attention-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-causal_self_attention-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-causal_self_attention-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-causal_self_attention-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-causal_self_attention-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-s1">d</span> <span class="pl-c1">%</span> <span class="pl-c1">H</span> <span class="pl-c1">==</span> <span class="pl-c1">0</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-causal_self_attention-py-LC30" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-causal_self_attention-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># key, query, value projections for all heads, but in a batch</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-causal_self_attention-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># output is 3X the dimension because it includes key, query and value</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-causal_self_attention-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_attn</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-c1">3</span><span class="pl-c1">*</span><span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-causal_self_attention-py-LC34" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-causal_self_attention-py-LC35" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># projection of concatenated attention head outputs</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-causal_self_attention-py-LC36" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-causal_self_attention-py-LC37" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-causal_self_attention-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># dropout modules</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-causal_self_attention-py-LC39" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">attn_dropout</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Dropout</span>(<span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-causal_self_attention-py-LC40" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">resid_dropout</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Dropout</span>(<span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-causal_self_attention-py-LC41" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">H</span> <span class="pl-c1">=</span> <span class="pl-c1">H</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-causal_self_attention-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">=</span> <span class="pl-s1">d</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-causal_self_attention-py-LC43" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-causal_self_attention-py-LC44" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># causal mask to ensure that attention is only applied to</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-causal_self_attention-py-LC45" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># the left in the input sequence</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-causal_self_attention-py-LC46" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">register_buffer</span>(<span class="pl-s">"mask"</span>, <span class="pl-s1">torch</span>.<span class="pl-c1">tril</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">ones</span>(<span class="pl-c1">T</span>, <span class="pl-c1">T</span>))</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-causal_self_attention-py-LC47" class="blob-code blob-code-inner js-file-line">                                    .<span class="pl-c1">view</span>(<span class="pl-c1">1</span>, <span class="pl-c1">1</span>, <span class="pl-c1">T</span>, <span class="pl-c1">T</span>))</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-causal_self_attention-py-LC48" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-causal_self_attention-py-LC49" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-causal_self_attention-py-LC50" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">_</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">size</span>() <span class="pl-c"># batch size, sequence length, embedding dimensionality</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-causal_self_attention-py-LC51" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-causal_self_attention-py-LC52" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute query, key, and value vectors for all heads in batch</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-causal_self_attention-py-LC53" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># split the output into separate query, key, and value tensors</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-causal_self_attention-py-LC54" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">q</span>, <span class="pl-s1">k</span>, <span class="pl-s1">v</span>  <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">c_attn</span>(<span class="pl-s1">x</span>).<span class="pl-c1">split</span>(<span class="pl-s1">self</span>.<span class="pl-c1">d</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">2</span>) <span class="pl-c"># [B, T, d]</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-causal_self_attention-py-LC55" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-causal_self_attention-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># reshape tensor into sequences of smaller token vectors for each head</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-causal_self_attention-py-LC57" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">k</span> <span class="pl-c1">=</span> <span class="pl-s1">k</span>.<span class="pl-c1">view</span>(<span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">self</span>.<span class="pl-c1">H</span>, <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">//</span> <span class="pl-s1">self</span>.<span class="pl-c1">H</span>).<span class="pl-c1">transpose</span>(<span class="pl-c1">1</span>, <span class="pl-c1">2</span>) <span class="pl-c"># [B, H, T, d // H]</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-causal_self_attention-py-LC58" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">q</span> <span class="pl-c1">=</span> <span class="pl-s1">q</span>.<span class="pl-c1">view</span>(<span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">self</span>.<span class="pl-c1">H</span>, <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">//</span> <span class="pl-s1">self</span>.<span class="pl-c1">H</span>).<span class="pl-c1">transpose</span>(<span class="pl-c1">1</span>, <span class="pl-c1">2</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-causal_self_attention-py-LC59" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">v</span> <span class="pl-c1">=</span> <span class="pl-s1">v</span>.<span class="pl-c1">view</span>(<span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">self</span>.<span class="pl-c1">H</span>, <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">//</span> <span class="pl-s1">self</span>.<span class="pl-c1">H</span>).<span class="pl-c1">transpose</span>(<span class="pl-c1">1</span>, <span class="pl-c1">2</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-causal_self_attention-py-LC60" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-causal_self_attention-py-LC61" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute the attention matrix, perform masking, and apply dropout</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L62" class="blob-num js-line-number js-blob-rnum" data-line-number="62"></td>
          <td id="file-causal_self_attention-py-LC62" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> (<span class="pl-s1">q</span> @ <span class="pl-s1">k</span>.<span class="pl-c1">transpose</span>(<span class="pl-c1">-</span><span class="pl-c1">2</span>, <span class="pl-c1">-</span><span class="pl-c1">1</span>)) <span class="pl-c1">*</span> (<span class="pl-c1">1.0</span> <span class="pl-c1">/</span> <span class="pl-s1">math</span>.<span class="pl-c1">sqrt</span>(<span class="pl-s1">k</span>.<span class="pl-c1">size</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>))) <span class="pl-c"># [B, H, T, T]</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L63" class="blob-num js-line-number js-blob-rnum" data-line-number="63"></td>
          <td id="file-causal_self_attention-py-LC63" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> <span class="pl-s1">att</span>.<span class="pl-c1">masked_fill</span>(<span class="pl-s1">self</span>.<span class="pl-c1">mask</span>[:,:,:<span class="pl-c1">T</span>,:<span class="pl-c1">T</span>] <span class="pl-c1">==</span> <span class="pl-c1">0</span>, <span class="pl-en">float</span>(<span class="pl-s">'-inf'</span>))</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L64" class="blob-num js-line-number js-blob-rnum" data-line-number="64"></td>
          <td id="file-causal_self_attention-py-LC64" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softmax</span>(<span class="pl-s1">att</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L65" class="blob-num js-line-number js-blob-rnum" data-line-number="65"></td>
          <td id="file-causal_self_attention-py-LC65" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">att</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">attn_dropout</span>(<span class="pl-s1">att</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L66" class="blob-num js-line-number js-blob-rnum" data-line-number="66"></td>
          <td id="file-causal_self_attention-py-LC66" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L67" class="blob-num js-line-number js-blob-rnum" data-line-number="67"></td>
          <td id="file-causal_self_attention-py-LC67" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute output vectors for each token</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L68" class="blob-num js-line-number js-blob-rnum" data-line-number="68"></td>
          <td id="file-causal_self_attention-py-LC68" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">y</span> <span class="pl-c1">=</span> <span class="pl-s1">att</span> @ <span class="pl-s1">v</span> <span class="pl-c"># [B, H, T, d // H]</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L69" class="blob-num js-line-number js-blob-rnum" data-line-number="69"></td>
          <td id="file-causal_self_attention-py-LC69" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L70" class="blob-num js-line-number js-blob-rnum" data-line-number="70"></td>
          <td id="file-causal_self_attention-py-LC70" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># concatenate outputs from each attention head and linearly project</span></td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L71" class="blob-num js-line-number js-blob-rnum" data-line-number="71"></td>
          <td id="file-causal_self_attention-py-LC71" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">y</span> <span class="pl-c1">=</span> <span class="pl-s1">y</span>.<span class="pl-c1">transpose</span>(<span class="pl-c1">1</span>, <span class="pl-c1">2</span>).<span class="pl-c1">contiguous</span>().<span class="pl-c1">view</span>(<span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">self</span>.<span class="pl-c1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L72" class="blob-num js-line-number js-blob-rnum" data-line-number="72"></td>
          <td id="file-causal_self_attention-py-LC72" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">y</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">resid_dropout</span>(<span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span>(<span class="pl-s1">y</span>))</td>
        </tr>
        <tr>
          <td id="file-causal_self_attention-py-L73" class="blob-num js-line-number js-blob-rnum" data-line-number="73"></td>
          <td id="file-causal_self_attention-py-LC73" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">y</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256/raw/21a836285584d6437e477f035a26c39efdc5f442/causal_self_attention.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/26863dbbc322b15d2e224a2569868256#file-causal_self_attention-py" class="Link--inTextBlock">
          causal_self_attention.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Full implementation.</strong> A full implementation of masked multi-headed self-attention is provided above. Here, we go beyond a single input sequence of size <code>[C, d]</code> and process a batch of inputs of size <code>[B, C, d]</code>. The above code implements each of the components that we have described so far:</p><ul><li><p><em>Lines 52-59</em>: compute key, query and value projections (using a single linear projection) for each attention head and split / reshape them as necessary.</p></li><li><p><em>Lines 62-65</em>: compute attention scores, mask the attention scores, then apply a softmax transformation to the result<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p></li><li><p><em>Line 68</em>: compute output vectors by taking the product of the attention matrix and the value matrix. </p></li><li><p><em>Lines 71-72</em>: concatenate the outputs from each attention head and apply a linear projection to form the final output.</p></li></ul><p>Although we use some fancy matrix manipulations and operations in PyTorch, this implementation exactly matches our description of masked self-attention!</p><h4>Feed-Forward Transformation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nuO_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nuO_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 424w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 848w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 1272w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nuO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png" width="1456" height="439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161646,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nuO_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 424w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 848w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 1272w, https://substackcdn.com/image/fetch/$s_!nuO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F252f6acf-2ef1-4531-8ce4-2dce7778f1a0_1870x564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Pointwise feed-forward transformation</figcaption></figure></div><p>In addition to masked self-attention, each block of the transformer contains a pointwise<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> feed-forward transformation; see above. This transformation passes each token vector within the sequence through the same feed-forward neural network. Usually, this is a two-layer network with a non-linear activation (e.g., <a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html">ReLU</a>, <a href="https://pytorch.org/docs/stable/generated/torch.nn.GELU.html">GeLU </a>or SwiGLU [3]) in the hidden layer. In most cases, the dimension of the hidden layer is larger than the original dimension of our token embeddings (e.g., by 4&#215;). Implementing a feed-forward neural network in PyTorch is easy to accomplish with the <a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">Linear module</a>; see below for an example. </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist128793760\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-transformer_ffnn-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-transformer_ffnn-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;transformer_ffnn.py\&quot;>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-c1>MLP</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_fc</span>    <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-c1>4</span> <span class=pl-c1>*</span> <span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>gelu</span>    <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>GELU</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span>  <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-c1>4</span> <span class=pl-c1>*</span> <span class=pl-s1>d</span>, <span class=pl-s1>d</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>dropout</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Dropout</span>(<span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>c_fc</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>gelu</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>dropout</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-transformer_ffnn-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-transformer_ffnn-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>x</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/3ed9274a0297aab403b5e2d2254ee0ac/raw/77e99ec9495603504be2169fa962ffe0a7b9cf31/transformer_ffnn.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/3ed9274a0297aab403b5e2d2254ee0ac#file-transformer_ffnn-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          transformer_ffnn.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist128793760" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-transformer_ffnn-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="transformer_ffnn.py">
        <tbody><tr>
          <td id="file-transformer_ffnn-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-transformer_ffnn-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-transformer_ffnn-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-transformer_ffnn-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-transformer_ffnn-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-transformer_ffnn-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-transformer_ffnn-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-transformer_ffnn-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-c1">MLP</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-transformer_ffnn-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-transformer_ffnn-py-LC9" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-transformer_ffnn-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-transformer_ffnn-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-transformer_ffnn-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-transformer_ffnn-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-transformer_ffnn-py-LC14" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-transformer_ffnn-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-transformer_ffnn-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-transformer_ffnn-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-transformer_ffnn-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-transformer_ffnn-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-transformer_ffnn-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-transformer_ffnn-py-LC21" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-transformer_ffnn-py-LC22" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-transformer_ffnn-py-LC23" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_fc</span>    <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-c1">4</span> <span class="pl-c1">*</span> <span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-transformer_ffnn-py-LC24" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">gelu</span>    <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">GELU</span>()</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-transformer_ffnn-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span>  <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-c1">4</span> <span class="pl-c1">*</span> <span class="pl-s1">d</span>, <span class="pl-s1">d</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-transformer_ffnn-py-LC26" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">dropout</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Dropout</span>(<span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-transformer_ffnn-py-LC27" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-transformer_ffnn-py-LC28" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-transformer_ffnn-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">c_fc</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-transformer_ffnn-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">gelu</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-transformer_ffnn-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-transformer_ffnn-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">dropout</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-transformer_ffnn-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-transformer_ffnn-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">x</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/3ed9274a0297aab403b5e2d2254ee0ac/raw/77e99ec9495603504be2169fa962ffe0a7b9cf31/transformer_ffnn.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/3ed9274a0297aab403b5e2d2254ee0ac#file-transformer_ffnn-py" class="Link--inTextBlock">
          transformer_ffnn.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><h4>Decoder-Only Transformer Block</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xowv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xowv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 424w, https://substackcdn.com/image/fetch/$s_!xowv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 848w, https://substackcdn.com/image/fetch/$s_!xowv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 1272w, https://substackcdn.com/image/fetch/$s_!xowv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xowv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png" width="386" height="472.9139072847682" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1110,&quot;width&quot;:906,&quot;resizeWidth&quot;:386,&quot;bytes&quot;:92079,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xowv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 424w, https://substackcdn.com/image/fetch/$s_!xowv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 848w, https://substackcdn.com/image/fetch/$s_!xowv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 1272w, https://substackcdn.com/image/fetch/$s_!xowv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1da32a13-6bcf-4b1a-a276-fad3f4315c58_906x1110.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Decoder-only Transformer Block</figcaption></figure></div><p>To construct a decoder-only transformer block, we use both components&#8212;<em>masked self-attention and a feed-forward transformation</em>&#8212;that we have seen so far, as well as place normalization operations and residual connections between components. A depiction of the full decoder-only transformer block<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> is shown above.</p><p>A <strong>residual connection</strong> [4] simply adds the input for a neural network layer to the output for that layer before passing this representation to the next layer&#8212;<em>as opposed to solely passing the layer&#8217;s output to the next layer without adding the input</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!46M7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!46M7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 424w, https://substackcdn.com/image/fetch/$s_!46M7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 848w, https://substackcdn.com/image/fetch/$s_!46M7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 1272w, https://substackcdn.com/image/fetch/$s_!46M7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!46M7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png" width="426" height="241.28215767634856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:964,&quot;resizeWidth&quot;:426,&quot;bytes&quot;:63167,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!46M7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 424w, https://substackcdn.com/image/fetch/$s_!46M7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 848w, https://substackcdn.com/image/fetch/$s_!46M7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 1272w, https://substackcdn.com/image/fetch/$s_!46M7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20382740-62ff-43e2-b77c-a4cece72fa48_964x546.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Residual connection in a generic neural network layer</figcaption></figure></div><p>Residual connections are widely used within deep learning and can be applied to any kind of neural network layer<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Adding residual connections helps to avoid issues with <a href="https://www.geeksforgeeks.org/vanishing-and-exploding-gradients-problems-in-deep-learning/">vanishing / exploding gradients</a> and generally improves the stability of training by providing a &#8220;short cut&#8221; that allows gradients to flow freely through the network during backpropagation; see <a href="https://arxiv.org/abs/1712.09913">here</a> for more details. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qaYL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qaYL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 424w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 848w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 1272w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qaYL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png" width="490" height="158.84615384615384" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:472,&quot;width&quot;:1456,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qaYL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 424w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 848w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 1272w, https://substackcdn.com/image/fetch/$s_!qaYL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff125254a-28af-43c4-80e0-d2273b1702c9_1888x612.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Layer normalization with an affine transformation</figcaption></figure></div><p><strong>Normalizing</strong> the input (or output) of a neural network layer can also aid training stability. Although <a href="https://cameronrwolfe.substack.com/i/142044446/layer-normalization">many types of normalization</a> exist, the most commonly used normalization variant for transformers / LLMs is layer normalization; see above. Here, the normalization operation has two components:</p><ol><li><p>Performing normalization.</p></li><li><p>Applying a (learnable) affine transformation.</p></li></ol><p>In other words, we multiply the normalized values by weight and add a bias instead of directly using the normalized output. Both the weight and bias are learnable parameters that can be trained along with other network parameters. Layer normalization is implemented in PyTorch and easy to use; see <a href="https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">here</a>.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist128793802\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-decoder_only_block-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-decoder_only_block-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;decoder_only_block.py\&quot;>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Block</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>H</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>T</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        H: number of attention heads</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        T: maximum length of input sequences (in tokens)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>ln_1</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>LayerNorm</span>(<span class=pl-s1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>attn</span> <span class=pl-c1>=</span> <span class=pl-en>CausalSelfAttention</span>(<span class=pl-s1>d</span>, <span class=pl-c1>H</span>, <span class=pl-c1>T</span>, <span class=pl-s1>bias</span>, <span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>ln_2</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>LayerNorm</span>(<span class=pl-s1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>ffnn</span> <span class=pl-c1>=</span> <span class=pl-en>MLP</span>(<span class=pl-s1>d</span>, <span class=pl-s1>bias</span>, <span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span> <span class=pl-c1>+</span> <span class=pl-s1>self</span>.<span class=pl-c1>attn</span>(<span class=pl-s1>self</span>.<span class=pl-c1>ln_1</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span> <span class=pl-c1>+</span> <span class=pl-s1>self</span>.<span class=pl-c1>ffnn</span>(<span class=pl-s1>self</span>.<span class=pl-c1>ln_2</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-decoder_only_block-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-decoder_only_block-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>x</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/0ad044748283c90b4d3002bdc5dbc674/raw/a8979bf0de7b5b41f3c39897d581343de3bc05fc/decoder_only_block.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/0ad044748283c90b4d3002bdc5dbc674#file-decoder_only_block-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          decoder_only_block.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist128793802" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-decoder_only_block-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="decoder_only_block.py">
        <tbody><tr>
          <td id="file-decoder_only_block-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-decoder_only_block-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-decoder_only_block-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-decoder_only_block-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-decoder_only_block-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-decoder_only_block-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-decoder_only_block-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-decoder_only_block-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Block</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-decoder_only_block-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-decoder_only_block-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-decoder_only_block-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-decoder_only_block-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">H</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-decoder_only_block-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">T</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-decoder_only_block-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-decoder_only_block-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span>,</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-decoder_only_block-py-LC15" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-decoder_only_block-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-decoder_only_block-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-decoder_only_block-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-decoder_only_block-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        H: number of attention heads</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-decoder_only_block-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        T: maximum length of input sequences (in tokens)</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-decoder_only_block-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-decoder_only_block-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-decoder_only_block-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-decoder_only_block-py-LC24" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-decoder_only_block-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-decoder_only_block-py-LC26" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">ln_1</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">LayerNorm</span>(<span class="pl-s1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-decoder_only_block-py-LC27" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">attn</span> <span class="pl-c1">=</span> <span class="pl-en">CausalSelfAttention</span>(<span class="pl-s1">d</span>, <span class="pl-c1">H</span>, <span class="pl-c1">T</span>, <span class="pl-s1">bias</span>, <span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-decoder_only_block-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">ln_2</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">LayerNorm</span>(<span class="pl-s1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-decoder_only_block-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">ffnn</span> <span class="pl-c1">=</span> <span class="pl-en">MLP</span>(<span class="pl-s1">d</span>, <span class="pl-s1">bias</span>, <span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-decoder_only_block-py-LC30" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-decoder_only_block-py-LC31" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-decoder_only_block-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span> <span class="pl-c1">+</span> <span class="pl-s1">self</span>.<span class="pl-c1">attn</span>(<span class="pl-s1">self</span>.<span class="pl-c1">ln_1</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-decoder_only_block-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span> <span class="pl-c1">+</span> <span class="pl-s1">self</span>.<span class="pl-c1">ffnn</span>(<span class="pl-s1">self</span>.<span class="pl-c1">ln_2</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-decoder_only_block-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-decoder_only_block-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">x</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/0ad044748283c90b4d3002bdc5dbc674/raw/a8979bf0de7b5b41f3c39897d581343de3bc05fc/decoder_only_block.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/0ad044748283c90b4d3002bdc5dbc674#file-decoder_only_block-py" class="Link--inTextBlock">
          decoder_only_block.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Block implementation.</strong> A decoder-only transformer block implementation is provided above. Here, we use our prior attention and feed-forward transformation implementations. By using the modules we have already defined, the decoder-only transformer block implementation actually becomes quite simple! </p><h4>Decoder-only Transformer Architecture</h4><p>Once we grasp the input and block structure of the decoder-only transformer, the rest of the architecture is pretty simple&#8212;<em>we just repeat the same block </em><code>L</code><em> times</em>! For each block, the size of the model&#8217;s input <code>[B, C, d]</code> is maintained, so the output of our <code>L</code>-th decoder-only transformer block is also a tensor of this size; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tePi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tePi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 424w, https://substackcdn.com/image/fetch/$s_!tePi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 848w, https://substackcdn.com/image/fetch/$s_!tePi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 1272w, https://substackcdn.com/image/fetch/$s_!tePi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tePi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png" width="1456" height="783" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209399,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tePi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 424w, https://substackcdn.com/image/fetch/$s_!tePi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 848w, https://substackcdn.com/image/fetch/$s_!tePi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 1272w, https://substackcdn.com/image/fetch/$s_!tePi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98768d59-2bb6-442d-a84d-4fc9e5f1dd9f_1736x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Predicting next tokens with an LLM</figcaption></figure></div><p>A full implementation of a (GPT-style) decoder-only transformer architecture is provided below. Here, the architecture contains several components, including two embedding layers (i.e., for tokens and positions), all <code>L</code> transformer blocks, and a final classification module&#8212;<em>including layer normalization and a linear layer</em>&#8212;for performing next token prediction given an output token embedding as input. The model operates by just passing its input&#8212;<em>a set of input token IDs with size </em><code>[B, C]</code>&#8212;through each of these components to produce a set of output token IDs. </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist128793913\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-gpt-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-gpt-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;gpt.py\&quot;>\n        <tr>\n          <td id=\&quot;file-gpt-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-c1>GPT</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(<span class=pl-s1>self</span>, </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>H</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>C</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>V</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>layers</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        H: number of attention heads</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        C: maximum length of input sequences (in tokens)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        V: size of the token vocabulary</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        layers: number of decoder-only blocks</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>ModuleDict</span>(<span class=pl-en>dict</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>wte</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>Embedding</span>(<span class=pl-c1>V</span>, <span class=pl-s1>d</span>), <span class=pl-c># token embeddings</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>wpe</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>Embedding</span>(<span class=pl-c1>C</span>, <span class=pl-s1>d</span>), <span class=pl-c># position embeddings</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>drop</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>Dropout</span>(<span class=pl-s1>dropout</span>),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>blocks</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>ModuleList</span>([<span class=pl-en>Block</span>(<span class=pl-s1>d</span>, <span class=pl-c1>H</span>, <span class=pl-c1>C</span>, <span class=pl-s1>bias</span>, <span class=pl-s1>dropout</span>) <span class=pl-k>for</span> <span class=pl-s1>_</span> <span class=pl-c1>in</span> <span class=pl-en>range</span>(<span class=pl-s1>layers</span>)]),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>ln_f</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>LayerNorm</span>(<span class=pl-s1>d</span>),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>head</span><span class=pl-c1>=</span><span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-c1>V</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        ))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>idx</span>, <span class=pl-s1>targets</span><span class=pl-c1>=</span><span class=pl-c1>None</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># idx is a [B, C] matrix of token indices</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># targets is a [B, C] matrix of target (next) token indices</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>device</span> <span class=pl-c1>=</span> <span class=pl-s1>idx</span>.<span class=pl-c1>device</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>_</span>, <span class=pl-c1>C</span> <span class=pl-c1>=</span> <span class=pl-s1>idx</span>.<span class=pl-c1>size</span>() <span class=pl-c># [B, C]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>pos</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>arange</span>(<span class=pl-c1>0</span>, <span class=pl-c1>C</span>, <span class=pl-s1>dtype</span><span class=pl-c1>=</span><span class=pl-s1>torch</span>.<span class=pl-c1>long</span>, <span class=pl-s1>device</span><span class=pl-c1>=</span><span class=pl-s1>device</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># generate token and position embeddings</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>tok_emb</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>wte</span>(<span class=pl-s1>idx</span>) <span class=pl-c># [B, C, d]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>pos_emb</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>wpe</span>(<span class=pl-s1>pos</span>) <span class=pl-c># [C, d]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>drop</span>(<span class=pl-s1>tok_emb</span> <span class=pl-c1>+</span> <span class=pl-s1>pos_emb</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># pass through all decoder-only blocks</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>for</span> <span class=pl-s1>block</span> <span class=pl-c1>in</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>blocks</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-en>block</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>ln_f</span>(<span class=pl-s1>x</span>) <span class=pl-c># final layer norm</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L57\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>targets</span> <span class=pl-c1><span class=pl-c1>is</span> <span class=pl-c1>not</span></span> <span class=pl-c1>None</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c># compute the loss if we are given targets</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>logits</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>head</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>loss</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>cross_entropy</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L62\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;62\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC62\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>logits</span>.<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-s1>logits</span>.<span class=pl-c1>size</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>)),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L63\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;63\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC63\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>targets</span>.<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>),</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L64\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;64\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC64\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>                <span class=pl-s1>ignore_index</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L65\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;65\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC65\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L66\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;66\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC66\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>else</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L67\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;67\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC67\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c># only look at last token if performing inference</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L68\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;68\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC68\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>logits</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>transformer</span>.<span class=pl-c1>head</span>(<span class=pl-s1>x</span>[:, [<span class=pl-c1>-</span><span class=pl-c1>1</span>], :])</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L69\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;69\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC69\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>loss</span> <span class=pl-c1>=</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L70\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;70\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC70\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-gpt-py-L71\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;71\&quot;></td>\n          <td id=\&quot;file-gpt-py-LC71\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>logits</span>, <span class=pl-s1>loss</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/f574c5c9a61f3b3a045b2cbd9593cfd7/raw/7b3da75222abaa71427f40e8cc3dc13f03c4adc3/gpt.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/f574c5c9a61f3b3a045b2cbd9593cfd7#file-gpt-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          gpt.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist128793913" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-gpt-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="gpt.py">
        <tbody><tr>
          <td id="file-gpt-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-gpt-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-gpt-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Source: https://github.com/karpathy/nanoGPT/blob/master/model.py</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-gpt-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-gpt-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-gpt-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-gpt-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-gpt-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-gpt-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-gpt-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-c1">GPT</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-gpt-py-LC10" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-gpt-py-LC11" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(<span class="pl-s1">self</span>, </td>
        </tr>
        <tr>
          <td id="file-gpt-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-gpt-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-gpt-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">H</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-gpt-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">C</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-gpt-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">V</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-gpt-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">layers</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-gpt-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-gpt-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-gpt-py-LC19" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-gpt-py-LC20" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-gpt-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-gpt-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-gpt-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        H: number of attention heads</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-gpt-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        C: maximum length of input sequences (in tokens)</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-gpt-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        V: size of the token vocabulary</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-gpt-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        layers: number of decoder-only blocks</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-gpt-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-gpt-py-LC28" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-gpt-py-LC29" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-gpt-py-LC30" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-gpt-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-gpt-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">ModuleDict</span>(<span class="pl-en">dict</span>(</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-gpt-py-LC33" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">wte</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">Embedding</span>(<span class="pl-c1">V</span>, <span class="pl-s1">d</span>), <span class="pl-c"># token embeddings</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-gpt-py-LC34" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">wpe</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">Embedding</span>(<span class="pl-c1">C</span>, <span class="pl-s1">d</span>), <span class="pl-c"># position embeddings</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-gpt-py-LC35" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">drop</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">Dropout</span>(<span class="pl-s1">dropout</span>),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-gpt-py-LC36" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">blocks</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">ModuleList</span>([<span class="pl-en">Block</span>(<span class="pl-s1">d</span>, <span class="pl-c1">H</span>, <span class="pl-c1">C</span>, <span class="pl-s1">bias</span>, <span class="pl-s1">dropout</span>) <span class="pl-k">for</span> <span class="pl-s1">_</span> <span class="pl-c1">in</span> <span class="pl-en">range</span>(<span class="pl-s1">layers</span>)]),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-gpt-py-LC37" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">ln_f</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">LayerNorm</span>(<span class="pl-s1">d</span>),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-gpt-py-LC38" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">head</span><span class="pl-c1">=</span><span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-c1">V</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-gpt-py-LC39" class="blob-code blob-code-inner js-file-line">        ))</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-gpt-py-LC40" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-gpt-py-LC41" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">idx</span>, <span class="pl-s1">targets</span><span class="pl-c1">=</span><span class="pl-c1">None</span>):</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-gpt-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># idx is a [B, C] matrix of token indices</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-gpt-py-LC43" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># targets is a [B, C] matrix of target (next) token indices</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-gpt-py-LC44" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">device</span> <span class="pl-c1">=</span> <span class="pl-s1">idx</span>.<span class="pl-c1">device</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-gpt-py-LC45" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">_</span>, <span class="pl-c1">C</span> <span class="pl-c1">=</span> <span class="pl-s1">idx</span>.<span class="pl-c1">size</span>() <span class="pl-c"># [B, C]</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-gpt-py-LC46" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">pos</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">arange</span>(<span class="pl-c1">0</span>, <span class="pl-c1">C</span>, <span class="pl-s1">dtype</span><span class="pl-c1">=</span><span class="pl-s1">torch</span>.<span class="pl-c1">long</span>, <span class="pl-s1">device</span><span class="pl-c1">=</span><span class="pl-s1">device</span>)</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-gpt-py-LC47" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-gpt-py-LC48" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># generate token and position embeddings</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-gpt-py-LC49" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">tok_emb</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">wte</span>(<span class="pl-s1">idx</span>) <span class="pl-c"># [B, C, d]</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-gpt-py-LC50" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">pos_emb</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">wpe</span>(<span class="pl-s1">pos</span>) <span class="pl-c"># [C, d]</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-gpt-py-LC51" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">drop</span>(<span class="pl-s1">tok_emb</span> <span class="pl-c1">+</span> <span class="pl-s1">pos_emb</span>)</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-gpt-py-LC52" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-gpt-py-LC53" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># pass through all decoder-only blocks</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-gpt-py-LC54" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">for</span> <span class="pl-s1">block</span> <span class="pl-c1">in</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">blocks</span>:</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-gpt-py-LC55" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-en">block</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-gpt-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">ln_f</span>(<span class="pl-s1">x</span>) <span class="pl-c"># final layer norm</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-gpt-py-LC57" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-gpt-py-LC58" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">targets</span> <span class="pl-c1"><span class="pl-c1">is</span> <span class="pl-c1">not</span></span> <span class="pl-c1">None</span>:</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-gpt-py-LC59" class="blob-code blob-code-inner js-file-line">            <span class="pl-c"># compute the loss if we are given targets</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-gpt-py-LC60" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">logits</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">head</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-gpt-py-LC61" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">cross_entropy</span>(</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L62" class="blob-num js-line-number js-blob-rnum" data-line-number="62"></td>
          <td id="file-gpt-py-LC62" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">logits</span>.<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-s1">logits</span>.<span class="pl-c1">size</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>)),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L63" class="blob-num js-line-number js-blob-rnum" data-line-number="63"></td>
          <td id="file-gpt-py-LC63" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">targets</span>.<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>),</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L64" class="blob-num js-line-number js-blob-rnum" data-line-number="64"></td>
          <td id="file-gpt-py-LC64" class="blob-code blob-code-inner js-file-line">                <span class="pl-s1">ignore_index</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>,</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L65" class="blob-num js-line-number js-blob-rnum" data-line-number="65"></td>
          <td id="file-gpt-py-LC65" class="blob-code blob-code-inner js-file-line">            )</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L66" class="blob-num js-line-number js-blob-rnum" data-line-number="66"></td>
          <td id="file-gpt-py-LC66" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">else</span>:</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L67" class="blob-num js-line-number js-blob-rnum" data-line-number="67"></td>
          <td id="file-gpt-py-LC67" class="blob-code blob-code-inner js-file-line">            <span class="pl-c"># only look at last token if performing inference</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L68" class="blob-num js-line-number js-blob-rnum" data-line-number="68"></td>
          <td id="file-gpt-py-LC68" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">logits</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">transformer</span>.<span class="pl-c1">head</span>(<span class="pl-s1">x</span>[:, [<span class="pl-c1">-</span><span class="pl-c1">1</span>], :])</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L69" class="blob-num js-line-number js-blob-rnum" data-line-number="69"></td>
          <td id="file-gpt-py-LC69" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">loss</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-gpt-py-L70" class="blob-num js-line-number js-blob-rnum" data-line-number="70"></td>
          <td id="file-gpt-py-LC70" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-gpt-py-L71" class="blob-num js-line-number js-blob-rnum" data-line-number="71"></td>
          <td id="file-gpt-py-LC71" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">logits</span>, <span class="pl-s1">loss</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/f574c5c9a61f3b3a045b2cbd9593cfd7/raw/7b3da75222abaa71427f40e8cc3dc13f03c4adc3/gpt.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/f574c5c9a61f3b3a045b2cbd9593cfd7#file-gpt-py" class="Link--inTextBlock">
          gpt.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Generating output (decoding).</strong> LLMs are trained specifically to perform <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">next-token prediction</a>. In other words, these models are specialists in predicting the next token given a list of tokens as input. As we have learned, the model&#8217;s output is just a list of output token vectors corresponding to each input token. So, we can predict the next token for any of these inputs tokens by:</p><ol><li><p>Taking the output embedding for a particular token.</p></li><li><p>Passing this embedding through a linear layer, where the output size is the dimension of the model&#8217;s vocabulary.</p></li><li><p>Taking an <a href="https://pytorch.org/docs/main/generated/torch.argmax.html">argmax</a> of the model&#8217;s output to get the maximum token ID. </p></li></ol><p>To generate a sequence of text, we just continue to repeat this process. We ingest a textual prompt as input, pass everything through the decoder-only transformer, take the last token vector in our output sequence, predict the next token, add this next token to our input sequence and repeat. This autoregressive decoding process is used by all LLMs to generate their output; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iP3N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iP3N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 424w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 848w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iP3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png" width="1456" height="729" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:729,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:196538,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iP3N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 424w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 848w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!iP3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a5f56ab-06e4-44cd-a67e-9bdcb1637d72_2308x1156.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Autoregressive output generation with next token prediction</figcaption></figure></div><p><strong>Why the decoder? </strong>Now that we understand this architecture, we might wonder: <em>Why do LLMs only use the decoder component of the transformer?</em> The key distinction between the encoder and decoder for a transformer is the type of attention that is used. The encoder uses bidirectional self-attention, meaning all tokens in the sequence&#8212;<em>including those before and after a given token</em>&#8212;are considered by the self-attention mechanism. In contrast, the decoder uses masked self-attention, which prevents tokens from attending to those that follow them in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hoA4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hoA4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 424w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 848w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 1272w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hoA4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png" width="1456" height="620" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:620,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hoA4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 424w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 848w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 1272w, https://substackcdn.com/image/fetch/$s_!hoA4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6b8655b-6153-4098-afa3-ffc7871c281a_1962x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Causal mask for next token prediction</figcaption></figure></div><p>Due to the use of masked self-attention, decoders work well for next token prediction. If each token can look forward in the sequence when crafting its representation, then the model could simply learn to predict next tokens by cheating (i.e., directly copying the next token in the sequence); see above. Masked self-attention forces the model to learn generalizable patterns for predicting next tokens from those that come before them, <em>making the decoder perfect for LLMs</em>. </p><h2>Creating a Mixture-of-Experts (MoE) Model</h2><blockquote><p><em>&#8220;In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model&#8212;with an outrageous number of parameters&#8212;but a constant computational cost.&#8221;</em> - from [6]</p></blockquote><p>Now that we have an in-depth understanding of decoder-only transformers, we need to create a Mixture-of-Experts (MoE) model. MoE-based LLMs maintain the same decoder-only transformer architecture, but they modify this architecture in a few subtle ways. See the posts below for an in-depth coverage of these ideas.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;445b8401-37e2-460a-a9c5-06303aeb8cf8&quot;,&quot;caption&quot;:&quot;Modern advancements in large language models (LLMs) are mostly a product of scaling laws [6]. As we increase the size of the underlying model, we see a smooth increase in performance, assuming that the model is trained over a sufficiently large dataset [7]. Such scaling laws eventually led us to the creation of GPT-3, as well as other (more powerful) LL&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Mixture-of-Experts (MoE): The Birth and Rise of Conditional Computation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-18T08:33:09.327Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a419dd3a-57da-4a9a-a446-31ce4b001a7d_2398x1346.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/conditional-computation-the-birth&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142423094,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:96,&quot;comment_count&quot;:16,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8f83c9a6-268b-4e10-8102-a038196529c0&quot;,&quot;caption&quot;:&quot;In an area of study that is rapidly changing, the decoder-only transformer architecture has remained one of the few enduring staples in large language model (LLM) research. This architecture has been used since the proposal of the original GPT model and has remained largely unchanged, aside from minor tweaks to improve efficiency. One o&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Mixture-of-Experts (MoE) LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-27T10:33:48.037Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fdf1382-38dc-45fc-a741-b62babfd99c5_2258x1268.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/moe-llms&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154340424,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:181,&quot;comment_count&quot;:10,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Converting the model architecture to an MoE is not that difficult, but there are a lot of small details that must be implemented correctly for the model to work well. Additionally, training these models properly requires some extra attention and understanding&#8212;<em>MoE models are more difficult to train than a standard LLM</em>. </p><h4>Expert Layers</h4><p>Compared to the standard decoder-only transformer, the main modification made by an MoE model is within the feed-forward component of the transformer block. Usually, this block has one feed-forward network that is applied in a pointwise fashion across all token vectors. Instead of having a single feed-forward network, an MoE creates several feed-forward networks, <em>each with their own independent weights</em>. We refer to each of these networks as an &#8220;expert&#8221;, and a feed-forward layer with several experts is called an &#8220;expert layer&#8221;. If we have <code>N</code> experts in a layer, we can refer to the <code>i</code>-th expert using the notation <code>E_i</code>; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JOdT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JOdT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 424w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 848w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 1272w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JOdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png" width="364" height="44.70175438596491" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a99797b-4392-421b-82b0-62932d968217_684x84.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:84,&quot;width&quot;:684,&quot;resizeWidth&quot;:364,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JOdT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 424w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 848w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 1272w, https://substackcdn.com/image/fetch/$s_!JOdT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a99797b-4392-421b-82b0-62932d968217_684x84.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>PyTorch Implementation.</strong> Implementing an expert layer in PyTorch is not that complicated. As shown below, we just use our same implementation from before, but create several feed-forward networks instead of one. The main complexity to this implementation is that we do not use standard <a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html">Linear</a> layers in PyTorch. Instead, we wrap the weights of all experts into several <a href="https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html">Parameter</a> objects so that we can compute the output of all experts in batch by using the <a href="https://pytorch.org/docs/stable/generated/torch.bmm.html">batch matrix multiplication</a> operator. This implementation avoids having to loop over each expert to compute its output, which drastically improves efficiency.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136644786\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-expert_layer-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-expert_layer-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;expert_layer.py\&quot;>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Based upon ColossalAI OpenMoE: https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/moe/experts.py</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>MLPExperts</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>n_exp</span><span class=pl-c1>=</span><span class=pl-c1>8</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        n_exp: the number of experts to create in the expert layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-c1>=</span> <span class=pl-s1>bias</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_fc</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Parameter</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>empty</span>(<span class=pl-s1>n_exp</span>, <span class=pl-s1>d</span>, <span class=pl-c1>4</span> <span class=pl-c1>*</span> <span class=pl-s1>d</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Parameter</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>empty</span>(<span class=pl-s1>n_exp</span>, <span class=pl-c1>4</span> <span class=pl-c1>*</span> <span class=pl-s1>d</span>, <span class=pl-s1>d</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>fc_bias</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Parameter</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>empty</span>(<span class=pl-s1>n_exp</span>, <span class=pl-c1>1</span>, <span class=pl-c1>4</span> <span class=pl-c1>*</span> <span class=pl-s1>d</span>)) <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-k>else</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>proj_bias</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Parameter</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>empty</span>(<span class=pl-s1>n_exp</span>, <span class=pl-c1>1</span>, <span class=pl-s1>d</span>)) <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span> <span class=pl-k>else</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>gelu</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>GELU</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>dropout</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Dropout</span>(<span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>bmm</span>(<span class=pl-s1>x</span>, <span class=pl-s1>self</span>.<span class=pl-c1>c_fc</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>+=</span> <span class=pl-s1>self</span>.<span class=pl-c1>fc_bias</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>gelu</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>bmm</span>(<span class=pl-s1>x</span>, <span class=pl-s1>self</span>.<span class=pl-c1>c_proj</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>bias</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>x</span> <span class=pl-c1>+=</span> <span class=pl-s1>self</span>.<span class=pl-c1>proj_bias</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>dropout</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>x</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/5448764d97ceed8a1cb0af9b4e21f48f/raw/0e5f3e9f116fc7ff64f8aa09acaa755ba7854589/expert_layer.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/5448764d97ceed8a1cb0af9b4e21f48f#file-expert_layer-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          expert_layer.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist136644786" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-expert_layer-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="expert_layer.py">
        <tbody><tr>
          <td id="file-expert_layer-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-expert_layer-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-expert_layer-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Based upon ColossalAI OpenMoE: https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/moe/experts.py</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-expert_layer-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-expert_layer-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-expert_layer-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-expert_layer-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-expert_layer-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-expert_layer-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">MLPExperts</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-expert_layer-py-LC9" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-expert_layer-py-LC10" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-expert_layer-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-expert_layer-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-expert_layer-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">n_exp</span><span class="pl-c1">=</span><span class="pl-c1">8</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-expert_layer-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-expert_layer-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-expert_layer-py-LC16" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-expert_layer-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-expert_layer-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-expert_layer-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-expert_layer-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        n_exp: the number of experts to create in the expert layer</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-expert_layer-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-expert_layer-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-expert_layer-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-expert_layer-py-LC24" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-expert_layer-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-expert_layer-py-LC26" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-c1">=</span> <span class="pl-s1">bias</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-expert_layer-py-LC27" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_fc</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Parameter</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">empty</span>(<span class="pl-s1">n_exp</span>, <span class="pl-s1">d</span>, <span class="pl-c1">4</span> <span class="pl-c1">*</span> <span class="pl-s1">d</span>))</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-expert_layer-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Parameter</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">empty</span>(<span class="pl-s1">n_exp</span>, <span class="pl-c1">4</span> <span class="pl-c1">*</span> <span class="pl-s1">d</span>, <span class="pl-s1">d</span>))</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-expert_layer-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">fc_bias</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Parameter</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">empty</span>(<span class="pl-s1">n_exp</span>, <span class="pl-c1">1</span>, <span class="pl-c1">4</span> <span class="pl-c1">*</span> <span class="pl-s1">d</span>)) <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-k">else</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-expert_layer-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">proj_bias</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Parameter</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">empty</span>(<span class="pl-s1">n_exp</span>, <span class="pl-c1">1</span>, <span class="pl-s1">d</span>)) <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span> <span class="pl-k">else</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-expert_layer-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">gelu</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">GELU</span>()</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-expert_layer-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">dropout</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Dropout</span>(<span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-expert_layer-py-LC33" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-expert_layer-py-LC34" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-expert_layer-py-LC35" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">bmm</span>(<span class="pl-s1">x</span>, <span class="pl-s1">self</span>.<span class="pl-c1">c_fc</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-expert_layer-py-LC36" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span>:</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-expert_layer-py-LC37" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">+=</span> <span class="pl-s1">self</span>.<span class="pl-c1">fc_bias</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-expert_layer-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">gelu</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-expert_layer-py-LC39" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">bmm</span>(<span class="pl-s1">x</span>, <span class="pl-s1">self</span>.<span class="pl-c1">c_proj</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-expert_layer-py-LC40" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">bias</span>:</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-expert_layer-py-LC41" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">x</span> <span class="pl-c1">+=</span> <span class="pl-s1">self</span>.<span class="pl-c1">proj_bias</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-expert_layer-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">dropout</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-expert_layer-py-LC43" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">x</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/5448764d97ceed8a1cb0af9b4e21f48f/raw/0e5f3e9f116fc7ff64f8aa09acaa755ba7854589/expert_layer.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/5448764d97ceed8a1cb0af9b4e21f48f#file-expert_layer-py" class="Link--inTextBlock">
          expert_layer.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Creating an MoE.</strong> To create an MoE-based decoder-only transformer, we simply convert the transformer&#8217;s feed-forward layers to MoE&#8212;<em>or expert</em>&#8212;layers. Each expert within the MoE layer has an architecture that is identical to the original, feed-forward network from that layer. We just have several independent copies of the original feed-forward network within an expert layer; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tPDR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png" width="1456" height="843" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tPDR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 424w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 848w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1272w, https://substackcdn.com/image/fetch/$s_!tPDR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fbb9a24-440d-4d26-8092-b6d72dafb55e_1482x858.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Adding experts to a decoder-only transformer (from [1])</figcaption></figure></div><p>However, we need not use experts for every feed-forward layer in the transformer. Most MoE-based LLMs use a stride of <code>P</code>, meaning that every <code>P</code>-th layer is converted into an expert layer and other layer are left untouched. </p><blockquote><p><em>&#8220;The ST-MoE models have 32 experts with an expert layer frequency of 1/4 (every fourth FFN layer is replaced by an MoE layer).&#8221; </em>- from [24]</p></blockquote><p>A high-level implementation of this idea is provided in the pseudocode shown below. These &#8220;interleaved&#8221; MoE layers control the total number of experts within the MoE, which is a useful mechanism for balancing performance and efficiency. </p><pre><code>transformer_blocks = []
for i in range(num_blocks):
    use_moe = (i % P) == 0

    # when use_moe = False, this is regular transformer block
    # when use_moe = True, this is an expert layer
    transformer_blocks.append(Block(use_moe=use_moe))</code></pre><h4>Routing Tokens to Experts</h4><p>The primary benefit of MoE-based architectures is their efficiency, but using experts alone does not improve efficiency! In fact, adding more experts to each layer of the model significantly increases the total number parameters&#8212;<em>and the amount of necessary compute</em>&#8212;for the model. To improve efficiency, we need to sparsely select and use only a subset of experts within each layer. By sparsely utilizing experts, we can get the benefits of a much larger model without a significant increase in the computational costs of training and inference. </p><blockquote><p><em>&#8220;Using an MoE architecture makes it possible to attain better tradeoffs between model quality and inference efficiency than dense models typically achieve.&#8221;</em> - <a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm">source</a></p></blockquote><p><strong>Selecting experts.</strong> Let&#8217;s consider a single token&#8212;<em>represented by a </em><code>d</code><em>-dimensional token vector</em>. Our goal is to select a subset of experts (of size <code>k</code>) to process this token. In the MoE literature, <em>we usually say that the token will be &#8220;routed&#8221; to these experts</em>. We need an algorithm to compute and optimize this routing operation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FZCc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FZCc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 424w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 848w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1272w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png" width="482" height="356.9134948096886" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1156,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FZCc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 424w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 848w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1272w, https://substackcdn.com/image/fetch/$s_!FZCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1189a50c-ad49-4e09-8fca-b800532e101a_1156x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Routing mechanism for a single token</figcaption></figure></div><p>The simplest possible routing algorithm would apply a linear transformation to the token vector, forming a vector of size <code>N</code> (i.e., the number of experts). Then, we can apply a <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> function to form a probability distribution over the set of experts for our token; see above. We can use this distribution to choose experts to which our token should be routed by selecting top-<code>K</code> experts in the distribution. The top-<code>K</code> values&#8212;<em>the &#8220;expert probabilities&#8221;</em>&#8212;are also important. </p><p><strong>Simple router implementation.</strong> As described above, this routing mechanism is actually quite simple&#8212;<em>it&#8217;s just a linear layer</em>! An implementation of this softmax router is shown below, where the output of our router is:</p><ol><li><p>A set of top-<code>K</code> expert indices for each token in the input.</p></li><li><p>The top-<code>K</code> expert probabilities (i.e., the probability values for each of the top-<code>K</code> indices) associated with selected experts.</p></li></ol><p>Despite its simplicity, this routing mechanism is effective and serves its purpose well. <em>Most modern MoEs adopt a similar linear routing scheme with softmax</em>.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136670122\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-basic_softmax_router-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-basic_softmax_router-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;basic_softmax_router.py\&quot;>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span> <span class=pl-k>import</span> <span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>BasicSoftmaxRouter</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>, </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-c1>8</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>True</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        n_exp: the number of experts to create in the expert layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        top_k: the number of active experts for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        use_noisy_top_k: whether to add noise when computing expert output</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>      </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># router settings</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>=</span> <span class=pl-s1>top_k</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>&amp;gt;=</span> <span class=pl-c1>1</span> <span class=pl-c1>and</span> <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>&amp;lt;=</span> <span class=pl-s1>n_exp</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-s1>use_noisy_top_k</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># linear projection for (noisy) softmax routing</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># no bias used, see page 4 eq (4) in https://arxiv.org/abs/1701.06538</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_g</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>n_exp</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_noise</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>n_exp</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>) <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span> <span class=pl-k>else</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># eq (4) in https://arxiv.org/abs/1701.06538</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>logits</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>w_g</span>(<span class=pl-s1>x</span>)  <span class=pl-c># [B, C, d] -&amp;gt; [B, C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c># (optionally) add noise into the router</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>noise</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softplus</span>(<span class=pl-s1>self</span>.<span class=pl-c1>w_noise</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>noise</span> <span class=pl-c1>*=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randn_like</span>(<span class=pl-s1>noise</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>logits</span> <span class=pl-c1>+=</span> <span class=pl-s1>noise</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k_logits</span>, <span class=pl-s1>top_k_indices</span> <span class=pl-c1>=</span> <span class=pl-s1>logits</span>.<span class=pl-c1>topk</span>(<span class=pl-s1>self</span>.<span class=pl-c1>top_k</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>) <span class=pl-c># [B, C, k]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-basic_softmax_router-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-basic_softmax_router-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>top_k_logits</span>, <span class=pl-s1>top_k_indices</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/46f03d50617f256f4560f299422f7ceb/raw/71a6b6ba20d162028b42f20cbe6172a71fe5b86b/basic_softmax_router.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/46f03d50617f256f4560f299422f7ceb#file-basic_softmax_router-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          basic_softmax_router.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-9060cf3ad5bb.css"><div id="gist136670122" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-basic_softmax_router-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="basic_softmax_router.py">
        <tbody><tr>
          <td id="file-basic_softmax_router-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-basic_softmax_router-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-basic_softmax_router-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-basic_softmax_router-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span> <span class="pl-k">import</span> <span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-basic_softmax_router-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-basic_softmax_router-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">BasicSoftmaxRouter</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-basic_softmax_router-py-LC6" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-basic_softmax_router-py-LC7" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-basic_softmax_router-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>, </td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-basic_softmax_router-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span>,</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-basic_softmax_router-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">2</span>,</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-basic_softmax_router-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span>,</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-basic_softmax_router-py-LC12" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-basic_softmax_router-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-basic_softmax_router-py-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-basic_softmax_router-py-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-basic_softmax_router-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        n_exp: the number of experts to create in the expert layer</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-basic_softmax_router-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        top_k: the number of active experts for each token</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-basic_softmax_router-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        use_noisy_top_k: whether to add noise when computing expert output</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-basic_softmax_router-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-basic_softmax_router-py-LC20" class="blob-code blob-code-inner js-file-line">      </td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-basic_softmax_router-py-LC21" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-basic_softmax_router-py-LC22" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-basic_softmax_router-py-LC23" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># router settings</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-basic_softmax_router-py-LC24" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">=</span> <span class="pl-s1">top_k</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-basic_softmax_router-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">&gt;=</span> <span class="pl-c1">1</span> <span class="pl-c1">and</span> <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">&lt;=</span> <span class="pl-s1">n_exp</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-basic_softmax_router-py-LC26" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-s1">use_noisy_top_k</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-basic_softmax_router-py-LC27" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-basic_softmax_router-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># linear projection for (noisy) softmax routing</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-basic_softmax_router-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># no bias used, see page 4 eq (4) in https://arxiv.org/abs/1701.06538</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-basic_softmax_router-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_g</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">n_exp</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-basic_softmax_router-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_noise</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">n_exp</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>) <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span> <span class="pl-k">else</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-basic_softmax_router-py-LC32" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-basic_softmax_router-py-LC33" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-basic_softmax_router-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># eq (4) in https://arxiv.org/abs/1701.06538</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-basic_softmax_router-py-LC35" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">logits</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">w_g</span>(<span class="pl-s1">x</span>)  <span class="pl-c"># [B, C, d] -&gt; [B, C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-basic_softmax_router-py-LC36" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span>:</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-basic_softmax_router-py-LC37" class="blob-code blob-code-inner js-file-line">            <span class="pl-c"># (optionally) add noise into the router</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-basic_softmax_router-py-LC38" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">noise</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softplus</span>(<span class="pl-s1">self</span>.<span class="pl-c1">w_noise</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-basic_softmax_router-py-LC39" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">noise</span> <span class="pl-c1">*=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randn_like</span>(<span class="pl-s1">noise</span>)</td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-basic_softmax_router-py-LC40" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">logits</span> <span class="pl-c1">+=</span> <span class="pl-s1">noise</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-basic_softmax_router-py-LC41" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k_logits</span>, <span class="pl-s1">top_k_indices</span> <span class="pl-c1">=</span> <span class="pl-s1">logits</span>.<span class="pl-c1">topk</span>(<span class="pl-s1">self</span>.<span class="pl-c1">top_k</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>) <span class="pl-c"># [B, C, k]</span></td>
        </tr>
        <tr>
          <td id="file-basic_softmax_router-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-basic_softmax_router-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">top_k_logits</span>, <span class="pl-s1">top_k_indices</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/46f03d50617f256f4560f299422f7ceb/raw/71a6b6ba20d162028b42f20cbe6172a71fe5b86b/basic_softmax_router.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/46f03d50617f256f4560f299422f7ceb#file-basic_softmax_router-py" class="Link--inTextBlock">
          basic_softmax_router.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>Optionally, we can add noise into the routing mechanism, an approach proposed in [8]&#8212;<em>one of the earliest works on applying MoEs to neural networks</em>. By adding this small amount of (learnable) noise into the output of the routing mechanism (see below for details), we can help to regularize the MoE&#8217;s training process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LriU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LriU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 424w, https://substackcdn.com/image/fetch/$s_!LriU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 848w, https://substackcdn.com/image/fetch/$s_!LriU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!LriU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LriU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png" width="1456" height="860" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6453620-af80-438f-b824-80a41a86a822_1916x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:359197,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LriU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 424w, https://substackcdn.com/image/fetch/$s_!LriU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 848w, https://substackcdn.com/image/fetch/$s_!LriU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!LriU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6453620-af80-438f-b824-80a41a86a822_1916x1132.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Adding noise to top-k softmax routing (from [7])</figcaption></figure></div><p><strong>Active parameters.</strong> Because we only select a subset of experts to process each token within an MoE layer, there is a concept of &#8220;active&#8221; parameters in the MoE literature. Put simply, only a small portion of the MoE model&#8217;s total parameters&#8212;<em>given by the experts selected at each MoE layer</em>&#8212;are active when processing a given token. The total computation performed by the MoE is proportional to the number of active parameters rather than the total number of parameters.</p><h4>Expert Capacity</h4><blockquote><p><em>&#8220;To improve hardware utilization, most implementations of sparse models have static batch sizes for each expert. The expert capacity refers to the number of tokens that can be routed to each expert. If this capacity is exceeded then the overflowed tokens&#8230; are passed to the next layer through a residual connection.&#8221;</em> - from [5]</p></blockquote><p>The computation performed in an expert layer is dynamic. We choose the tokens to be computed by each expert based on the output of the router, which changes depending upon the sequences of tokens provided as input to the MoE. The dynamic nature of the input for each expert can make the implementation of an expert layer somewhat complicated: <em>How can we deal with the fact that each expert&#8217;s input will have a different and unpredictable size?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jxdi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jxdi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 424w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 848w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 1272w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jxdi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png" width="1456" height="574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jxdi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 424w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 848w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 1272w, https://substackcdn.com/image/fetch/$s_!Jxdi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde21dffa-d12f-479e-92e5-617f48c9f4d1_2368x934.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computing the expert capacity</figcaption></figure></div><p><strong>Expert capacity.</strong> Most practical implementations of MoEs avoid this problem by using fixed batch sizes for each expert&#8212;<em>this is a useful trick for improving hardware utilization</em>. Each expert uses the same static batch size, referred to as &#8220;expert capacity&#8221;. The expert capacity&#8212;<em>defined in the above equation</em>&#8212;dictates the maximum number of tokens in each batch that can be sent to any single expert. </p><p>Expert capacity is controlled via the capacity factor setting. A capacity factor of one means that tokens are routed uniformly, while setting the capacity factor greater than one provides extra buffer to handle imbalanced token routing between experts&#8212;<em>this comes at the cost of higher memory usage and lower efficiency</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vE2b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vE2b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 424w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 848w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png" width="1456" height="966" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:966,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vE2b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 424w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 848w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1272w, https://substackcdn.com/image/fetch/$s_!vE2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p>If the number of tokens routed to an expert exceeds the expert capacity, then we &#8220;drop&#8221; these extra tokens by performing no computation and letting their representation flow directly to the next layer via the transformer&#8217;s residual connection; see above. MoEs perform well with relatively low capacity factors<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, but we should make sure to avoid too many tokens being dropped. The capacity factor can also be different during training and evaluation; e.g., ST-MoE [5] uses a capacity factor of 1.25 and 2.0 during training and evaluation, respectively.</p><p><strong>PyTorch implementation.</strong> Now that we understand expert capacity and the details of routing within an expert layer, we need to implement a fully-functional router. This router will share the same logic as our prior implementation (i.e., a linear layer with softmax), but it will go beyond this implementation by creating the fixed-size input tensors for each of the experts; see below. Given that this is a fully-functional implementation, the router below is more complex than before. However, we can distill this implementation into the following components:</p><ul><li><p><em>Lines 41-47</em>: Compute the output of the (noisy) linear router. </p></li><li><p><em>Lines 49-52</em>: Compute the top-<code>K</code> experts and their associated probabilities.</p></li><li><p><em>Lines 55-58</em>: Compute the expert capacity. </p></li><li><p><em>Lines 60-88</em>: Use fancy PyTorch indexing and tensor manipulation to handle constructing the batch of expert inputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p></li><li><p><em>Lines 90-93</em>: Construct the final batch of expert inputs. </p></li></ul><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136689021\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-full_softmax_router-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-full_softmax_router-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;full_softmax_router.py\&quot;>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>math</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span> <span class=pl-k>import</span> <span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>Router</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>, </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-c1>8</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>True</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>capacity_factor</span> <span class=pl-c1>=</span> <span class=pl-c1>1.25</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        n_exp: the number of experts to create in the expert layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        top_k: the number of active experts for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        use_noisy_top_k: whether to add noise when computing expert output</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        capacity_factor: used to compute expert capacity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>      </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>d</span> <span class=pl-c1>=</span> <span class=pl-s1>d</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-s1>n_exp</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>=</span> <span class=pl-s1>top_k</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>assert</span> <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>&amp;gt;=</span> <span class=pl-c1>1</span> <span class=pl-c1>and</span> <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>&amp;lt;=</span> <span class=pl-s1>n_exp</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-s1>use_noisy_top_k</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>capacity_factor</span> <span class=pl-c1>=</span> <span class=pl-s1>capacity_factor</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_g</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>n_exp</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>w_noise</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>Linear</span>(<span class=pl-s1>d</span>, <span class=pl-s1>n_exp</span>, <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>) <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span> <span class=pl-k>else</span> <span class=pl-c1>None</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># get the total number of tokens in the batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>B</span>, <span class=pl-c1>C</span>, <span class=pl-s1>_</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>size</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>num_tokens</span> <span class=pl-c1>=</span> <span class=pl-c1>B</span> <span class=pl-c1>*</span> <span class=pl-c1>C</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># eq (4) in https://arxiv.org/abs/1701.06538</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>logits</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>w_g</span>(<span class=pl-s1>x</span>)  <span class=pl-c># [B, C, d] -&amp;gt; [B, C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>if</span> <span class=pl-s1>self</span>.<span class=pl-c1>use_noisy_top_k</span>:</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-c># (optionally) add noise into the router</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>noise</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softplus</span>(<span class=pl-s1>self</span>.<span class=pl-c1>w_noise</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>noise</span> <span class=pl-c1>*=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randn_like</span>(<span class=pl-s1>noise</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>logits</span> <span class=pl-c1>+=</span> <span class=pl-s1>noise</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># top-K expert selection, compute probabilities over active experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k_logits</span>, <span class=pl-s1>top_k_indices</span> <span class=pl-c1>=</span> <span class=pl-s1>logits</span>.<span class=pl-c1>topk</span>(<span class=pl-s1>self</span>.<span class=pl-c1>top_k</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>) <span class=pl-c># [B, C, K]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>router_probs</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>full_like</span>(<span class=pl-s1>logits</span>, <span class=pl-en>float</span>(<span class=pl-s>&amp;#39;-inf&amp;#39;</span>))  <span class=pl-c># [B, C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>router_probs</span>.<span class=pl-c1>scatter_</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-s1>top_k_indices</span>, <span class=pl-s1>top_k_logits</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>router_probs</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softmax</span>(<span class=pl-s1>router_probs</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute the expert capacity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_capacity</span> <span class=pl-c1>=</span> <span class=pl-s1>math</span>.<span class=pl-c1>floor</span>(<span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>*</span> <span class=pl-s1>self</span>.<span class=pl-c1>capacity_factor</span> <span class=pl-c1>*</span> <span class=pl-s1>num_tokens</span> <span class=pl-c1>/</span> <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)   </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L57\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_capacity</span> <span class=pl-c1>+=</span> <span class=pl-s1>exp_capacity</span> <span class=pl-c1>%</span> <span class=pl-c1>2</span> <span class=pl-c># make sure expert capacity is an even integer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_capacity</span> <span class=pl-c1>=</span> <span class=pl-en>int</span>(<span class=pl-s1>exp_capacity</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># make a multi-hot mask of chosen experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># values are 0 if expert not chosen, 1 if expert chosen</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L62\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;62\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC62\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_mask</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>one_hot</span>(<span class=pl-s1>top_k_indices</span>, <span class=pl-s1>num_classes</span><span class=pl-c1>=</span><span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)  <span class=pl-c># [B, C, K, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L63\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;63\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC63\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_mask</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_mask</span>.<span class=pl-c1>view</span>(<span class=pl-s1>num_tokens</span>, <span class=pl-s1>self</span>.<span class=pl-c1>top_k</span>, <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)  <span class=pl-c># [B * C, K, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L64\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;64\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC64\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_mask</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_mask</span>.<span class=pl-c1>permute</span>(<span class=pl-c1>1</span>, <span class=pl-c1>0</span>, <span class=pl-c1>2</span>) <span class=pl-c># [K, B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L65\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;65\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC65\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L66\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;66\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC66\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute index for each token in expert batch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L67\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;67\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC67\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># NOTE: cumsum counts top-1 first, top-2 second, etc.</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L68\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;68\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC68\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># to prioritize top experts when dropping tokens</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L69\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;69\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC69\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_rank</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_mask</span>.<span class=pl-c1>reshape</span>(<span class=pl-s1>self</span>.<span class=pl-c1>top_k</span> <span class=pl-c1>*</span> <span class=pl-s1>num_tokens</span>, <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)  <span class=pl-c># [K * B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L70\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;70\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC70\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_rank</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>cumsum</span>(<span class=pl-s1>exp_rank</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>0</span>) <span class=pl-c1>-</span> <span class=pl-c1>1</span>  <span class=pl-c># cumsum of expert selections [K * B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L71\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;71\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC71\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_rank</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_rank</span>.<span class=pl-c1>reshape</span>(<span class=pl-s1>self</span>.<span class=pl-c1>top_k</span>, <span class=pl-s1>num_tokens</span>, <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)  <span class=pl-c># [K, B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L72\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;72\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC72\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L73\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;73\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC73\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># mask entries beyond expert capacity and compute used capacity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L74\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;74\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC74\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_mask</span> <span class=pl-c1>*=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>lt</span>(<span class=pl-s1>exp_rank</span>, <span class=pl-s1>exp_capacity</span>) <span class=pl-c># [K, B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L75\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;75\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC75\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L76\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;76\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC76\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># matrix storing token position in batch of corresponding expert </span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L77\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;77\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC77\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_rank</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>exp_mask</span> <span class=pl-c1>*</span> <span class=pl-s1>exp_rank</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>)  <span class=pl-c># [K, B * C]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L78\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;78\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC78\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L79\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;79\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC79\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># mask probabilities to only include selected experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L80\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;80\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC80\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>router_probs</span> <span class=pl-c1>=</span> <span class=pl-s1>router_probs</span>.<span class=pl-c1>view</span>(<span class=pl-s1>num_tokens</span>, <span class=pl-s1>self</span>.<span class=pl-c1>n_exp</span>)[<span class=pl-c1>None</span>, :] <span class=pl-c># [1, B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L81\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;81\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC81\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_weights</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_mask</span> <span class=pl-c1>*</span> <span class=pl-s1>router_probs</span> <span class=pl-c># [K, B * C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L82\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;82\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC82\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L83\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;83\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC83\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># position of each token within the capacity of the selected expert</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L84\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;84\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC84\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_rank_sc</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>one_hot</span>(<span class=pl-s1>exp_rank</span>, <span class=pl-s1>num_classes</span><span class=pl-c1>=</span><span class=pl-s1>exp_capacity</span>) <span class=pl-c># [K, B * C, exp_capacity]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L85\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;85\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC85\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L86\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;86\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC86\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># weight of selected expert for each token at position the capacity of that expert </span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L87\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;87\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC87\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_weights</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>exp_weights</span>.<span class=pl-c1>unsqueeze</span>(<span class=pl-c1>3</span>) <span class=pl-c1>*</span> <span class=pl-s1>exp_rank_sc</span>.<span class=pl-c1>unsqueeze</span>(<span class=pl-c1>2</span>), <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>0</span>) <span class=pl-c># [B * C, n_exp, exp_capacity]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L88\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;88\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC88\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_mask</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_weights</span>.<span class=pl-c1>bool</span>() <span class=pl-c># binary mask of selected experts for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L89\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;89\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC89\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L90\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;90\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC90\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># reshape tokens into batches for each expert, return both weights and batches</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L91\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;91\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC91\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># [n_exp, exp_capacity, B * C] * [B * C, d] -&amp;gt; [n_exp, exp_capacity, n_embd]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L92\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;92\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC92\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>view</span>(<span class=pl-s1>num_tokens</span>, <span class=pl-s1>self</span>.<span class=pl-c1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L93\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;93\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC93\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>expert_batches</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_mask</span>.<span class=pl-c1>permute</span>(<span class=pl-c1>1</span>, <span class=pl-c1>2</span>, <span class=pl-c1>0</span>).<span class=pl-c1>type_as</span>(<span class=pl-s1>x</span>) @ <span class=pl-s1>x</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-full_softmax_router-py-L94\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;94\&quot;></td>\n          <td id=\&quot;file-full_softmax_router-py-LC94\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>exp_weights</span>, <span class=pl-s1>exp_mask</span>, <span class=pl-s1>expert_batches</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/6cc8a81c546537e903521356a3a60675/raw/b0fa54d901c05c9b9383c43d547fd94af597a40a/full_softmax_router.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/6cc8a81c546537e903521356a3a60675#file-full_softmax_router-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          full_softmax_router.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist136689021" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-full_softmax_router-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="full_softmax_router.py">
        <tbody><tr>
          <td id="file-full_softmax_router-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-full_softmax_router-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">math</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-full_softmax_router-py-LC2" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-full_softmax_router-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-full_softmax_router-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-full_softmax_router-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span> <span class="pl-k">import</span> <span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-full_softmax_router-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-full_softmax_router-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">Router</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-full_softmax_router-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-full_softmax_router-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-full_softmax_router-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>, </td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-full_softmax_router-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span>,</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-full_softmax_router-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">2</span>,</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-full_softmax_router-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span>,</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-full_softmax_router-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">capacity_factor</span> <span class="pl-c1">=</span> <span class="pl-c1">1.25</span>,</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-full_softmax_router-py-LC15" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-full_softmax_router-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-full_softmax_router-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-full_softmax_router-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-full_softmax_router-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        n_exp: the number of experts to create in the expert layer</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-full_softmax_router-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        top_k: the number of active experts for each token</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-full_softmax_router-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        use_noisy_top_k: whether to add noise when computing expert output</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-full_softmax_router-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        capacity_factor: used to compute expert capacity</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-full_softmax_router-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-full_softmax_router-py-LC24" class="blob-code blob-code-inner js-file-line">      </td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-full_softmax_router-py-LC25" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-full_softmax_router-py-LC26" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-full_softmax_router-py-LC27" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">d</span> <span class="pl-c1">=</span> <span class="pl-s1">d</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-full_softmax_router-py-LC28" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-s1">n_exp</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-full_softmax_router-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">=</span> <span class="pl-s1">top_k</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-full_softmax_router-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">assert</span> <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">&gt;=</span> <span class="pl-c1">1</span> <span class="pl-c1">and</span> <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">&lt;=</span> <span class="pl-s1">n_exp</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-full_softmax_router-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-s1">use_noisy_top_k</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-full_softmax_router-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">capacity_factor</span> <span class="pl-c1">=</span> <span class="pl-s1">capacity_factor</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-full_softmax_router-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_g</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">n_exp</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-full_softmax_router-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">w_noise</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">Linear</span>(<span class="pl-s1">d</span>, <span class="pl-s1">n_exp</span>, <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>) <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span> <span class="pl-k">else</span> <span class="pl-c1">None</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-full_softmax_router-py-LC35" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-full_softmax_router-py-LC36" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-full_softmax_router-py-LC37" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># get the total number of tokens in the batch</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-full_softmax_router-py-LC38" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">B</span>, <span class="pl-c1">C</span>, <span class="pl-s1">_</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">size</span>()</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-full_softmax_router-py-LC39" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">num_tokens</span> <span class="pl-c1">=</span> <span class="pl-c1">B</span> <span class="pl-c1">*</span> <span class="pl-c1">C</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-full_softmax_router-py-LC40" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-full_softmax_router-py-LC41" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># eq (4) in https://arxiv.org/abs/1701.06538</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-full_softmax_router-py-LC42" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">logits</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">w_g</span>(<span class="pl-s1">x</span>)  <span class="pl-c"># [B, C, d] -&gt; [B, C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-full_softmax_router-py-LC43" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">if</span> <span class="pl-s1">self</span>.<span class="pl-c1">use_noisy_top_k</span>:</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-full_softmax_router-py-LC44" class="blob-code blob-code-inner js-file-line">            <span class="pl-c"># (optionally) add noise into the router</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-full_softmax_router-py-LC45" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">noise</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softplus</span>(<span class="pl-s1">self</span>.<span class="pl-c1">w_noise</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-full_softmax_router-py-LC46" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">noise</span> <span class="pl-c1">*=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randn_like</span>(<span class="pl-s1">noise</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-full_softmax_router-py-LC47" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">logits</span> <span class="pl-c1">+=</span> <span class="pl-s1">noise</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-full_softmax_router-py-LC48" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-full_softmax_router-py-LC49" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># top-K expert selection, compute probabilities over active experts</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-full_softmax_router-py-LC50" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k_logits</span>, <span class="pl-s1">top_k_indices</span> <span class="pl-c1">=</span> <span class="pl-s1">logits</span>.<span class="pl-c1">topk</span>(<span class="pl-s1">self</span>.<span class="pl-c1">top_k</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>) <span class="pl-c"># [B, C, K]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-full_softmax_router-py-LC51" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">router_probs</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">full_like</span>(<span class="pl-s1">logits</span>, <span class="pl-en">float</span>(<span class="pl-s">'-inf'</span>))  <span class="pl-c"># [B, C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-full_softmax_router-py-LC52" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">router_probs</span>.<span class="pl-c1">scatter_</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-s1">top_k_indices</span>, <span class="pl-s1">top_k_logits</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-full_softmax_router-py-LC53" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">router_probs</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softmax</span>(<span class="pl-s1">router_probs</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-full_softmax_router-py-LC54" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-full_softmax_router-py-LC55" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute the expert capacity</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-full_softmax_router-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_capacity</span> <span class="pl-c1">=</span> <span class="pl-s1">math</span>.<span class="pl-c1">floor</span>(<span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">*</span> <span class="pl-s1">self</span>.<span class="pl-c1">capacity_factor</span> <span class="pl-c1">*</span> <span class="pl-s1">num_tokens</span> <span class="pl-c1">/</span> <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)   </td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-full_softmax_router-py-LC57" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_capacity</span> <span class="pl-c1">+=</span> <span class="pl-s1">exp_capacity</span> <span class="pl-c1">%</span> <span class="pl-c1">2</span> <span class="pl-c"># make sure expert capacity is an even integer</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-full_softmax_router-py-LC58" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_capacity</span> <span class="pl-c1">=</span> <span class="pl-en">int</span>(<span class="pl-s1">exp_capacity</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-full_softmax_router-py-LC59" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-full_softmax_router-py-LC60" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># make a multi-hot mask of chosen experts</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-full_softmax_router-py-LC61" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># values are 0 if expert not chosen, 1 if expert chosen</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L62" class="blob-num js-line-number js-blob-rnum" data-line-number="62"></td>
          <td id="file-full_softmax_router-py-LC62" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_mask</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">one_hot</span>(<span class="pl-s1">top_k_indices</span>, <span class="pl-s1">num_classes</span><span class="pl-c1">=</span><span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)  <span class="pl-c"># [B, C, K, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L63" class="blob-num js-line-number js-blob-rnum" data-line-number="63"></td>
          <td id="file-full_softmax_router-py-LC63" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_mask</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_mask</span>.<span class="pl-c1">view</span>(<span class="pl-s1">num_tokens</span>, <span class="pl-s1">self</span>.<span class="pl-c1">top_k</span>, <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)  <span class="pl-c"># [B * C, K, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L64" class="blob-num js-line-number js-blob-rnum" data-line-number="64"></td>
          <td id="file-full_softmax_router-py-LC64" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_mask</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_mask</span>.<span class="pl-c1">permute</span>(<span class="pl-c1">1</span>, <span class="pl-c1">0</span>, <span class="pl-c1">2</span>) <span class="pl-c"># [K, B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L65" class="blob-num js-line-number js-blob-rnum" data-line-number="65"></td>
          <td id="file-full_softmax_router-py-LC65" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L66" class="blob-num js-line-number js-blob-rnum" data-line-number="66"></td>
          <td id="file-full_softmax_router-py-LC66" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute index for each token in expert batch</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L67" class="blob-num js-line-number js-blob-rnum" data-line-number="67"></td>
          <td id="file-full_softmax_router-py-LC67" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># NOTE: cumsum counts top-1 first, top-2 second, etc.</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L68" class="blob-num js-line-number js-blob-rnum" data-line-number="68"></td>
          <td id="file-full_softmax_router-py-LC68" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># to prioritize top experts when dropping tokens</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L69" class="blob-num js-line-number js-blob-rnum" data-line-number="69"></td>
          <td id="file-full_softmax_router-py-LC69" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_rank</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_mask</span>.<span class="pl-c1">reshape</span>(<span class="pl-s1">self</span>.<span class="pl-c1">top_k</span> <span class="pl-c1">*</span> <span class="pl-s1">num_tokens</span>, <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)  <span class="pl-c"># [K * B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L70" class="blob-num js-line-number js-blob-rnum" data-line-number="70"></td>
          <td id="file-full_softmax_router-py-LC70" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_rank</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">cumsum</span>(<span class="pl-s1">exp_rank</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">0</span>) <span class="pl-c1">-</span> <span class="pl-c1">1</span>  <span class="pl-c"># cumsum of expert selections [K * B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L71" class="blob-num js-line-number js-blob-rnum" data-line-number="71"></td>
          <td id="file-full_softmax_router-py-LC71" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_rank</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_rank</span>.<span class="pl-c1">reshape</span>(<span class="pl-s1">self</span>.<span class="pl-c1">top_k</span>, <span class="pl-s1">num_tokens</span>, <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)  <span class="pl-c"># [K, B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L72" class="blob-num js-line-number js-blob-rnum" data-line-number="72"></td>
          <td id="file-full_softmax_router-py-LC72" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L73" class="blob-num js-line-number js-blob-rnum" data-line-number="73"></td>
          <td id="file-full_softmax_router-py-LC73" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># mask entries beyond expert capacity and compute used capacity</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L74" class="blob-num js-line-number js-blob-rnum" data-line-number="74"></td>
          <td id="file-full_softmax_router-py-LC74" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_mask</span> <span class="pl-c1">*=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">lt</span>(<span class="pl-s1">exp_rank</span>, <span class="pl-s1">exp_capacity</span>) <span class="pl-c"># [K, B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L75" class="blob-num js-line-number js-blob-rnum" data-line-number="75"></td>
          <td id="file-full_softmax_router-py-LC75" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L76" class="blob-num js-line-number js-blob-rnum" data-line-number="76"></td>
          <td id="file-full_softmax_router-py-LC76" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># matrix storing token position in batch of corresponding expert </span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L77" class="blob-num js-line-number js-blob-rnum" data-line-number="77"></td>
          <td id="file-full_softmax_router-py-LC77" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_rank</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">exp_mask</span> <span class="pl-c1">*</span> <span class="pl-s1">exp_rank</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>)  <span class="pl-c"># [K, B * C]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L78" class="blob-num js-line-number js-blob-rnum" data-line-number="78"></td>
          <td id="file-full_softmax_router-py-LC78" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L79" class="blob-num js-line-number js-blob-rnum" data-line-number="79"></td>
          <td id="file-full_softmax_router-py-LC79" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># mask probabilities to only include selected experts</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L80" class="blob-num js-line-number js-blob-rnum" data-line-number="80"></td>
          <td id="file-full_softmax_router-py-LC80" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">router_probs</span> <span class="pl-c1">=</span> <span class="pl-s1">router_probs</span>.<span class="pl-c1">view</span>(<span class="pl-s1">num_tokens</span>, <span class="pl-s1">self</span>.<span class="pl-c1">n_exp</span>)[<span class="pl-c1">None</span>, :] <span class="pl-c"># [1, B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L81" class="blob-num js-line-number js-blob-rnum" data-line-number="81"></td>
          <td id="file-full_softmax_router-py-LC81" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_weights</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_mask</span> <span class="pl-c1">*</span> <span class="pl-s1">router_probs</span> <span class="pl-c"># [K, B * C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L82" class="blob-num js-line-number js-blob-rnum" data-line-number="82"></td>
          <td id="file-full_softmax_router-py-LC82" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L83" class="blob-num js-line-number js-blob-rnum" data-line-number="83"></td>
          <td id="file-full_softmax_router-py-LC83" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># position of each token within the capacity of the selected expert</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L84" class="blob-num js-line-number js-blob-rnum" data-line-number="84"></td>
          <td id="file-full_softmax_router-py-LC84" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_rank_sc</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">one_hot</span>(<span class="pl-s1">exp_rank</span>, <span class="pl-s1">num_classes</span><span class="pl-c1">=</span><span class="pl-s1">exp_capacity</span>) <span class="pl-c"># [K, B * C, exp_capacity]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L85" class="blob-num js-line-number js-blob-rnum" data-line-number="85"></td>
          <td id="file-full_softmax_router-py-LC85" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L86" class="blob-num js-line-number js-blob-rnum" data-line-number="86"></td>
          <td id="file-full_softmax_router-py-LC86" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># weight of selected expert for each token at position the capacity of that expert </span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L87" class="blob-num js-line-number js-blob-rnum" data-line-number="87"></td>
          <td id="file-full_softmax_router-py-LC87" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_weights</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">exp_weights</span>.<span class="pl-c1">unsqueeze</span>(<span class="pl-c1">3</span>) <span class="pl-c1">*</span> <span class="pl-s1">exp_rank_sc</span>.<span class="pl-c1">unsqueeze</span>(<span class="pl-c1">2</span>), <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">0</span>) <span class="pl-c"># [B * C, n_exp, exp_capacity]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L88" class="blob-num js-line-number js-blob-rnum" data-line-number="88"></td>
          <td id="file-full_softmax_router-py-LC88" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_mask</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_weights</span>.<span class="pl-c1">bool</span>() <span class="pl-c"># binary mask of selected experts for each token</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L89" class="blob-num js-line-number js-blob-rnum" data-line-number="89"></td>
          <td id="file-full_softmax_router-py-LC89" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L90" class="blob-num js-line-number js-blob-rnum" data-line-number="90"></td>
          <td id="file-full_softmax_router-py-LC90" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># reshape tokens into batches for each expert, return both weights and batches</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L91" class="blob-num js-line-number js-blob-rnum" data-line-number="91"></td>
          <td id="file-full_softmax_router-py-LC91" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># [n_exp, exp_capacity, B * C] * [B * C, d] -&gt; [n_exp, exp_capacity, n_embd]</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L92" class="blob-num js-line-number js-blob-rnum" data-line-number="92"></td>
          <td id="file-full_softmax_router-py-LC92" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">view</span>(<span class="pl-s1">num_tokens</span>, <span class="pl-s1">self</span>.<span class="pl-c1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L93" class="blob-num js-line-number js-blob-rnum" data-line-number="93"></td>
          <td id="file-full_softmax_router-py-LC93" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">expert_batches</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_mask</span>.<span class="pl-c1">permute</span>(<span class="pl-c1">1</span>, <span class="pl-c1">2</span>, <span class="pl-c1">0</span>).<span class="pl-c1">type_as</span>(<span class="pl-s1">x</span>) @ <span class="pl-s1">x</span></td>
        </tr>
        <tr>
          <td id="file-full_softmax_router-py-L94" class="blob-num js-line-number js-blob-rnum" data-line-number="94"></td>
          <td id="file-full_softmax_router-py-LC94" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">exp_weights</span>, <span class="pl-s1">exp_mask</span>, <span class="pl-s1">expert_batches</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/6cc8a81c546537e903521356a3a60675/raw/b0fa54d901c05c9b9383c43d547fd94af597a40a/full_softmax_router.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/6cc8a81c546537e903521356a3a60675#file-full_softmax_router-py" class="Link--inTextBlock">
          full_softmax_router.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><h4>Load Balancing and Auxiliary Losses</h4><blockquote><p><em>&#8220;The gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network.&#8221;</em> - from [7]</p></blockquote><p>So far, the routing system we have devised does not explicitly encourage a balanced selection of experts in each layer. As a result, the model will converge to a state of repeatedly selecting the same few experts for every token instead of fully utilizing its experts. This phenomenon, which is explained in the quote above, is commonly referred to as &#8220;routing collapse&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HmXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HmXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 424w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 848w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1272w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png" width="1456" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HmXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 424w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 848w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1272w, https://substackcdn.com/image/fetch/$s_!HmXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [6])</figcaption></figure></div><p><strong>Load balancing loss.</strong> To encourage a balanced selection of experts during training, we can simply add an additional component to the training loss that rewards the model for uniformly leveraging its experts.  More specifically, we create the auxiliary loss term shown above, which measures expert importance (i.e., the probability assigned to each expert) and load balancing (i.e., the number of tokens sent to each expert). Such an approach is proposed in [2], where authors create a loss that considers two quantities:</p><ol><li><p>The fraction of router probability allocated to each expert<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p></li><li><p>The fraction of tokens dispatched to each expert.</p></li></ol><p>If we store both of these quantities in their own <code>N</code>-dimensional vectors, we can create a single loss term by taking the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a><sup> </sup>of these two vectors. This loss is minimized when experts receive uniform probability and load balancing.</p><p>An implementation of this load balancing loss in PyTorch is provided below. This implementation has the following key components:</p><ul><li><p><em>Lines 9-17</em>: define all constants and input tensors used for computing the load balancing loss.</p></li><li><p><em>Lines 19-24</em>: compute the ratio or fraction of tokens sent to each expert. </p></li><li><p><em>Lines 26-27</em>: compute the fraction of probability allocated to each expert. </p></li><li><p>Lines 29-31: take a (scaled) dot product between the ratio of tokens and probability for each expert<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p></li></ul><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136705054\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-load_balancing_loss-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-load_balancing_loss-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;load_balancing_loss.py\&quot;>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Computes Switch Transformer auxiliary loss (https://arxiv.org/abs/2101.03961)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>See equations (4)-(6) on page 7</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span>.<span class=pl-s1>nn</span>.<span class=pl-s1>functional</span> <span class=pl-k>as</span> <span class=pl-c1>F</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># constants</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>B</span> <span class=pl-c1>=</span> <span class=pl-c1>16</span>     <span class=pl-c># batch size</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>C</span> <span class=pl-c1>=</span> <span class=pl-c1>256</span>    <span class=pl-c># sequence length</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-c1>8</span>  <span class=pl-c># number of experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>K</span> <span class=pl-c1>=</span> <span class=pl-c1>2</span>      <span class=pl-c># number of active expert</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># define tensors needed to compute load balancing loss</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>indices</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>randint</span>(<span class=pl-c1>1</span>, <span class=pl-s1>n_exp</span> <span class=pl-c1>+</span> <span class=pl-c1>1</span>, (<span class=pl-c1>B</span>, <span class=pl-c1>C</span>, <span class=pl-c1>K</span>)) <span class=pl-c># top-K indices ([B, C, K])</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>expert_probs</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>softmax</span>(<span class=pl-s1>torch</span>.<span class=pl-c1>rand</span>(<span class=pl-c1>B</span>, <span class=pl-c1>C</span>, <span class=pl-s1>n_exp</span>), <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>2</span>) <span class=pl-c># expert probabilities ([B, C, n_exp])</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># equation (5): compute ratio of tokens allocated to each expert</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># total number of tokens is defined as total tokens in batch * K</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>with</span> <span class=pl-s1>torch</span>.<span class=pl-c1>no_grad</span>():</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>one_hot_indices</span> <span class=pl-c1>=</span> <span class=pl-c1>F</span>.<span class=pl-c1>one_hot</span>(<span class=pl-s1>indices</span>, <span class=pl-s1>num_classes</span><span class=pl-c1>=</span><span class=pl-s1>n_exp</span>)  <span class=pl-c># [B, C, K, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>one_hot_indices</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>one_hot_indices</span>.<span class=pl-c1>float</span>(), <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>2</span>)  <span class=pl-c># [B, C, n_exp] (sum over K dimension)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-s1>tokens_per_expert</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>mean</span>(<span class=pl-s1>one_hot_indices</span>.<span class=pl-c1>float</span>(), <span class=pl-s1>dim</span><span class=pl-c1>=</span>(<span class=pl-c1>0</span>, <span class=pl-c1>1</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># equation (6): compute ratio of router probability allocated to each expert</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>prob_per_expert</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>mean</span>(<span class=pl-s1>expert_probs</span>.<span class=pl-c1>float</span>(), <span class=pl-s1>dim</span><span class=pl-c1>=</span>(<span class=pl-c1>0</span>, <span class=pl-c1>1</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># equation (4): take a scaled dot product between prob / token allocation vectors</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># multiply the result by the number of experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-load_balancing_loss-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-load_balancing_loss-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>load_balance_loss</span> <span class=pl-c1>=</span> <span class=pl-s1>n_exp</span> <span class=pl-c1>*</span> <span class=pl-s1>torch</span>.<span class=pl-c1>sum</span>(<span class=pl-s1>prob_per_expert</span> <span class=pl-c1>*</span> <span class=pl-s1>tokens_per_expert</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/12219c5293853610fc46785d8518cb45/raw/c815079211554b79df8d6f87a59d1afe637f1c71/load_balancing_loss.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/12219c5293853610fc46785d8518cb45#file-load_balancing_loss-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          load_balancing_loss.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist136705054" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-load_balancing_loss-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="load_balancing_loss.py">
        <tbody><tr>
          <td id="file-load_balancing_loss-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-load_balancing_loss-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-load_balancing_loss-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Computes Switch Transformer auxiliary loss (https://arxiv.org/abs/2101.03961)</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-load_balancing_loss-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">See equations (4)-(6) on page 7</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-load_balancing_loss-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-load_balancing_loss-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-load_balancing_loss-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-load_balancing_loss-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span>.<span class="pl-s1">nn</span>.<span class="pl-s1">functional</span> <span class="pl-k">as</span> <span class="pl-c1">F</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-load_balancing_loss-py-LC8" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-load_balancing_loss-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># constants</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-load_balancing_loss-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">B</span> <span class="pl-c1">=</span> <span class="pl-c1">16</span>     <span class="pl-c"># batch size</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-load_balancing_loss-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">C</span> <span class="pl-c1">=</span> <span class="pl-c1">256</span>    <span class="pl-c"># sequence length</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-load_balancing_loss-py-LC12" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span>  <span class="pl-c"># number of experts</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-load_balancing_loss-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">K</span> <span class="pl-c1">=</span> <span class="pl-c1">2</span>      <span class="pl-c"># number of active expert</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-load_balancing_loss-py-LC14" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-load_balancing_loss-py-LC15" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># define tensors needed to compute load balancing loss</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-load_balancing_loss-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">indices</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">randint</span>(<span class="pl-c1">1</span>, <span class="pl-s1">n_exp</span> <span class="pl-c1">+</span> <span class="pl-c1">1</span>, (<span class="pl-c1">B</span>, <span class="pl-c1">C</span>, <span class="pl-c1">K</span>)) <span class="pl-c"># top-K indices ([B, C, K])</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-load_balancing_loss-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">expert_probs</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">softmax</span>(<span class="pl-s1">torch</span>.<span class="pl-c1">rand</span>(<span class="pl-c1">B</span>, <span class="pl-c1">C</span>, <span class="pl-s1">n_exp</span>), <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">2</span>) <span class="pl-c"># expert probabilities ([B, C, n_exp])</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-load_balancing_loss-py-LC18" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-load_balancing_loss-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># equation (5): compute ratio of tokens allocated to each expert</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-load_balancing_loss-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># total number of tokens is defined as total tokens in batch * K</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-load_balancing_loss-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-k">with</span> <span class="pl-s1">torch</span>.<span class="pl-c1">no_grad</span>():</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-load_balancing_loss-py-LC22" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">one_hot_indices</span> <span class="pl-c1">=</span> <span class="pl-c1">F</span>.<span class="pl-c1">one_hot</span>(<span class="pl-s1">indices</span>, <span class="pl-s1">num_classes</span><span class="pl-c1">=</span><span class="pl-s1">n_exp</span>)  <span class="pl-c"># [B, C, K, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-load_balancing_loss-py-LC23" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">one_hot_indices</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">one_hot_indices</span>.<span class="pl-c1">float</span>(), <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">2</span>)  <span class="pl-c"># [B, C, n_exp] (sum over K dimension)</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-load_balancing_loss-py-LC24" class="blob-code blob-code-inner js-file-line">    <span class="pl-s1">tokens_per_expert</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">mean</span>(<span class="pl-s1">one_hot_indices</span>.<span class="pl-c1">float</span>(), <span class="pl-s1">dim</span><span class="pl-c1">=</span>(<span class="pl-c1">0</span>, <span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-load_balancing_loss-py-LC25" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-load_balancing_loss-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># equation (6): compute ratio of router probability allocated to each expert</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-load_balancing_loss-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">prob_per_expert</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">mean</span>(<span class="pl-s1">expert_probs</span>.<span class="pl-c1">float</span>(), <span class="pl-s1">dim</span><span class="pl-c1">=</span>(<span class="pl-c1">0</span>, <span class="pl-c1">1</span>))</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-load_balancing_loss-py-LC28" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-load_balancing_loss-py-LC29" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># equation (4): take a scaled dot product between prob / token allocation vectors</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-load_balancing_loss-py-LC30" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># multiply the result by the number of experts</span></td>
        </tr>
        <tr>
          <td id="file-load_balancing_loss-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-load_balancing_loss-py-LC31" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">load_balance_loss</span> <span class="pl-c1">=</span> <span class="pl-s1">n_exp</span> <span class="pl-c1">*</span> <span class="pl-s1">torch</span>.<span class="pl-c1">sum</span>(<span class="pl-s1">prob_per_expert</span> <span class="pl-c1">*</span> <span class="pl-s1">tokens_per_expert</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/12219c5293853610fc46785d8518cb45/raw/c815079211554b79df8d6f87a59d1afe637f1c71/load_balancing_loss.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/12219c5293853610fc46785d8518cb45#file-load_balancing_loss-py" class="Link--inTextBlock">
          load_balancing_loss.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Router z-loss.</strong> To complement the load balancing loss, authors in [3] propose an extra auxiliary loss term, <em>called the router z-loss</em>. The router z-loss constrains the size of the <a href="https://wandb.ai/amanarora/Written-Reports/reports/Understanding-Logits-Sigmoid-Softmax-and-Cross-Entropy-Loss-in-Deep-Learning--Vmlldzo0NDMzNTU3#logits">logits</a>&#8212;<em>not probabilities, this is before softmax is applied</em>&#8212;predicted by the routing mechanism; see below for the formulation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gPGQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 424w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 848w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1272w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png" width="1456" height="443" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1790688e-5328-45f2-98c0-717ba6041470_2090x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gPGQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 424w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 848w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1272w, https://substackcdn.com/image/fetch/$s_!gPGQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We do not want these logits to be too large due to the fact that the router contains an (exponential) softmax function. However, these logits can become very large during training, which can lead to <a href="https://en.wikipedia.org/wiki/Round-off_error">round-off</a> errors that destabilize the training process&#8212;<em>even when using full (</em><code>float32</code><em>) precision</em>. The router z-loss encourages the MoE to keep these logits small and, in turn, avoid these round-off errors. </p><blockquote><p><em>&#8220;The router computes the probability distribution over the experts in float32 precision. However, at the largest scales, we find this is insufficient to yield reliable training.&#8221;</em> - from [3]</p></blockquote><p>An implementation of the router z-loss is provided below, which contains three key steps:</p><ol><li><p><em>Lines 8-14</em>: Create the input tensor needed to compute the router z-loss (i.e., logits from the routing mechanism).</p></li><li><p><em>Line 21</em>: Take a squared <a href="https://pytorch.org/docs/stable/generated/torch.logsumexp.html">logsumexp</a> of router logits. This is a numerically stable shorthand for applying the exponential, sum, and log operations in sequence.</p></li><li><p><em>Line 24</em>: Sum the result of the above operation over all tokens and divide by the total number of tokens (i.e., take an average). </p></li></ol><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136705390\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-router_z_loss-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-router_z_loss-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;router_z_loss.py\&quot;>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Computes ST-MoE router z loss (https://arxiv.org/abs/2202.08906)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>See equation (5) on page 7</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>import</span> <span class=pl-s1>torch</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># constants</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>B</span> <span class=pl-c1>=</span> <span class=pl-c1>16</span>     <span class=pl-c># batch size</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c1>C</span> <span class=pl-c1>=</span> <span class=pl-c1>256</span>    <span class=pl-c># sequence length</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-c1>8</span>  <span class=pl-c># number of experts</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># create input tensor for router z-loss</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>router_logits</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>rand</span>(<span class=pl-c1>B</span>, <span class=pl-c1>C</span>, <span class=pl-s1>n_exp</span>) <span class=pl-c># [B, C, n_exp]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># exponentiate logits, sum logits of each expert, take log, and square</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># code below is equivalent to the following:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># z_loss = torch.exp(router_logits)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># z_loss = torch.sum(z_loss, dim=-1)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># z_loss = torch.log(z_loss) ** 2.0</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>router_z_loss</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>logsumexp</span>(<span class=pl-s1>router_logits</span>, <span class=pl-s1>dim</span><span class=pl-c1>=</span><span class=pl-c1>-</span><span class=pl-c1>1</span>) <span class=pl-c1>**</span> <span class=pl-c1>2.0</span>  <span class=pl-c># [B, C]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-c># sum over all tokens and divide by total number of tokens</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-router_z_loss-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-router_z_loss-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s1>router_z_loss</span> <span class=pl-c1>=</span> <span class=pl-s1>torch</span>.<span class=pl-c1>mean</span>(<span class=pl-s1>router_z_loss</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/2305c8c9ccc6d2c2906ba4577d801ccc/raw/f6bace49819b77106e881f9a80d331e6d6067fd9/router_z_loss.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/2305c8c9ccc6d2c2906ba4577d801ccc#file-router_z_loss-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          router_z_loss.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist136705390" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-router_z_loss-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="router_z_loss.py">
        <tbody><tr>
          <td id="file-router_z_loss-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-router_z_loss-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-router_z_loss-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Computes ST-MoE router z loss (https://arxiv.org/abs/2202.08906)</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-router_z_loss-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">See equation (5) on page 7</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-router_z_loss-py-LC4" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-router_z_loss-py-LC5" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-router_z_loss-py-LC6" class="blob-code blob-code-inner js-file-line"><span class="pl-k">import</span> <span class="pl-s1">torch</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-router_z_loss-py-LC7" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-router_z_loss-py-LC8" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># constants</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-router_z_loss-py-LC9" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">B</span> <span class="pl-c1">=</span> <span class="pl-c1">16</span>     <span class="pl-c"># batch size</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-router_z_loss-py-LC10" class="blob-code blob-code-inner js-file-line"><span class="pl-c1">C</span> <span class="pl-c1">=</span> <span class="pl-c1">256</span>    <span class="pl-c"># sequence length</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-router_z_loss-py-LC11" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span>  <span class="pl-c"># number of experts</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-router_z_loss-py-LC12" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-router_z_loss-py-LC13" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># create input tensor for router z-loss</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-router_z_loss-py-LC14" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">router_logits</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">rand</span>(<span class="pl-c1">B</span>, <span class="pl-c1">C</span>, <span class="pl-s1">n_exp</span>) <span class="pl-c"># [B, C, n_exp]</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-router_z_loss-py-LC15" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-router_z_loss-py-LC16" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># exponentiate logits, sum logits of each expert, take log, and square</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-router_z_loss-py-LC17" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># code below is equivalent to the following:</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-router_z_loss-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># z_loss = torch.exp(router_logits)</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-router_z_loss-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># z_loss = torch.sum(z_loss, dim=-1)</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-router_z_loss-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># z_loss = torch.log(z_loss) ** 2.0</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-router_z_loss-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">router_z_loss</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">logsumexp</span>(<span class="pl-s1">router_logits</span>, <span class="pl-s1">dim</span><span class="pl-c1">=</span><span class="pl-c1">-</span><span class="pl-c1">1</span>) <span class="pl-c1">**</span> <span class="pl-c1">2.0</span>  <span class="pl-c"># [B, C]</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-router_z_loss-py-LC22" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-router_z_loss-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-c"># sum over all tokens and divide by total number of tokens</span></td>
        </tr>
        <tr>
          <td id="file-router_z_loss-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-router_z_loss-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s1">router_z_loss</span> <span class="pl-c1">=</span> <span class="pl-s1">torch</span>.<span class="pl-c1">mean</span>(<span class="pl-s1">router_z_loss</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/2305c8c9ccc6d2c2906ba4577d801ccc/raw/f6bace49819b77106e881f9a80d331e6d6067fd9/router_z_loss.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/2305c8c9ccc6d2c2906ba4577d801ccc#file-router_z_loss-py" class="Link--inTextBlock">
          router_z_loss.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>Combining auxiliary losses.</strong> Given that several auxiliary losses exist, we might wonder which of them we should use in practice. The answer is:<em> all of them</em>! We can just add each of these losses to our <a href="https://cameronrwolfe.substack.com/i/136638774/understanding-next-token-prediction">standard language modeling loss</a> during training. Each auxiliary loss will have a scaling factor by which it is multiplied, then we sum all of the (scaled) losses together; see below. Default scaling factors for load balancing and router z-losses are <code>0.001</code> and <code>0.01</code>, respectively. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oxpH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oxpH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 424w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 848w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 1272w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oxpH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png" width="1456" height="166" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:166,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oxpH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 424w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 848w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 1272w, https://substackcdn.com/image/fetch/$s_!oxpH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726a1e49-0aaa-45dd-a9d0-5386edc2ecc1_2522x288.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Current research.</strong> As we will see, the auxiliary losses that we have learned about in this section work quite well. However, recent research [8] has shown that&#8212;<em>depending upon how the scaling factors are set</em>&#8212;such auxiliary losses might sacrifice model performance for training stability in some cases. As such, the optimal process and strategies for training MoEs is still a (very) active research area. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tdh2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tdh2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 424w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 848w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 1272w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tdh2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png" width="386" height="115.5019305019305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:1036,&quot;resizeWidth&quot;:386,&quot;bytes&quot;:78608,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tdh2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 424w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 848w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 1272w, https://substackcdn.com/image/fetch/$s_!Tdh2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F551dc85c-ee09-412d-b6d2-922a60c8badb_1036x310.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Auxiliary-loss-free load balancing from DeepSeek-v3 [8]</figcaption></figure></div><p>For example, the recently-proposed DeepSeek-v3 [8] model&#8212;<em>the base model used to create the <a href="https://cameronrwolfe.substack.com/p/demystifying-reasoning-models">DeepSeek-R1 reasoning model</a></em>&#8212;uses an auxiliary-loss-free load balancing strategy, which simply adds a dynamic bias to the router output when selecting top-<code>K</code> experts; see above. This bias is increased for experts that are not selected enough and decreased for experts that are selected too much, <em>thus increasing the chance that under-utilized experts will be selected</em>. This dynamic bias is found to improve load balancing without sacrificing model performance. However, load balancing losses are still used in [8] (just with a smaller scaling factor). </p><blockquote><p><em>&#8220;We keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by &#120574; if its corresponding expert is overloaded, and increase it by &#120574; if its corresponding expert is underloaded, where &#120574; is a hyper-parameter called bias update speed.&#8221;</em> - from [8] </p></blockquote><h4>Decoder-Only MoE Implementation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_BFS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_BFS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 424w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 848w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 1272w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_BFS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png" width="1456" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247131,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_BFS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 424w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 848w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 1272w, https://substackcdn.com/image/fetch/$s_!_BFS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51369997-34e1-41d9-a3de-83a3edcde279_2192x912.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">MoE-based Decoder-only Transformer Architecture</figcaption></figure></div><p>We now understand all of the major components of an expert layer. So, let&#8217;s put these concepts together to create a full MoE-based decoder-only architecture. The MoE blocks within this model (shown above) will contain:</p><ul><li><p>A regular (masked) self-attention layer</p></li><li><p>An expert layer&#8212;<em>instead of the normal feed-forward layer&#8212;</em>for every <code>P</code>-th layer of the model.</p></li></ul><p>This block structure is similar to that of a standard, decoder-only transformer, but we replace the feed-forward layer with an expert layer&#8212;<em>forming an MoE block</em>&#8212;in a portion of the model&#8217;s layers. First, let&#8217;s cover a few remaining details regarding how the final output of an expert layer is computed. Then, we will present a full implementation of the MoE-based decoder-only transformer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Udnc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Udnc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 424w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 848w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Udnc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png" width="1456" height="1033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1033,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Udnc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 424w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 848w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!Udnc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe03c2af3-0014-41f8-9c2b-d79bbb75265e_1674x1188.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Computing expert layer output.</strong> Once we have used the routing mechanism to determine the set of active experts for a given token, we can compute the final output for this expert layer as follows:</p><ol><li><p>Send the tokens to their active experts.</p></li><li><p>Compute the output of the active experts for these tokens.</p></li><li><p>Take a weighted average of expert outputs for each token, where the weights are simply the probabilities assigned to each active expert by the router.</p></li></ol><p>This process is depicted for a single token in the figure above. Recent research on MoEs has also introduced the idea of &#8220;shared&#8221; experts, which are always active for all tokens. Shared experts slightly modify the routing logic, but the same core ideas outlined above still apply; see <a href="https://cameronrwolfe.substack.com/i/154340424/computing-the-output-of-an-moe-layer">here</a> for more details on this topic.</p><p>An implementation of a full expert layer is provided below, where we see these ideas applied in PyTorch. On line 49, we get the batches of data for each expert&#8212;<em>and the associated expert probabilities for each token</em>&#8212;from our router. We then pass these batches through our expert feed-forward networks (line 52) to get the output of each expert. Finally, we multiply each expert&#8217;s output by the associated probability in lines 54-58, thus forming the final output of the expert layer. </p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136707311\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-expert_layer-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-expert_layer-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;expert_layer.py\&quot;>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>Based upon ColossalAI OpenMoE</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>MOELayer</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>, </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>n_exp</span> <span class=pl-c1>=</span> <span class=pl-c1>8</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>True</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>capacity_factor</span> <span class=pl-c1>=</span> <span class=pl-c1>1.25</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-c1>0.2</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        n_exp: the number of experts to create in the expert layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        top_k: the number of active experts for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        use_noisy_top_k: whether to add noise when computing expert output</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        capacity_factor: used to compute expert capacity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>router</span> <span class=pl-c1>=</span> <span class=pl-en>Router</span>(  <span class=pl-c># (noisy) top k router</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>d</span><span class=pl-c1>=</span><span class=pl-s1>d</span>, </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>n_exp</span><span class=pl-c1>=</span><span class=pl-s1>n_exp</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>top_k</span><span class=pl-c1>=</span><span class=pl-s1>top_k</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>use_noisy_top_k</span><span class=pl-c1>=</span><span class=pl-s1>use_noisy_top_k</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>capacity_factor</span><span class=pl-c1>=</span><span class=pl-s1>capacity_factor</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>experts</span> <span class=pl-c1>=</span> <span class=pl-en>MLPExperts</span>(  <span class=pl-c># group of MLPs (experts)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>d</span><span class=pl-c1>=</span><span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>n_exp</span><span class=pl-c1>=</span><span class=pl-s1>n_exp</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>bias</span><span class=pl-c1>=</span><span class=pl-s1>bias</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>dropout</span><span class=pl-c1>=</span><span class=pl-s1>dropout</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>: <span class=pl-s1>torch</span>.<span class=pl-c1>Tensor</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>B</span>, <span class=pl-c1>C</span>, <span class=pl-s1>d</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span>.<span class=pl-c1>size</span>() <span class=pl-c># track original shape of input</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>num_tokens</span> <span class=pl-c1>=</span> (<span class=pl-c1>B</span> <span class=pl-c1>*</span> <span class=pl-c1>C</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L48\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;48\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC48\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># pass each token through the router</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L49\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;49\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC49\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_weight</span>, <span class=pl-s1>exp_mask</span>, <span class=pl-s1>exp_batches</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>router</span>(<span class=pl-s1>x</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L50\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;50\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC50\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L51\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;51\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC51\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># compute expert output</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L52\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;52\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC52\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_out</span> <span class=pl-c1>=</span> <span class=pl-s1>self</span>.<span class=pl-c1>experts</span>(<span class=pl-s1>exp_batches</span>) <span class=pl-c># [n_exp, exp_capacity, d]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L53\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;53\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC53\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L54\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;54\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC54\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># aggregate expert outputs based on router weights</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L55\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;55\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC55\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># eq (2) on page 4 of ST-MoE (https://arxiv.org/abs/2202.08906)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L56\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;56\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC56\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_weight</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_weight</span>.<span class=pl-c1>view</span>(<span class=pl-s1>num_tokens</span>, <span class=pl-c1>-</span><span class=pl-c1>1</span>) <span class=pl-c># [B * C, n_exp * exp_capacity]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L57\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;57\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC57\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>exp_out</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_out</span>.<span class=pl-c1>view</span>(<span class=pl-c1>-</span><span class=pl-c1>1</span>, <span class=pl-s1>d</span>) <span class=pl-c># [n_exp * exp_capacity, d] </span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L58\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;58\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC58\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>output</span> <span class=pl-c1>=</span> <span class=pl-s1>exp_weight</span> @ <span class=pl-s1>exp_out</span> <span class=pl-c># [B * C, d]</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L59\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;59\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC59\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L60\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;60\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC60\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c># resize output before return</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-expert_layer-py-L61\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;61\&quot;></td>\n          <td id=\&quot;file-expert_layer-py-LC61\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>output</span>.<span class=pl-c1>view</span>(<span class=pl-c1>B</span>, <span class=pl-c1>T</span>, <span class=pl-s1>d</span>)</td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/67851367036bf1cb4e0524607bc90c91/raw/d215df81ecc2d3a3a42204f962cebba6a332e616/expert_layer.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/67851367036bf1cb4e0524607bc90c91#file-expert_layer-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          expert_layer.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist136707311" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-expert_layer-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="expert_layer.py">
        <tbody><tr>
          <td id="file-expert_layer-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-expert_layer-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-expert_layer-py-LC2" class="blob-code blob-code-inner js-file-line"><span class="pl-s">Based upon ColossalAI OpenMoE</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-expert_layer-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-expert_layer-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-expert_layer-py-LC5" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-expert_layer-py-LC6" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-expert_layer-py-LC7" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">MOELayer</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-expert_layer-py-LC8" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-expert_layer-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-expert_layer-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>, </td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-expert_layer-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">n_exp</span> <span class="pl-c1">=</span> <span class="pl-c1">8</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-expert_layer-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">2</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-expert_layer-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-expert_layer-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">capacity_factor</span> <span class="pl-c1">=</span> <span class="pl-c1">1.25</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-expert_layer-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-expert_layer-py-LC16" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-c1">0.2</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-expert_layer-py-LC17" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-expert_layer-py-LC18" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-expert_layer-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-expert_layer-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-expert_layer-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        n_exp: the number of experts to create in the expert layer</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-expert_layer-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        top_k: the number of active experts for each token</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-expert_layer-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        use_noisy_top_k: whether to add noise when computing expert output</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-expert_layer-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        capacity_factor: used to compute expert capacity</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-expert_layer-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-expert_layer-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-expert_layer-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-expert_layer-py-LC28" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-expert_layer-py-LC29" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-expert_layer-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">router</span> <span class="pl-c1">=</span> <span class="pl-en">Router</span>(  <span class="pl-c"># (noisy) top k router</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-expert_layer-py-LC31" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">d</span><span class="pl-c1">=</span><span class="pl-s1">d</span>, </td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-expert_layer-py-LC32" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">n_exp</span><span class="pl-c1">=</span><span class="pl-s1">n_exp</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-expert_layer-py-LC33" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">top_k</span><span class="pl-c1">=</span><span class="pl-s1">top_k</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-expert_layer-py-LC34" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">use_noisy_top_k</span><span class="pl-c1">=</span><span class="pl-s1">use_noisy_top_k</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-expert_layer-py-LC35" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">capacity_factor</span><span class="pl-c1">=</span><span class="pl-s1">capacity_factor</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-expert_layer-py-LC36" class="blob-code blob-code-inner js-file-line">        )</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-expert_layer-py-LC37" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">experts</span> <span class="pl-c1">=</span> <span class="pl-en">MLPExperts</span>(  <span class="pl-c"># group of MLPs (experts)</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-expert_layer-py-LC38" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">d</span><span class="pl-c1">=</span><span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-expert_layer-py-LC39" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">n_exp</span><span class="pl-c1">=</span><span class="pl-s1">n_exp</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-expert_layer-py-LC40" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">bias</span><span class="pl-c1">=</span><span class="pl-s1">bias</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-expert_layer-py-LC41" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">dropout</span><span class="pl-c1">=</span><span class="pl-s1">dropout</span>,</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-expert_layer-py-LC42" class="blob-code blob-code-inner js-file-line">        )</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-expert_layer-py-LC43" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-expert_layer-py-LC44" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>: <span class="pl-s1">torch</span>.<span class="pl-c1">Tensor</span>):</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-expert_layer-py-LC45" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">B</span>, <span class="pl-c1">C</span>, <span class="pl-s1">d</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span>.<span class="pl-c1">size</span>() <span class="pl-c"># track original shape of input</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-expert_layer-py-LC46" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">num_tokens</span> <span class="pl-c1">=</span> (<span class="pl-c1">B</span> <span class="pl-c1">*</span> <span class="pl-c1">C</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-expert_layer-py-LC47" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L48" class="blob-num js-line-number js-blob-rnum" data-line-number="48"></td>
          <td id="file-expert_layer-py-LC48" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># pass each token through the router</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L49" class="blob-num js-line-number js-blob-rnum" data-line-number="49"></td>
          <td id="file-expert_layer-py-LC49" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_weight</span>, <span class="pl-s1">exp_mask</span>, <span class="pl-s1">exp_batches</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">router</span>(<span class="pl-s1">x</span>)</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L50" class="blob-num js-line-number js-blob-rnum" data-line-number="50"></td>
          <td id="file-expert_layer-py-LC50" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L51" class="blob-num js-line-number js-blob-rnum" data-line-number="51"></td>
          <td id="file-expert_layer-py-LC51" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># compute expert output</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L52" class="blob-num js-line-number js-blob-rnum" data-line-number="52"></td>
          <td id="file-expert_layer-py-LC52" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_out</span> <span class="pl-c1">=</span> <span class="pl-s1">self</span>.<span class="pl-c1">experts</span>(<span class="pl-s1">exp_batches</span>) <span class="pl-c"># [n_exp, exp_capacity, d]</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L53" class="blob-num js-line-number js-blob-rnum" data-line-number="53"></td>
          <td id="file-expert_layer-py-LC53" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L54" class="blob-num js-line-number js-blob-rnum" data-line-number="54"></td>
          <td id="file-expert_layer-py-LC54" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># aggregate expert outputs based on router weights</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L55" class="blob-num js-line-number js-blob-rnum" data-line-number="55"></td>
          <td id="file-expert_layer-py-LC55" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># eq (2) on page 4 of ST-MoE (https://arxiv.org/abs/2202.08906)</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L56" class="blob-num js-line-number js-blob-rnum" data-line-number="56"></td>
          <td id="file-expert_layer-py-LC56" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_weight</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_weight</span>.<span class="pl-c1">view</span>(<span class="pl-s1">num_tokens</span>, <span class="pl-c1">-</span><span class="pl-c1">1</span>) <span class="pl-c"># [B * C, n_exp * exp_capacity]</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L57" class="blob-num js-line-number js-blob-rnum" data-line-number="57"></td>
          <td id="file-expert_layer-py-LC57" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">exp_out</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_out</span>.<span class="pl-c1">view</span>(<span class="pl-c1">-</span><span class="pl-c1">1</span>, <span class="pl-s1">d</span>) <span class="pl-c"># [n_exp * exp_capacity, d] </span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L58" class="blob-num js-line-number js-blob-rnum" data-line-number="58"></td>
          <td id="file-expert_layer-py-LC58" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">output</span> <span class="pl-c1">=</span> <span class="pl-s1">exp_weight</span> @ <span class="pl-s1">exp_out</span> <span class="pl-c"># [B * C, d]</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L59" class="blob-num js-line-number js-blob-rnum" data-line-number="59"></td>
          <td id="file-expert_layer-py-LC59" class="blob-code blob-code-inner js-file-line">        </td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L60" class="blob-num js-line-number js-blob-rnum" data-line-number="60"></td>
          <td id="file-expert_layer-py-LC60" class="blob-code blob-code-inner js-file-line">        <span class="pl-c"># resize output before return</span></td>
        </tr>
        <tr>
          <td id="file-expert_layer-py-L61" class="blob-num js-line-number js-blob-rnum" data-line-number="61"></td>
          <td id="file-expert_layer-py-LC61" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">output</span>.<span class="pl-c1">view</span>(<span class="pl-c1">B</span>, <span class="pl-c1">T</span>, <span class="pl-s1">d</span>)</td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/67851367036bf1cb4e0524607bc90c91/raw/d215df81ecc2d3a3a42204f962cebba6a332e616/expert_layer.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/67851367036bf1cb4e0524607bc90c91#file-expert_layer-py" class="Link--inTextBlock">
          expert_layer.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p><strong>MoE in PyTorch.</strong> Now, we can modify the decoder-only transformer block to optionally use an expert layer in place of the usual feed-forward layer. This is accomplished in the code below, where we do a drop-in replacement of our <code>MLP</code> module with the new <code>MoELayer</code>, forming an <code>MoEBlock</code>.</p><div class="github-gist" data-attrs="{&quot;innerHTML&quot;:&quot;<div id=\&quot;gist136708058\&quot; class=\&quot;gist\&quot;>\n    <div class=\&quot;gist-file\&quot; translate=\&quot;no\&quot; data-color-mode=\&quot;light\&quot; data-light-theme=\&quot;light\&quot;>\n      <div class=\&quot;gist-data\&quot;>\n        <div class=\&quot;js-gist-file-update-container js-task-list-container\&quot;>\n  <div id=\&quot;file-moe_block-py\&quot; class=\&quot;file my-2\&quot;>\n    \n    <div itemprop=\&quot;text\&quot;\n      class=\&quot;Box-body p-0 blob-wrapper data type-python  \&quot;\n      style=\&quot;overflow: auto\&quot; tabindex=\&quot;0\&quot; role=\&quot;region\&quot;\n      aria-label=\&quot;file-moe_block-py\&quot;\n    >\n\n        \n<div class=\&quot;js-check-bidi js-blob-code-container blob-code-content\&quot;>\n\n  <template class=\&quot;js-file-alert-template\&quot;>\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash flash-warn flash-full d-flex flex-items-center\&quot;>\n  <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n    <span>\n      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.\n      <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.co/hiddenchars\&quot; target=\&quot;_blank\&quot;>Learn more about bidirectional Unicode characters</a>\n    </span>\n\n\n  <div data-view-component=\&quot;true\&quot; class=\&quot;flash-action\&quot;>        <a href=\&quot;{{ revealButtonHref }}\&quot; data-view-component=\&quot;true\&quot; class=\&quot;btn-sm btn\&quot;>    Show hidden characters\n</a>\n</div>\n</div></template>\n<template class=\&quot;js-line-alert-template\&quot;>\n  <span aria-label=\&quot;This line has hidden Unicode characters\&quot; data-view-component=\&quot;true\&quot; class=\&quot;line-alert tooltipped tooltipped-e\&quot;>\n    <svg aria-hidden=\&quot;true\&quot; height=\&quot;16\&quot; viewBox=\&quot;0 0 16 16\&quot; version=\&quot;1.1\&quot; width=\&quot;16\&quot; data-view-component=\&quot;true\&quot; class=\&quot;octicon octicon-alert\&quot;>\n    <path d=\&quot;M6.457 1.047c.659-1.234 2.427-1.234 3.086 0l6.082 11.378A1.75 1.75 0 0 1 14.082 15H1.918a1.75 1.75 0 0 1-1.543-2.575Zm1.763.707a.25.25 0 0 0-.44 0L1.698 13.132a.25.25 0 0 0 .22.368h12.164a.25.25 0 0 0 .22-.368Zm.53 3.996v2.5a.75.75 0 0 1-1.5 0v-2.5a.75.75 0 0 1 1.5 0ZM9 11a1 1 0 1 1-2 0 1 1 0 0 1 2 0Z\&quot;></path>\n</svg>\n</span></template>\n\n  <table data-hpc class=\&quot;highlight tab-size js-file-line-container\&quot; data-tab-size=\&quot;8\&quot; data-paste-markdown-skip data-tagsearch-path=\&quot;moe_block.py\&quot;>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L1\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;1\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC1\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>from</span> <span class=pl-s1>torch</span> <span class=pl-k>import</span> <span class=pl-s1>nn</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L2\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;2\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC2\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L3\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;3\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC3\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-k>class</span> <span class=pl-v>MoEBlock</span>(<span class=pl-s1>nn</span>.<span class=pl-c1>Module</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L4\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;4\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC4\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L5\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;5\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC5\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>__init__</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L6\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;6\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC6\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L7\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;7\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC7\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L8\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;8\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC8\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>H</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L9\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;9\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC9\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-c1>C</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L10\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;10\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC10\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>n_exp</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L11\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;11\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC11\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>top_k</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L12\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;12\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC12\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>use_noisy_top_k</span> <span class=pl-c1>=</span> <span class=pl-c1>True</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L13\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;13\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC13\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>capacity_factor</span> <span class=pl-c1>=</span> <span class=pl-c1>1.25</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L14\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;14\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC14\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>bias</span> <span class=pl-c1>=</span> <span class=pl-c1>False</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L15\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;15\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC15\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>dropout</span> <span class=pl-c1>=</span> <span class=pl-c1>0.2</span>,   </td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L16\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;16\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC16\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    ):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L17\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;17\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC17\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s>&amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L18\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;18\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC18\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        Arguments:</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L19\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;19\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC19\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        d: size of embedding dimension</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L20\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;20\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC20\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        H: number of attention heads</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L21\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;21\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC21\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        C: maximum length of input sequences (in tokens)</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L22\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;22\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC22\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        n_exp: the number of experts to create in the expert layer</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L23\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;23\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC23\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        top_k: the number of active experts for each token</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L24\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;24\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC24\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        use_noisy_top_k: whether to add noise when computing expert output</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L25\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;25\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC25\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        capacity_factor: used to compute expert capacity</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L26\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;26\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC26\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        bias: whether or not to use bias in linear layers</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L27\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;27\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC27\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        dropout: probability of dropout</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L28\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;28\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC28\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;><span class=pl-s>        &amp;quot;&amp;quot;&amp;quot;</span></td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L29\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;29\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC29\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L30\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;30\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC30\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-en>super</span>().<span class=pl-c1>__init__</span>()</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L31\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;31\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC31\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>ln_1</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>LayerNorm</span>(<span class=pl-s1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L32\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;32\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC32\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>attn</span> <span class=pl-c1>=</span> <span class=pl-en>CausalSelfAttention</span>(<span class=pl-s1>d</span>, <span class=pl-c1>H</span>, <span class=pl-c1>T</span>, <span class=pl-s1>bias</span>, <span class=pl-s1>dropout</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L33\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;33\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC33\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>ln_2</span> <span class=pl-c1>=</span> <span class=pl-s1>nn</span>.<span class=pl-c1>LayerNorm</span>(<span class=pl-s1>d</span>)</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L34\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;34\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC34\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>self</span>.<span class=pl-c1>mlp</span> <span class=pl-c1>=</span> <span class=pl-en>MOELayer</span>(</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L35\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;35\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC35\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>d</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L36\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;36\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC36\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>n_exp</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L37\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;37\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC37\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>top_k</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L38\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;38\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC38\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>use_noisy_top_k</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L39\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;39\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC39\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>capacity_factor</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L40\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;40\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC40\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>bias</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L41\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;41\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC41\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>            <span class=pl-s1>dropout</span>,</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L42\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;42\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC42\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        )</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L43\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;43\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC43\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>\n</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L44\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;44\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC44\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>    <span class=pl-k>def</span> <span class=pl-en>forward</span>(<span class=pl-s1>self</span>, <span class=pl-s1>x</span>):</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L45\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;45\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC45\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span> <span class=pl-c1>+</span> <span class=pl-s1>self</span>.<span class=pl-c1>attn</span>(<span class=pl-s1>self</span>.<span class=pl-c1>ln_1</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L46\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;46\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC46\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-s1>x</span> <span class=pl-c1>=</span> <span class=pl-s1>x</span> <span class=pl-c1>+</span> <span class=pl-s1>self</span>.<span class=pl-c1>mlp</span>(<span class=pl-s1>self</span>.<span class=pl-c1>ln_2</span>(<span class=pl-s1>x</span>))</td>\n        </tr>\n        <tr>\n          <td id=\&quot;file-moe_block-py-L47\&quot; class=\&quot;blob-num js-line-number js-blob-rnum\&quot; data-line-number=\&quot;47\&quot;></td>\n          <td id=\&quot;file-moe_block-py-LC47\&quot; class=\&quot;blob-code blob-code-inner js-file-line\&quot;>        <span class=pl-k>return</span> <span class=pl-s1>x</span></td>\n        </tr>\n  </table>\n</div>\n\n\n    </div>\n\n  </div>\n</div>\n\n      </div>\n      <div class=\&quot;gist-meta\&quot;>\n        <a href=\&quot;https://gist.github.com/wolfecameron/01537359d71ccc2efadf0411ec8991f6/raw/868f716b3cc8a6f99b758fae0167b11a85062f64/moe_block.py\&quot; style=\&quot;float:right\&quot; class=\&quot;Link--inTextBlock\&quot;>view raw</a>\n        <a href=\&quot;https://gist.github.com/wolfecameron/01537359d71ccc2efadf0411ec8991f6#file-moe_block-py\&quot; class=\&quot;Link--inTextBlock\&quot;>\n          moe_block.py\n        </a>\n        hosted with &amp;#10084; by <a class=\&quot;Link--inTextBlock\&quot; href=\&quot;https://github.com\&quot;>GitHub</a>\n      </div>\n    </div>\n</div>\n&quot;,&quot;stylesheet&quot;:&quot;https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css&quot;}" data-component-name="GitgistToDOM"><link rel="stylesheet" href="https://github.githubassets.com/assets/gist-embed-7b7a1d3fd6f6.css"><div id="gist136708058" class="gist">
    <div class="gist-file" data-color-mode="light" data-light-theme="light">
      <div class="gist-data">
        <div class="js-gist-file-update-container js-task-list-container">
  <div id="file-moe_block-py" class="file my-2">
    
    <div itemprop="text" class="Box-body p-0 blob-wrapper data type-python  " style="overflow:auto">

        
<div class="js-check-bidi js-blob-code-container blob-code-content">

  
  <div data-view-component="true" class="flash flash-warn flash-full d-flex flex-items-center">
  
    

    <span>
      This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      <a class="Link--inTextBlock" href="https://github.co/hiddenchars" target="_blank">Learn more about bidirectional Unicode characters</a>
    </span>


  <div data-view-component="true" class="flash-action">        <a href="{{ revealButtonHref }}" data-view-component="true" class="btn-sm btn">    Show hidden characters
</a>
</div>
</div>

  <span data-view-component="true" class="line-alert tooltipped tooltipped-e">
    
    

</span>

  <table data-hpc="" class="highlight tab-size js-file-line-container" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-path="moe_block.py">
        <tbody><tr>
          <td id="file-moe_block-py-L1" class="blob-num js-line-number js-blob-rnum" data-line-number="1"></td>
          <td id="file-moe_block-py-LC1" class="blob-code blob-code-inner js-file-line"><span class="pl-k">from</span> <span class="pl-s1">torch</span> <span class="pl-k">import</span> <span class="pl-s1">nn</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L2" class="blob-num js-line-number js-blob-rnum" data-line-number="2"></td>
          <td id="file-moe_block-py-LC2" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L3" class="blob-num js-line-number js-blob-rnum" data-line-number="3"></td>
          <td id="file-moe_block-py-LC3" class="blob-code blob-code-inner js-file-line"><span class="pl-k">class</span> <span class="pl-v">MoEBlock</span>(<span class="pl-s1">nn</span>.<span class="pl-c1">Module</span>):</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L4" class="blob-num js-line-number js-blob-rnum" data-line-number="4"></td>
          <td id="file-moe_block-py-LC4" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L5" class="blob-num js-line-number js-blob-rnum" data-line-number="5"></td>
          <td id="file-moe_block-py-LC5" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">__init__</span>(</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L6" class="blob-num js-line-number js-blob-rnum" data-line-number="6"></td>
          <td id="file-moe_block-py-LC6" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L7" class="blob-num js-line-number js-blob-rnum" data-line-number="7"></td>
          <td id="file-moe_block-py-LC7" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L8" class="blob-num js-line-number js-blob-rnum" data-line-number="8"></td>
          <td id="file-moe_block-py-LC8" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">H</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L9" class="blob-num js-line-number js-blob-rnum" data-line-number="9"></td>
          <td id="file-moe_block-py-LC9" class="blob-code blob-code-inner js-file-line">        <span class="pl-c1">C</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L10" class="blob-num js-line-number js-blob-rnum" data-line-number="10"></td>
          <td id="file-moe_block-py-LC10" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">n_exp</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L11" class="blob-num js-line-number js-blob-rnum" data-line-number="11"></td>
          <td id="file-moe_block-py-LC11" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">top_k</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L12" class="blob-num js-line-number js-blob-rnum" data-line-number="12"></td>
          <td id="file-moe_block-py-LC12" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">use_noisy_top_k</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L13" class="blob-num js-line-number js-blob-rnum" data-line-number="13"></td>
          <td id="file-moe_block-py-LC13" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">capacity_factor</span> <span class="pl-c1">=</span> <span class="pl-c1">1.25</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L14" class="blob-num js-line-number js-blob-rnum" data-line-number="14"></td>
          <td id="file-moe_block-py-LC14" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">bias</span> <span class="pl-c1">=</span> <span class="pl-c1">False</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L15" class="blob-num js-line-number js-blob-rnum" data-line-number="15"></td>
          <td id="file-moe_block-py-LC15" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">dropout</span> <span class="pl-c1">=</span> <span class="pl-c1">0.2</span>,   </td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L16" class="blob-num js-line-number js-blob-rnum" data-line-number="16"></td>
          <td id="file-moe_block-py-LC16" class="blob-code blob-code-inner js-file-line">    ):</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L17" class="blob-num js-line-number js-blob-rnum" data-line-number="17"></td>
          <td id="file-moe_block-py-LC17" class="blob-code blob-code-inner js-file-line">        <span class="pl-s">"""</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L18" class="blob-num js-line-number js-blob-rnum" data-line-number="18"></td>
          <td id="file-moe_block-py-LC18" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        Arguments:</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L19" class="blob-num js-line-number js-blob-rnum" data-line-number="19"></td>
          <td id="file-moe_block-py-LC19" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        d: size of embedding dimension</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L20" class="blob-num js-line-number js-blob-rnum" data-line-number="20"></td>
          <td id="file-moe_block-py-LC20" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        H: number of attention heads</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L21" class="blob-num js-line-number js-blob-rnum" data-line-number="21"></td>
          <td id="file-moe_block-py-LC21" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        C: maximum length of input sequences (in tokens)</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L22" class="blob-num js-line-number js-blob-rnum" data-line-number="22"></td>
          <td id="file-moe_block-py-LC22" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        n_exp: the number of experts to create in the expert layer</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L23" class="blob-num js-line-number js-blob-rnum" data-line-number="23"></td>
          <td id="file-moe_block-py-LC23" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        top_k: the number of active experts for each token</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L24" class="blob-num js-line-number js-blob-rnum" data-line-number="24"></td>
          <td id="file-moe_block-py-LC24" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        use_noisy_top_k: whether to add noise when computing expert output</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L25" class="blob-num js-line-number js-blob-rnum" data-line-number="25"></td>
          <td id="file-moe_block-py-LC25" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        capacity_factor: used to compute expert capacity</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L26" class="blob-num js-line-number js-blob-rnum" data-line-number="26"></td>
          <td id="file-moe_block-py-LC26" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        bias: whether or not to use bias in linear layers</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L27" class="blob-num js-line-number js-blob-rnum" data-line-number="27"></td>
          <td id="file-moe_block-py-LC27" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        dropout: probability of dropout</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L28" class="blob-num js-line-number js-blob-rnum" data-line-number="28"></td>
          <td id="file-moe_block-py-LC28" class="blob-code blob-code-inner js-file-line"><span class="pl-s">        """</span></td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L29" class="blob-num js-line-number js-blob-rnum" data-line-number="29"></td>
          <td id="file-moe_block-py-LC29" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L30" class="blob-num js-line-number js-blob-rnum" data-line-number="30"></td>
          <td id="file-moe_block-py-LC30" class="blob-code blob-code-inner js-file-line">        <span class="pl-en">super</span>().<span class="pl-c1">__init__</span>()</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L31" class="blob-num js-line-number js-blob-rnum" data-line-number="31"></td>
          <td id="file-moe_block-py-LC31" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">ln_1</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">LayerNorm</span>(<span class="pl-s1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L32" class="blob-num js-line-number js-blob-rnum" data-line-number="32"></td>
          <td id="file-moe_block-py-LC32" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">attn</span> <span class="pl-c1">=</span> <span class="pl-en">CausalSelfAttention</span>(<span class="pl-s1">d</span>, <span class="pl-c1">H</span>, <span class="pl-c1">T</span>, <span class="pl-s1">bias</span>, <span class="pl-s1">dropout</span>)</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L33" class="blob-num js-line-number js-blob-rnum" data-line-number="33"></td>
          <td id="file-moe_block-py-LC33" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">ln_2</span> <span class="pl-c1">=</span> <span class="pl-s1">nn</span>.<span class="pl-c1">LayerNorm</span>(<span class="pl-s1">d</span>)</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L34" class="blob-num js-line-number js-blob-rnum" data-line-number="34"></td>
          <td id="file-moe_block-py-LC34" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">self</span>.<span class="pl-c1">mlp</span> <span class="pl-c1">=</span> <span class="pl-en">MOELayer</span>(</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L35" class="blob-num js-line-number js-blob-rnum" data-line-number="35"></td>
          <td id="file-moe_block-py-LC35" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">d</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L36" class="blob-num js-line-number js-blob-rnum" data-line-number="36"></td>
          <td id="file-moe_block-py-LC36" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">n_exp</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L37" class="blob-num js-line-number js-blob-rnum" data-line-number="37"></td>
          <td id="file-moe_block-py-LC37" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">top_k</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L38" class="blob-num js-line-number js-blob-rnum" data-line-number="38"></td>
          <td id="file-moe_block-py-LC38" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">use_noisy_top_k</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L39" class="blob-num js-line-number js-blob-rnum" data-line-number="39"></td>
          <td id="file-moe_block-py-LC39" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">capacity_factor</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L40" class="blob-num js-line-number js-blob-rnum" data-line-number="40"></td>
          <td id="file-moe_block-py-LC40" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">bias</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L41" class="blob-num js-line-number js-blob-rnum" data-line-number="41"></td>
          <td id="file-moe_block-py-LC41" class="blob-code blob-code-inner js-file-line">            <span class="pl-s1">dropout</span>,</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L42" class="blob-num js-line-number js-blob-rnum" data-line-number="42"></td>
          <td id="file-moe_block-py-LC42" class="blob-code blob-code-inner js-file-line">        )</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L43" class="blob-num js-line-number js-blob-rnum" data-line-number="43"></td>
          <td id="file-moe_block-py-LC43" class="blob-code blob-code-inner js-file-line">
</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L44" class="blob-num js-line-number js-blob-rnum" data-line-number="44"></td>
          <td id="file-moe_block-py-LC44" class="blob-code blob-code-inner js-file-line">    <span class="pl-k">def</span> <span class="pl-en">forward</span>(<span class="pl-s1">self</span>, <span class="pl-s1">x</span>):</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L45" class="blob-num js-line-number js-blob-rnum" data-line-number="45"></td>
          <td id="file-moe_block-py-LC45" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span> <span class="pl-c1">+</span> <span class="pl-s1">self</span>.<span class="pl-c1">attn</span>(<span class="pl-s1">self</span>.<span class="pl-c1">ln_1</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L46" class="blob-num js-line-number js-blob-rnum" data-line-number="46"></td>
          <td id="file-moe_block-py-LC46" class="blob-code blob-code-inner js-file-line">        <span class="pl-s1">x</span> <span class="pl-c1">=</span> <span class="pl-s1">x</span> <span class="pl-c1">+</span> <span class="pl-s1">self</span>.<span class="pl-c1">mlp</span>(<span class="pl-s1">self</span>.<span class="pl-c1">ln_2</span>(<span class="pl-s1">x</span>))</td>
        </tr>
        <tr>
          <td id="file-moe_block-py-L47" class="blob-num js-line-number js-blob-rnum" data-line-number="47"></td>
          <td id="file-moe_block-py-LC47" class="blob-code blob-code-inner js-file-line">        <span class="pl-k">return</span> <span class="pl-s1">x</span></td>
        </tr>
  </tbody></table>
</div>


    </div>

  </div>
</div>

      </div>
      <div class="gist-meta">
        <a href="https://gist.github.com/wolfecameron/01537359d71ccc2efadf0411ec8991f6/raw/868f716b3cc8a6f99b758fae0167b11a85062f64/moe_block.py" style="float:right" class="Link--inTextBlock">view raw</a>
        <a href="https://gist.github.com/wolfecameron/01537359d71ccc2efadf0411ec8991f6#file-moe_block-py" class="Link--inTextBlock">
          moe_block.py
        </a>
        hosted with &#10084; by <a class="Link--inTextBlock" href="https://github.com">GitHub</a>
      </div>
    </div>
</div>
</div><p>From here, the final implementation of our MoE architecture exactly matches the decoder-only transformer (<code>GPT</code>) implementation from before. The only change is that we replace every <code>P</code>-th <code>Block</code> with an <code>MoEBlock</code>. We will avoid explicitly writing out this implementation here, as the code is identical to the <code>GPT</code> model defined before, aside from the addition of interleaved MoE blocks.</p><h2>Pretraining nanoMoE from Scratch</h2><p>Now that we understanding how MoEs work, let&#8217;s pretrain an LLM from scratch using this architecture. A full implementation of an MoE-based LLM is present in the repository below. This implementation&#8212;<em>called nanoMoE</em>&#8212;is based upon <a href="https://karpathy.ai/">Andrej Karpathy</a>&#8217;s <a href="https://github.com/karpathy/nanoGPT">nanoGPT</a> repository. However, the original GPT architecture has been modified to use an MoE-based decoder-only transformer architecture.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://github.com/wolfecameron/nanoMoE&quot;,&quot;text&quot;:&quot;nanoMoE Repo&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://github.com/wolfecameron/nanoMoE"><span>nanoMoE Repo</span></a></p><p>The nanoMoE repository reuses code for all of the MoE components that we have seen so far in this post. The key components of this implementation are:</p><ul><li><p><em>Model implementation</em>: see the <code>GPT</code> model definition, where the ability to construct an MoE model has been added. [<a href="https://github.com/wolfecameron/nanoMoE/blob/master/model.py">link</a>]</p></li><li><p><em>Training</em>: all training code is present in a single file and has not been meaningfully modified from the original nanoGPT code. [<a href="https://github.com/wolfecameron/nanoMoE/blob/master/train.py">link</a>]</p></li><li><p><em>Dataset</em>: nanoMoE is pretrained on a 25B token subset<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> of the OpenWebText dataset (same as nanoGPT but with fewer tokens). [<a href="https://github.com/wolfecameron/nanoMoE/tree/master/data/openwebtext">link</a>]</p></li><li><p><em>Configuration</em>: the final training configuration used to pretrain nanoMoE, which we will explain in the next section, can be found <a href="https://github.com/wolfecameron/nanoMoE/blob/master/config/train_nano_moe.py">here</a>.</p></li></ul><p>In this section, we will further outline the best practices that were discovered for successfully pretraining nanoMoE, go over the results of pretraining, and outline the optimal pretraining setup that was discovered for this mid-size MoE model.</p><h4>Best Practices for Training MoEs</h4><blockquote><p><em>&#8220;Despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability.&#8221;</em> - from [6]</p></blockquote><p>Although MoEs were <a href="https://cameronrwolfe.substack.com/i/142423094/origins-of-the-mixture-of-experts">proposed a long time ago</a>, their popularity has increased drastically for LLM research only recently. For years, the main impediment to the adoption of MoEs was their difficulty of use. Relative to dense models, MoEs are more complex and generally prone to instability during training.</p><p><strong>Why are MoEs unstable?</strong> As we have seen, MoE-based LLMs only make slight modifications to the decoder-only transformer architecture. With this in mind, we might wonder: <em>What exactly in the MoE architecture causes difficulty during training?</em> <em>Why is the training of an MoE less stable compared to a standard LLM?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!efMH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png" width="486" height="285.21459854014597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1370,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:114402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!efMH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 424w, https://substackcdn.com/image/fetch/$s_!efMH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 848w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1272w, https://substackcdn.com/image/fetch/$s_!efMH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F213eacf6-6f4c-48ac-9fec-b81a24580b4b_1370x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Divergence during nanoMoE pretraining</figcaption></figure></div><p>There are two main issues that occur when training an MoE:</p><ol><li><p><em>Routing collapse</em>: the model converges to utilizing the same expert(s) over and over again.</p></li><li><p><em>Numerical instability</em>: the MoE may experience <a href="https://en.wikipedia.org/wiki/Round-off_error">round-off</a> errors, especially in the router (i.e., due to its use of exponentials in the softmax)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>.</p></li></ol><p>These issues lead to training instability, meaning that the model&#8217;s loss may simply diverge during the training process; see above for a concrete example from training nanoMoE. When this happens, we need to stop the training process and restart from a saved checkpoint, which is time consuming and inefficient (i.e., lots of idle GPU time!). Ideally, <em>we want a stable training process that avoids these instabilities</em>. So, let&#8217;s cover best practices for improving MoE training stability. </p><p><strong>Auxiliary losses.</strong> As discussed previously, we do not have to choose between auxiliary losses when training an MoE. Instead, we can just combine multiple auxiliary losses into a single loss function. In the case of nanoMoE, we use both the standard auxiliary load balancing loss and the router z-loss during training. Using the correct auxiliary losses improves training stability by enabling uniform usage of experts and avoiding routing collapse during training. </p><p><strong>Training precision.</strong> When training an LLM, it usually makes sense to use mixed precision training, which converts some components of the model to run in a lower <code>float16</code> or <code>bfloat16</code> precision format instead of full <code>float32</code> precision. This functionality is supported automatically in PyTorch via the <a href="https://pytorch.org/docs/stable/amp.html">automatic mixed precision (AMP) module</a> and can significantly reduce training costs without deteriorating model performance. In other words, this is a &#8220;free&#8221; pretraining speedup that we can easily enable with minimal code changes. </p><blockquote><p><em>&#8220;Compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.&#8221;</em> - from [8]</p></blockquote><p>Mixed precision has been used for some time, but researchers have more recently explored methods for reducing LLM training precision even further&#8212;<em>lower than 16-bits</em>. For example, DeepSeek-v3 [8] is trained using 8-bit precision. However, maintaining the same level of model quality becomes more difficult as training precision is reduced. Implementing large-scale LLM training with <code>FP8</code> precision requires novel and complex quantization techniques. Otherwise, training an LLM at such low precision may negatively impact the model&#8217;s performance. </p><pre><code>with torch.amp.autocast(device_type='cuda', enabled=False):
    # AMP is disabled for code in this block!
    &lt;router code goes here&gt;</code></pre><p><em>Why is this relevant to MoEs?</em> As we mentioned before, the routing mechanism within an MoE is prone to numerical instability. Computing the router&#8217;s output in lower precision makes this problem even worse! This issue is explicitly outlined in [6], where authors find that low precision training leads to large round-off errors in the router. To solve this issue, we must run the router in full (<code>float32</code>) precision even when training with AMP, which can be achieved by simply disabling AMP in the MoE&#8217;s routing mechanism; see above. </p><p><strong>Weight initialization.</strong> Traditionally, one of the biggest factors for stable training of large neural networks has been using the correct weight initialization strategy; e.g., <a href="https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf">Glorot</a> or <a href="https://arxiv.org/abs/1502.01852">He</a> initialization. These techniques&#8212;<em>along with strategies like <a href="https://arxiv.org/abs/1502.03167">batch normalization</a></em>&#8212;unlocked the ability to train incredibly deep neural networks, which was quite difficult before. For LLMs, we usually adopt these same weight initialization strategies. However, authors in [6] recommend adopting a slightly modified weight initialization scheme that is specifically designed for MoEs. </p><pre><code># linear layers have flipped dimensions ([out_dim, in_dim]) in torch
w_fan_in = module.weight.shape[-1]
w_std = (scale / w_fan_in) ** 0.5
torch.nn.init.trunc_normal_(
    module.weight,
    mean=0.0,
    std=w_std,
    a=-2*w_std,
    b=2*w_std,
)</code></pre><p>This weight initialization strategy samples weights from a <a href="https://pytorch.org/rl/0.6/reference/generated/torchrl.modules.TruncatedNormal.html">truncated normal distribution</a> with a mean of zero (<code>&#181; = 0</code>) and standard deviation given by <code>&#963; = SQRT(s/n)</code>, where <code>s</code> is a scale hyperparameter and <code>n</code> is the size of the input to the layer being initialized (i.e., <a href="https://stackoverflow.com/questions/42670274/how-to-calculate-fan-in-and-fan-out-in-xavier-initialization-for-neural-networks">fan-in strategy</a>). Authors in [6] also recommend using a reduced scale hyperparameter of <code>s = 0.1</code> to <em>&#8220;improve quality and reduce the likelihood of destabilized training&#8221;</em>. An implementation of this modified weight initialization strategy in PyTorch is provided above.</p><p><strong>MoE finetuning.</strong> We will only focus on pretraining nanoMoE in this overview. However, we should also be aware that MoEs can be more difficult to finetune compared to standard dense models. In particular, MoEs are prone to overfitting due to the fact that they have so many parameters. These large models are great for pretraining over massive datasets, but they can overfit when finetuned over a small amount of data. We should be aware of this issue and try our best to prevent overfitting when finetuning MoEs (e.g., via a higher dropout ratio). We leave the exploration of finetuning nanoMoE&#8212;<em>and preventing overfitting</em>&#8212;as future work.</p><h4>nanoMoE Pretraining Experiments</h4><p>Now that we understand the different tricks that we can use to train MoEs in a stable fashion, let&#8217;s test them out in real life by pretraining nanoMoE from scratch. To test these commands yourself, you will need access to one or more GPUs. For the experiments presented here, I used two RTX 3090 GPUs on my personal workstation. These are commodity GPUs&#8212;<em>they do not have much memory (only 24 Gb)</em>. The pretraining settings have been scaled down accordingly, allowing everything to fit in GPU memory and run completely in less than a week. </p><p><strong>General pretraining settings.</strong> The final configuration used for pretraining is <a href="https://github.com/wolfecameron/nanoMoE/blob/master/config/train_nano_moe.py">here</a> and has the following settings:</p><ul><li><p><em>Model architecture</em>: six layers (or blocks), six attention heads per self-attention layer, <code>d = 368</code>, <code>N = 8</code> (total experts), <code>K = 2</code> (active experts), <code>P = 2 </code>(every other layer uses an MoE block).</p></li><li><p><em>Expert capacity</em>: capacity factor of 1.25 for training and 2.0 for evaluation.</p></li><li><p><em>Auxiliary losses</em>: we use both the load balancing auxiliary loss (scaling factor of <code>0.01</code>) and the router z-loss (scaling factor of<code> 0.001</code>). </p></li><li><p><em>Precision</em>: we use automatic mixed precision (<code>bfloat16</code>) for training but the router always uses full (<code>float32</code>) precision.</p></li><li><p><em>Learning rate</em>: we adopt a standard LLM learning rate strategy&#8212;<em>linear warmup from </em><code>6e-5</code><em> to </em><code>6e-4</code><em> at the start of training, followed by cosine decay to </em><code>6e-5</code>.</p></li><li><p><em>Weight initialization</em>: we use the weight initialization scheme proposed in [6] to improve MoE training stability. </p></li></ul><p><strong>Pretraining dataset.</strong> Similarly to nanoGPT, we use the <a href="https://huggingface.co/datasets/Skylion007/openwebtext">OpenWebText dataset</a> for pretraining nanoMoE. The pretraining process is scaled down to ~25 billion total tokens&#8212;<em>around 10% of the tokens used for pretraining nanoGPT</em>. This smaller dataset allows pretraining to complete in roughly 5 days on two 3090 GPUs. However, we can easily scale this up to a full pretraining run by obtaining a better GPU setup (e.g., 8&#215;A100 GPUs) and setting <code>max_iters = 600,000</code> (instead of <code>50,000</code>).</p><p><strong>Stability experiments.</strong> To test the impact of different settings on nanoMoE&#8217;s training stability, we perform five different experiments. First, we pretrain a baseline nanoMoE model using no auxiliary losses or best practices, <em>which leads to poor load balancing and instability</em>. Then, we enable several improvements one-by-one to observe their impact on pretraining stability:</p><ol><li><p>Auxiliary load balancing loss.</p></li><li><p>Router z-loss.</p></li><li><p>Full precision in the router. </p></li><li><p>Improved weight initialization scheme. </p></li></ol><p>The results of these five experiments are shown in the figure below. As we can see, each improvement to the pretraining process yields a slight improvement in training stability&#8212;<em>the divergence in pretraining comes a little bit later in the training process</em>. When we enable all of the improvements together, the model actually completes the entire training process without any issues! We can clearly see here that the ideas discussed tangibly impact nanoMoE&#8217;s training stability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pxn-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pxn-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 424w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 848w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 1272w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pxn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png" width="1456" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:242684,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://cameronrwolfe.substack.com/i/155023686?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pxn-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 424w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 848w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 1272w, https://substackcdn.com/image/fetch/$s_!Pxn-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a03cfe9-7023-4580-ac83-1ee1c19930f1_2012x922.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Testing different stability techniques for nanoMoE</figcaption></figure></div><p>For those who are interested, I would encourage you to try these ideas out yourself! Just tweak the training configuration and execute the pretraining process using the command shown below. This command assumes that you are running pretraining on a single node with one or more GPUs available.</p><pre><code>torchrun --standalone --nproc_per_node=&lt;number of GPUs&gt; train.py &lt;path to config; e.g., config/train_nano_moe.py&gt;</code></pre><h2>Further Learning for Mixture-of-Experts</h2><p>In this overview, we have gained an in-depth understanding of how Mixture-of-Experts (MoE)-based LLMs operate by beginning with a standard decoder-only transformer architecture and modifying it to use an MoE architecture. Then, we applied these ideas by pretraining a mid-size MoE-based LLM, <em>called nanoMoE</em>, from scratch on the OpenWebText dataset. Although MoEs are considered to be more difficult to train than standard LLMs, we see in our experiments how ideas like auxiliary losses, mixed precision, better weight initialization and more can be applied to train MoEs successfully (i.e., without any instabilities)!</p><p>Although nanoMoE is a great learning tool, most practical implementations of MoEs will be more complex than this. To learn about how MoEs are actually used in LLM research, we should look at production-grade MoE frameworks for efficient training and inference (e.g., <a href="https://github.com/XueFuzhao/OpenMoE">OpenMoE</a> [9] or <a href="https://github.com/databricks/megablocks">Megablocks</a> [10]), as well as recent publications on the topic of MoEs; e.g., <a href="https://arxiv.org/abs/2401.04088">Mixtral</a>, <a href="https://arxiv.org/abs/2412.19437">DeepSeek-v3</a>, or <a href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm">DBRX</a>. </p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Research Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers understand important topics in AI research. If you like the newsletter, please subscribe, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography</h4><p>[1] Vaswani, Ashish, et al. "Attention is all you need." <em>Advances in neural information processing systems</em> 30 (2017).</p><p>[2] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural machine translation of rare words with subword units." <em>arXiv preprint arXiv:1508.07909</em> (2015).</p><p>[3] Shazeer, Noam. "Glu variants improve transformer." <em>arXiv preprint arXiv:2002.05202</em> (2020).</p><p>[4] He, Kaiming, et al. "Deep residual learning for image recognition." <em>Proceedings of the IEEE conference on computer vision and pattern recognition</em>. 2016.</p><p>[5] Zoph, Barret, et al. "St-moe: Designing stable and transferable sparse expert models." <em>arXiv preprint arXiv:2202.08906</em> (2022).</p><p>[6] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." <em>Journal of Machine Learning Research</em> 23.120 (2022): 1-39.</p><p>[7] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." <em>arXiv preprint arXiv:1701.06538</em> (2017).</p><p>[8] Liu, Aixin, et al. "Deepseek-v3 technical report." <em>arXiv preprint arXiv:2412.19437</em> (2024).</p><p>[9] Xue, Fuzhao, et al. "Openmoe: An early effort on open mixture-of-experts language models." <em>arXiv preprint arXiv:2402.01739</em> (2024).</p><p>[10] Gale, Trevor, et al. "Megablocks: Efficient sparse training with mixture-of-experts." <em>Proceedings of Machine Learning and Systems</em> 5 (2023): 288-304.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This architecture is not &#8220;new&#8221; per se. It has been around for a <a href="https://cameronrwolfe.substack.com/i/142423094/early-work-on-conditional-computation">very long time</a>. But, it&#8217;s adoption in large-scale LLM applications is more recent. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The decoder is slightly different because we remove the cross-attention layer that is used in the decoder for the full encoder-decoder model.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>An explanation of basic positional encodings (or embeddings) for transformers can be found <a href="https://www.geeksforgeeks.org/working-of-positional-embedding-in-self-attention/">here</a>. However, most modern LLMs use <a href="https://arxiv.org/abs/2104.09864">rotary positional embeddings (RoPE)</a> in place of this basic position encoding scheme from [1]. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Our implementation here also performs <a href="https://paperswithcode.com/method/attention-dropout">attention dropout</a>, where we randomly drop certain attention scores during training for regularization purposes. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The word &#8220;pointwise&#8221; indicates that the same operation is applied to every token vector in the sequence. In this case, we individually pass every token vector in the sequence through the same feed-forward neural network with the same weights.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>We use a pre-normalization structure, where normalization is applied to the input of each layer. The original transformer [1] used a post-normalization structure, but later analysis showed that <a href="https://arxiv.org/abs/2002.04745">pre-normalization</a> is favorable. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>To apply a residual connection within a neural network layer, the input and output dimension of that layer must be the same. If the dimensions are not the same, we can still apply a residual connection by just linearly projecting the input.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>See [5] and [6] for more details and experiments on tuning the capacity factor.</p><p></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>The details are not super important here&#8212;<em>this is just an implementation complexity that is introduced to vectorize the operations of the router</em>. However, this is a great coding exercise in PyTorch for those who are interested in understanding!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>This quantity is predicted by our routing algorithm and is, therefore, differentiable. So, the loss function as a whole is differentiable even though the fraction of tokens sent to each expert is not itself a differentiable quantity.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>We also multiply the result of the operation by <code>N</code> (the total number of experts), which ensures that the loss stays constant as the value of <code>N</code> increases. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>This number of tokens was selected such that the full pretraining run can be completed in ~5 days on a 2&#215; RTX 3090 GPU setup. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Although softmax transformations are a pretty common operation, we should note that standard decoder-only transformers do NOT have these exponentials anywhere within their architecture!</p></div></div>]]></content:encoded></item><item><title><![CDATA[Demystifying Reasoning Models]]></title><description><![CDATA[Understanding reasoning models and their relation to standard LLMs...]]></description><link>https://cameronrwolfe.substack.com/p/demystifying-reasoning-models</link><guid isPermaLink="false">https://cameronrwolfe.substack.com/p/demystifying-reasoning-models</guid><dc:creator><![CDATA[Cameron R. Wolfe, Ph.D.]]></dc:creator><pubDate>Tue, 18 Feb 2025 10:33:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/23d9c87e-b238-4fdd-996e-4ed4465b9931_2334x1282.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pR5Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pR5Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 424w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 848w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pR5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png" width="2389" height="1336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1336,&quot;width&quot;:2389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:984300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pR5Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 424w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 848w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!pR5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4fb1867-b78e-4db6-aea7-14251a3facce_2389x1336.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4, 13, 22])</figcaption></figure></div><p>For the last several years, we have used a relatively fixed pipeline for training large language models (LLMs); see below. First, we pretrain these language models over raw textual data from the internet. Afterwards, we align them&#8212;<em>or train them to produce outputs that are preferable to humans</em>&#8212;using a combination of <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">supervised finetuning (SFT)</a> and <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">reinforcement learning from human feedback (RLHF)</a>. Both pretraining and alignment play a key role in model quality, but a large majority of advancements in this paradigm have been driven by <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">LLM scaling laws</a>&#8212;<em>we get better results by pretraining larger models on more data</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9HTk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9HTk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 424w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 848w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1272w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png" width="1456" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194789,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9HTk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 424w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 848w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1272w, https://substackcdn.com/image/fetch/$s_!9HTk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac82c7c1-fcbd-4b32-b9cd-febfadd77c19_1720x562.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Training pipeline for a standard LLM</figcaption></figure></div><p>Recently, a completely new paradigm in LLM research has emerged: <em>reasoning</em>. Reasoning models approach problem solving in a completely different manner compared to standard LLMs. In particular, they spend a variable amount of time &#8220;thinking&#8221; prior to providing their final answer to a question. Training models that are able to think effectively (e.g., decompose problems, detect errors in their thinking, explore alternative solutions and more) requires new strategies, usually involving large-scale reinforcement learning (RL). Additionally, such models give rise to new forms of scaling laws for training via RL and inference; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1eNI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1eNI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 424w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 848w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1272w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png" width="517" height="331.4736842105263" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1254,&quot;resizeWidth&quot;:517,&quot;bytes&quot;:152296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1eNI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 424w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 848w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1272w, https://substackcdn.com/image/fetch/$s_!1eNI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88a91669-f7f0-41aa-b0f0-78392da2115a_1254x804.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [4])</figcaption></figure></div><p>In this overview, we will learn more about recent advancements in reasoning models. To start, we will focus on several (closed) reasoning models that were proposed first by OpenAI. We will contextualize the explanation of these models with the fundamental ideas that underlie LLM reasoning capabilities. Afterwards, we will explore recently-proposed (open) reasoning models, outlining necessary details for creating such a model from scratch. Reasoning models are different from standard LLMs. But, don&#8217;t worry. A lot of the key concepts of LLMs still apply to reasoning models. <em>We will clarify important distinctions throughout.</em> </p><h2>The Age of Reasoning</h2><p>Just as AI progress was seemingly <a href="https://cameronrwolfe.substack.com/p/llm-scaling-laws">starting to slow down</a>, we witnessed a sudden and significant improvement in LLM capabilities with the popularization of <a href="https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html">reasoning models</a>. First to be released was OpenAI&#8217;s <a href="https://openai.com/index/introducing-openai-o1-preview/">o1-preview</a> [4], followed by a series of distilled (i.e., smaller) models like o1-mini and later model variants like <a href="https://openai.com/index/openai-o3-mini/">o3</a> [6]. In response, other companies released similar reasoning models, such as <a href="https://deepmind.google/technologies/gemini/flash-thinking/">Google&#8217;s Gemini 2.0 Flash Thinking</a>. In this section, we will explore these initial, closed reasoning models and the basic ideas behind how they work.</p><h4>Initial Reasoning Models: o1 and o1-mini</h4><blockquote><p><em>&#8220;We've developed a new series of AI models designed to spend more time thinking before they respond.&#8221;</em> - from [4]</p></blockquote><p>The release of <strong>o1-preview</strong> [4, 5] by OpenAI made two things very clear:</p><ol><li><p>Reasoning models can solve verifiable tasks&#8212;<em>such as math and coding tasks</em>&#8212;very accurately.</p></li><li><p>The approach taken by reasoning models to solve these problems is very different from that of a traditional LLM.</p></li></ol><p><strong>Long CoT.</strong> The main difference between a reasoning model and a standard LLM is the ability to &#8220;think&#8221; before answering a question. The reasoning model&#8217;s thoughts are just long chains of thought&#8212;<em>or</em> <em>long CoT for short, sometimes referred to as a reasoning trace or trajectory</em>&#8212;outputted by the LLM. This long CoT is generated no differently than any other sequence of text. However, these reasoning trajectories exhibit very interesting properties that are more akin to search algorithms than vanilla text generation. For example, the model will:</p><ul><li><p>Think through each part of a complex problem.</p></li><li><p>Decompose complex problems into smaller, solvable parts.</p></li><li><p>Critique its own (partial) solutions and find errors.</p></li><li><p>Explore many alternative solutions. </p></li></ul><p>For some concrete examples of these reasoning trajectories, see <a href="https://openai.com/index/learning-to-reason-with-llms/">this blog post</a>. Notably, the long CoT used by OpenAI&#8217;s reasoning models are &#8220;internal&#8221;, meaning that they are hidden from the user when interacting with the model. Instead, the user sees a model-written summary of the long CoT; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JJH6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png" width="540" height="321.9230769230769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:540,&quot;bytes&quot;:297984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JJH6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 424w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 848w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!JJH6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c08cfd9-85a6-4079-b510-59857ae05c3e_1970x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://openai.com/index/learning-to-reason-with-llms/">source</a>)</figcaption></figure></div><p>The long CoT output of reasoning models gives us an easy way to control the inference-time compute of an LLM. If we want to spend more compute on solving a problem, we can simply generate a longer CoT. Similarly, less complex problems can be solved with a shorter CoT, thus saving compute at inference time. </p><p><strong>Reasoning capabilities.</strong> Initial reasoning models were actually less capable than standard LLMs in many ways<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, but they improve the reasoning capabilities of an LLM by several orders of magnitude. For example, <em>o1-preview unanimously outperforms GPT-4o and even rivals the performance of human experts on most complex reasoning tasks</em>; see below. To achieve these results, o1-preview is evaluated using maximal inference-time compute<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> and either <em>i)</em> a single output sample (solid bar) or <em>ii)</em> a majority vote among 64 parallel output samples (shaded bar). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O5uQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O5uQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 424w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 848w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O5uQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png" width="1456" height="566" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Competition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond)&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Competition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond)" title="Competition evals for Math (AIME 2024), Code (CodeForces), and PhD-Level Science Questions (GPQA Diamond)" srcset="https://substackcdn.com/image/fetch/$s_!O5uQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 424w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 848w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!O5uQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde143ac3-dbf4-476c-9524-282b23c1034c_2700x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o1 models vs. GPT-4o on reasoning tasks (from [5])</figcaption></figure></div><p>Beyond o1-preview, <strong>OpenAI&#8217;s o1</strong>&#8212;<em>the full version of o1 that was released a few months after the preview</em>&#8212;places among the top 500 students in the US on the math olympiad qualification exam (<a href="https://artofproblemsolving.com/wiki/index.php/American_Invitational_Mathematics_Examination?srsltid=AfmBOopg_BQh_GIwm9fLXXJSK812QdJcW_e6uohok7JzFaFCbie0twRk">AIME 2024</a>) and ranks within the 11th percentile of competitive human programmers on <a href="https://arxiv.org/abs/2501.01257">Codeforces</a>. For reference, GPT-4o only solved 12% of AIME problems, while o1 solves anywhere from 74% to 93% of the problems depending upon inference settings. See the figure below for a more detailed comparison between the performance of o1 and GPT-4o.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KBJp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KBJp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 424w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 848w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 1272w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KBJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png" width="1456" height="1001" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1001,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Breakdown of the accuracy and raw score of gpt-4o vs. o1 on various competition evals&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Breakdown of the accuracy and raw score of gpt-4o vs. o1 on various competition evals" title="Breakdown of the accuracy and raw score of gpt-4o vs. o1 on various competition evals" srcset="https://substackcdn.com/image/fetch/$s_!KBJp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 424w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 848w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 1272w, https://substackcdn.com/image/fetch/$s_!KBJp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd030dac8-57ff-4d51-a8a5-7bbbec5fc3ba_2400x1650.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Improvement of o1 over GPT-4o (from [5])</figcaption></figure></div><p>Similarly, <strong>o1-mini</strong>&#8212;<em>a cheaper and faster version of o1</em>&#8212;has impressive reasoning capabilities despite its 80% cost reduction relative to the full o1 model. This model, despite having limited world knowledge compared to o1, is especially capable at coding tasks and performs very well given its efficiency.</p><h4>State-of-the-Art Reasoning Models: o3 and o3-mini</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qxzS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qxzS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qxzS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg" width="1200" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/feccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;o Series Performance&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="o Series Performance" title="o Series Performance" srcset="https://substackcdn.com/image/fetch/$s_!qxzS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!qxzS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeccad4f-894f-4593-9573-ff3285420af7_1200x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Performance of OpenAI&#8217;s o3 on ARC-AGI (<a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">source</a>)</figcaption></figure></div><p>Shortly after the announcement and release of o1 models, OpenAI announced <strong>o3</strong>&#8212;<em>the most recent model in the o1 lineage</em>. This model was initially just announced (not released). We were able to see the model&#8217;s performance on several notable benchmarks&#8212;<em>as measured by OpenAI</em>&#8212;but could not actually use the model. The metrics released by OpenAI were very impressive. In fact, <em>the performance of o3 was quite shocking to many people</em>. The most notable achievements of o3 are:</p><ul><li><p>A score of 87.5% on the <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">ARC-AGI benchmark</a>&#8212;<em>the &#8220;North Star&#8221; towards AGI that was left unbeaten<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> for five years</em>&#8212;on which GPT-4o achieves 5% accuracy. o3 is the first model to exceed human-level performance of 85% on ARC-AGI.</p></li><li><p>An accuracy of 71.7% on <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-Bench Verified</a> and an <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo score</a> of 2727 on Codeforces, <em>ranking o3 among the top 200 competitive programmers on the planet</em>.</p></li><li><p>An accuracy of 25.2% on EpochAI&#8217;s <a href="https://epoch.ai/frontiermath">FrontierMath benchmark</a>, <em>improving upon the previous state-of-the-art accuracy of 2.0%</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. </p></li></ul><p>However, the public did not have access to the o3 model to verify any of these results. The full o3 model still has yet to be released at the time of writing, but OpenAI did recently release a smaller version of the model&#8212;<em><strong>o3-mini</strong></em> [6]. </p><blockquote><p><em>&#8220;Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response.&#8221;</em> - from [6]</p></blockquote><p>Compared to other reasoning models from OpenAI, o3-mini is more cost effective and production-ready. For example, this model supports features like function calling, web search and structured outputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. o3-mini also has multiple settings&#8212;<em>including low, medium and high effort</em>&#8212;for the amount of reasoning that it performs when solving a problem. This setting can be directly specified in the API request, and the model performs very impressively&#8212;<em>on par with o1 in many cases</em>&#8212;depending on the level of reasoning effort; see below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yL5T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yL5T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 424w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 848w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 1272w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yL5T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png" width="1456" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1004490,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yL5T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 424w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 848w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 1272w, https://substackcdn.com/image/fetch/$s_!yL5T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F809e35bd-3da6-4382-8635-dcff356f25c0_2424x1332.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">o3-mini performance breakdown (from [6])</figcaption></figure></div><p>In most cases, o3-mini with low reasoning effort matches the performance of o1-mini, while o3-mini with high reasoning effort exceeds the performance of all other reasoning models released by OpenAI (including the full o1 model). </p><p>o3-mini also has better world knowledge (i.e., improved factuality), is noticeably more efficient, and scores higher in human preference studies compared to prior reasoning models; see below. In particular, authors in [6] mention that during internal A/B tests <em>&#8220;o3-mini delivered responses 24% faster than o1-mini, with an average response time of 7.7 seconds compared to 10.16 seconds.&#8221;</em> o3-mini is the most efficient model released (so far) of OpenAI&#8217;s o1-style reasoning models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PYI2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PYI2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 424w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 848w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 1272w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PYI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp" width="1456" height="744" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The chart compares win rates for STEM and non-STEM tasks across AI models. \&quot;o3_mini_v43_s960_j128\&quot; (yellow) outperforms \&quot;o1_mini_chatgpt\&quot; (red baseline) in both categories, with a higher win rate for STEM tasks.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The chart compares win rates for STEM and non-STEM tasks across AI models. &quot;o3_mini_v43_s960_j128&quot; (yellow) outperforms &quot;o1_mini_chatgpt&quot; (red baseline) in both categories, with a higher win rate for STEM tasks." title="The chart compares win rates for STEM and non-STEM tasks across AI models. &quot;o3_mini_v43_s960_j128&quot; (yellow) outperforms &quot;o1_mini_chatgpt&quot; (red baseline) in both categories, with a higher win rate for STEM tasks." srcset="https://substackcdn.com/image/fetch/$s_!PYI2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 424w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 848w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 1272w, https://substackcdn.com/image/fetch/$s_!PYI2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F044cb648-2c4d-4aaa-88bb-bf4548876d24_1944x994.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Win-rate of o3-mini vs. o1-mini on STEM / non-STEM prompts (from [6])</figcaption></figure></div><p><strong>Other model providers.</strong> The release of o1-style models by OpenAI was quickly followed by other model providers. For example, Google recently released the experimental <a href="https://deepmind.google/technologies/gemini/flash-thinking/">Gemini-2.0 Flash Thinking</a>, which maintains the signature long context of Gemini models&#8212;<em>1M token context window</em>&#8212;and achieves respectable metrics on key verifiable tasks (e.g., AIME and GPQA). However, <em>this model still lags behind the performance of o1 and o3-mini</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kQ_a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kQ_a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 424w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 848w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 1272w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kQ_a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png" width="1070" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f78afa03-d704-43f4-b001-3965969a3b84_1070x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1070,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kQ_a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 424w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 848w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 1272w, https://substackcdn.com/image/fetch/$s_!kQ_a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff78afa03-d704-43f4-b001-3965969a3b84_1070x556.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(<a href="https://deepmind.google/technologies/gemini/flash-thinking/">source</a>)</figcaption></figure></div><p>Very recently, a reasoning beta was announced for Grok-3 that is very compelling. As shown below, the Grok-3 reasoning model exceeds the performance of o3-mini with high reasoning efforts and even comes close to matching the full o3 model in a few cases; e.g., 96% accuracy on AIME&#8217;24, compared to the 97% accuracy of o3. Grok-3, which was trained using a <a href="https://www.datacenterfrontier.com/machine-learning/article/55244139/the-colossus-ai-supercomputer-elon-musks-drive-toward-data-center-ai-technology-domination">massive new compute cluster</a>, is impressive (especially given the youth of xAI). At the time of writing, the reasoning beta of Grok-3 is the closest competitor to reasoning models from OpenAI. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Gxi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Gxi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 424w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 848w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 1272w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Gxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png" width="640" height="318" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;r/singularity - Grok 3 Reasoning Benchmarks&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="r/singularity - Grok 3 Reasoning Benchmarks" title="r/singularity - Grok 3 Reasoning Benchmarks" srcset="https://substackcdn.com/image/fetch/$s_!1Gxi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 424w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 848w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 1272w, https://substackcdn.com/image/fetch/$s_!1Gxi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64bc6bd5-d713-4c5e-9740-9a5e3ec81923_640x318.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from Grok-3 announcement video on X)</figcaption></figure></div><h4>Benchmarks for Reasoning Models</h4><blockquote><p><em>&#8220;Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models.&#8221;</em> - from [5]</p></blockquote><p>Before learning more about how reasoning models work, let&#8217;s take a deeper look at their performance. To truly understand the capabilities of these models, we need to do more than just look at metrics&#8212;<em>we need to inspect concrete examples of the problems that these models are solving</em>. For example, consider <a href="https://arxiv.org/abs/2110.14168">GSM8K</a> (shown below), a grade-school level math benchmark. These questions might seem trivial, but LLMs struggled to accurately solve this benchmark for <a href="https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k">several years</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yc8B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yc8B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 424w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 848w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 1272w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yc8B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png" width="1456" height="530" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:530,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340201,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yc8B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 424w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 848w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 1272w, https://substackcdn.com/image/fetch/$s_!yc8B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87c06563-9df0-4cd4-8e8b-62acf408ffce_2300x838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example questions from GSM8K (<a href="https://huggingface.co/datasets/openai/gsm8k">source</a>)</figcaption></figure></div><p>With the advent of reasoning models, this benchmark has been completely saturated&#8212;<em>we can no longer use it to meaningfully evaluate the best reasoning models</em>. Instead, we are beginning to solve much harder problems with LLMs. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FsXZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FsXZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 424w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 848w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 1272w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FsXZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png" width="1456" height="190" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:190,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60533,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FsXZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 424w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 848w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 1272w, https://substackcdn.com/image/fetch/$s_!FsXZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95dc2906-5bef-4d7a-a234-5e833d189ba1_1900x248.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example problem from AIME 2024 (<a href="https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Problems">source</a>)</figcaption></figure></div><p>For example, consider the <a href="https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Problems/Problem_15">15th problem from AIME 2024</a>, as shown above. This problem is quite complex and goes beyond the arithmetic reasoning questions found in GSM8K. There are (at least) six different ways that this problem can be solved, all of which require knowledge of advanced mathematical techniques (e.g., derivatives, <a href="https://en.wikipedia.org/wiki/Number_theory">number theory</a> or <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>). </p><p>Additionally, the complex benchmarks being solved by reasoning models go beyond math! For example, GPQA [7] contains hundreds of multiple-choice questions from several scientific domains; e.g., Biology, Physics, and Chemistry. All of these questions are written by domain experts and verified to be both very difficult and &#8220;Google-proof&#8221;, meaning that non-experts struggle to solve these problems even when given sufficient time and unrestricted internet access.</p><blockquote><p><em>&#8220;We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy, while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.&#8221;</em> - from [7]</p></blockquote><p>The ARC-AGI benchmark&#8212;<em>described as a &#8220;material stepping stone toward AGI&#8221;</em>&#8212;involves a variety of grid-based puzzles in which the LLM must learn patterns among input-output grids and perfectly replicate this learned pattern on a final output example; see below. Most LLMs struggle to solve these puzzles (e.g., GPT-4o achieves an accuracy of only 5%), but reasoning models perform quite well on this benchmark&#8212;<em>30-90% accuracy depending on the compute budget</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CNiP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CNiP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 424w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 848w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CNiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png" width="1456" height="1005" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1005,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:874757,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CNiP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 424w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 848w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!CNiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb2e0506-6107-4e23-8ef5-3e0f4bb1e6e8_1538x1062.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To say the least, <em>these are a different caliber of (non-trivial) problems that reasoning LLMs are beginning to solve</em>. Despite the difficulty of these benchmarks, modern reasoning models are found to be remarkably capable&#8212;<em>OpenAI&#8217;s o3 model is reported to achieve a score of nearly 97% on AIME 2024</em>. After manually inspecting some of these questions, we can truly understand the gravity of this result.</p><h2>Fundamentals of Reasoning Models</h2><blockquote><p>&#8220;<em>We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).&#8221;</em> - from [1]</p></blockquote><p>Although the reasoning models presented above are clearly impressive, there are all closed models. So, <em>we have no information about how they actually work</em>. The only information we are given is the above quote and the plot shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ozKr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ozKr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 424w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 848w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1272w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png" width="443" height="226.05617088607596" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1264,&quot;resizeWidth&quot;:443,&quot;bytes&quot;:104279,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ozKr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 424w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 848w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1272w, https://substackcdn.com/image/fetch/$s_!ozKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fe00c0c-da10-431b-8316-4ea3939e50fe_1264x645.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>From this limited information, however, we can draw some useful conclusions. Mainly, there are two key components involved in scaling a reasoning model:</p><ul><li><p>More training via RL.</p></li><li><p>More inference-time compute (i.e., inference-time scaling).</p></li></ul><p>Although OpenAI does not reveal many of the details behind their approach to scaling these two components of a reasoning model, there is still <a href="https://github.com/srush/awesome-o1">a lot of research</a> that has been published on this topic. To provide more context, let&#8217;s briefly take a look at some of this work&#8212;<em>along with details shared by OpenAI</em>&#8212;to outline some of the key concepts that underlie how reasoning models are trained and used. </p><h4>Reinforcement Learning with Verifiable Rewards</h4><p>One detail that we should immediately notice about o1-style models is that they are primarily used for and evaluated on problems that are verifiable in nature; e.g., math and coding. But, <em>what exactly does &#8220;verifiable&#8221; mean in this context?</em> First, we assume that we have access to either <em>i)</em> a ground truth answer for the problem or <em>ii)</em> some rules-based technique that can be used to verify correctness. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zfsl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png" width="614" height="210.42994505494505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:499,&quot;width&quot;:1456,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:172420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zfsl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 424w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 848w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1272w, https://substackcdn.com/image/fetch/$s_!zfsl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb865992-1eee-4fdb-b98a-165f4d555e11_1774x608.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Verifying a math problem via exact string match</figcaption></figure></div><p>For example, we can define a ground truth final answer for most math problem&#8212;this is done in <a href="https://huggingface.co/datasets/openai/gsm8k">GSM8K</a> with the <code>#### &lt;answer&gt;</code> syntax. Then, we can extract the final answer from the LLM&#8217;s output and compare this answer to the ground truth using a basic string match; see above. Similarly, if we have test cases prepared for a coding question, we can simply execute the code produced by our LLM and check whether the provided solution satisfies all of the test cases.</p><blockquote><p><em>&#8220;Reinforcement Learning with Verifiable Rewards (RLVR) can be seen as a simplified form of existing approaches for bootstrapping LM reasoning or a simpler form of RL with execution feedback, in which we simply use answer matching or constraint verification as a binary signal to train the model.&#8221; </em>- from [13]</p></blockquote><p>Saying that a domain is &#8220;verifiable&#8221; does NOT mean that we can automatically verify arbitrary solutions to problems in this domain. Rather, we will often need access to ground truth answers&#8212;<em>typically obtained from humans</em>&#8212;for verification. </p><p>However, there are some behaviors<em> </em>that can be verified using simple rules instead of ground truth. For example, we can determine whether a reasoning model has the correct output format, follows certain instructions, or produces outputs of a particular length (e.g., the low, medium or high reasoning effort used by o3-mini) by performing simple checks with a set of hard-coded rules. </p><p><strong>Verification complexities.</strong> Verifying an LLM&#8217;s output can become quite complex depending on the problems we are solving. Even for math problems, verifying a match between the LLM&#8217;s answer and ground truth is difficult. For example, the solution may be presented in a different form or format, leading to false negative verifications. In these cases, simple string matching may not be enough! Instead, we can prompt an LLM to tell us whether the two solutions are a match or not, which has been found to drastically reduce incorrect verifications [14]. For code, implementing verification is tough as well&#8212;<em>it requires constructing a data pipeline that can very efficiently execute and verify test cases within our training setup</em>.</p><blockquote><p><em>&#8220;We do not apply neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale RL process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.&#8221;</em> - from [1]</p></blockquote><p><strong>Neural verification.</strong> Beyond the verifiable problems outlined above, we can also consider weaker forms of verification. For example, creative writing is a task that is difficult to verify. However, we can:</p><ol><li><p>Train a <a href="https://arxiv.org/abs/2403.13787">neural reward model</a> or verifier.</p></li><li><p>Score our LLM&#8217;s output with this model.</p></li><li><p>Use the predicted score as a reward or verification signal.</p></li></ol><p>Such a setup is very similar to <a href="https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations">reinforcement learning from human feedback (RLHF)</a>. In this case, we are training our reward model to perform binary verification based on the correctness or quality of the model&#8217;s response<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. However, using a neural verifier comes with the risk of <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a>, especially when performing large-scale RL. The model is trained for longer and does much more exploring of the reward landscape, thus increasing the risk of reward hacking.  As a result, many recent reasoning models have avoided this approach.</p><p><strong>Learning from verifiable rewards.</strong> We now understand verification, but how can verification be used to train an LLM? The idea here is simple: <em>we just directly use the verification result as a reward signal for training with RL</em>; see below. There are many different ways of implementing this idea (e.g., <a href="https://arxiv.org/abs/2305.20050">process rewards</a> or <a href="https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop">pure RL</a>), but they share the common theme of using RL to learn from verifiable rewards. <em>This is the fundamental concept upon which all modern reasoning models are based</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mzxO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png" width="1456" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mzxO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 424w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 848w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1272w, https://substackcdn.com/image/fetch/$s_!mzxO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7334cdb5-5398-47d2-98bb-01ca41a58879_1854x726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [13])</figcaption></figure></div><p>For a complete exposition of methods that can be used to learn from verifiable rewards with RL, check out the incredible video by <a href="https://rush-nlp.com/">Sasha Rush</a> below.</p><div id="youtube2-6PEJ96k1kiw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;6PEJ96k1kiw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/6PEJ96k1kiw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h4>Inference-Time Strategies: Chain of Thought and Decoding</h4><p>There are two basic ways<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> that we can increase the amount of compute that our language model is consuming at inference time:</p><ul><li><p>Generate more tokens (i.e., longer output sequence).</p></li><li><p>Generate multiple outputs.</p></li></ul><p>In this section, we will go into these techniques in more detail, exploring how they are practically implemented in LLMs via chains of thought and different decoding strategies; e.g., parallel versus sequential decoding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NPw_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NPw_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 424w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 848w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1272w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png" width="469" height="256.0817307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:469,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NPw_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 424w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 848w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1272w, https://substackcdn.com/image/fetch/$s_!NPw_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F599a636e-b0b2-4de3-84c8-3edf906bfa82_1616x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [8])</figcaption></figure></div><p><strong>Chain of thought.</strong> We already know that reasoning models use long CoT as their medium for reasoning. Proposed in [8], a chain of thought&#8212;<em>at the simplest level</em>&#8212;is just an explanation that an LLM provides for its own output. In most cases, these explanations are written prior to the LLM generating its final answer, allowing the model to use its explanation as context when generating its answer; see above.</p><p>The long CoT used by reasoning models is much different than a standard CoT. A standard CoT is concise and human-readable. A long CoT is several thousand tokens long<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. Although it can be used for interpretability purposes, the long CoT is not optimized for human readability. Rather, it is an extensive reasoning trace that approaches problem solving in a detailed manner and contains a variety of complex reasoning behaviors (e.g., backtracking and self-refinement). </p><blockquote><p><em>&#8220;We have decided not to show the raw chains of thought to users&#8230; We strive to partially make up for [this decision] by teaching the model to reproduce useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.&#8221;</em> - from [5]</p></blockquote><p>Additionally, reasoning models logically separate their CoT from the final output of the model. For example, OpenAI avoids exposing the long CoT directly to users and instead provides an LLM-generated summary of the long CoT to supplement the reasoning model&#8217;s final answer. Such a logical separation is fundamentally necessary due to the length of CoT. Most users will only read the final answer&#8212;<em>reading the entire reasoning trace would be incredibly time consuming</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mBBe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mBBe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 424w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 848w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 1272w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mBBe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png" width="372" height="171.2185430463576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:278,&quot;width&quot;:604,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:27675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mBBe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 424w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 848w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 1272w, https://substackcdn.com/image/fetch/$s_!mBBe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7b26d4a-0d1c-4e27-a63d-5fe7035e83b1_604x278.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [15])</figcaption></figure></div><p><strong>Parallel decoding.</strong> To improve the accuracy of an LLM&#8217;s final output, we may also use parallel decoding techniques; see above. The idea here is simple: <em>instead of generating a single output with our LLM, we generate multiple outputs and aggregate these outputs to form a single, final answer</em>. This aggregation can be done in many ways; e.g., using <a href="https://arxiv.org/abs/2203.11171">majority vote</a> or consensus, using <a href="https://arxiv.org/abs/2206.02336">weighted voting</a>, identifying the best output(s) with a <a href="https://arxiv.org/abs/2408.15240">neural reward model or verifier</a> (i.e., also known as <a href="https://arxiv.org/abs/2110.14168">Best-of-N or rejection sampling</a>), or <a href="https://arxiv.org/abs/2210.02441">other domain-specific algorithms</a>. </p><p>The main benefit of these approaches is their simplicity and effectiveness. Scaling up parallel decoding is easy&#8212;<em>we just generate, verify and aggregate a larger number of outputs&#8212;</em>and yields meaningful boosts in performance [9, 10, 11]. Parallel decoding techniques are clearly used by o1-style models&#8212;<em>just look at the details of the plots provided in their blog posts (shown below)</em>! However, parallel decoding techniques cannot by themselves explain some of the more complex reasoning behaviors exhibited by recently released reasoning models. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-0o4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-0o4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 424w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 848w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-0o4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png" width="578" height="333.46153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:840,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:345110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-0o4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 424w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 848w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!-0o4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37f574b5-9d41-4b11-b49a-2d6b4c9e95ee_1942x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [5])</figcaption></figure></div><p>As a side note, we can also apply the idea of rejection sampling to training (i.e., training vs. test-time rejection sampling). To do this, we just:</p><ul><li><p>Sample several outputs or trajectories.</p></li><li><p>Use our reward model (or other scoring mechanism) to pick the best outputs.</p></li><li><p>Train on these outputs.</p></li></ul><p>This approach is commonly used in practice; e.g., LLaMA models perform several rounds of training-time rejection sampling in their post training process prior to the application of RLHF. Rejection sampling is very effective in practice and is easier to implement and scale compared to <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">PPO-based RLHF</a>. </p><blockquote><p><em>&#8220;We adopt a relatively simple post-training procedure based on supervised finetuning (SFT), rejection sampling (RS), and direct preference optimization (DPO) as opposed to more complex reinforcement learning algorithms that tend to be less stable and harder to scale.&#8221;</em> - from [12]</p></blockquote><p><strong>Self-refinement.</strong> Beyond parallel decoding, we can also consider critique or self-refinement strategies for decoding. First, the LLM generates an initial response. Then, feedback&#8212;<em>either from the LLM or some external sourc</em>e&#8212;is provided for the response, and the LLM can revise its response based on the feedback. This cycle can repeat for an arbitrary number of iterations; see below for an illustration.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dvWP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dvWP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 424w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 848w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 1272w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dvWP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png" width="394" height="165.84855233853006" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:898,&quot;resizeWidth&quot;:394,&quot;bytes&quot;:42022,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dvWP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 424w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 848w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 1272w, https://substackcdn.com/image/fetch/$s_!dvWP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a8ce6da-c042-4dc3-adeb-89f0f0cc1263_898x378.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [15])</figcaption></figure></div><p>Several different approaches for refinement exist, but they can be broadly categorized into two groups:</p><ul><li><p><em>Extrinsic</em>: feedback comes from some external verifier or module.</p></li><li><p><em>Intrinsic</em>: the LLM provides feedback on its own generation.</p></li></ul><p>The results and practical effectiveness of refinement are somewhat mixed. There are many successful examples of using extrinsic feedback&#8212;<em>such as from a verifier [16] or a code interpreter [17]</em>&#8212;to refine the output of an LLM. Whether intrinsic refinement is effective is highly dependent upon the quality of feedback provided by the LLM. Intrinsic refinement can work well for simple tasks [18]. However, this approach struggles to generalize to more complex tasks (e.g., math) [19]. </p><blockquote><p><em>&#8220;When LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.&#8221;</em> - from [18]</p></blockquote><h2>Open Reasoning: DeepSeek-R1 and More</h2><p>So far, we have learned about the basic concepts that allow us to instill reasoning capabilities within an LLM. However, all of the models we have learned about are closed&#8212;<em>we have no way of knowing how exactly these models were created</em>. Luckily, several open reasoning models have been recently released. The most notable of these models, which we will cover in this section, is called DeepSeek-R1 [1]. In addition to matching the performance of OpenAI&#8217;s o1, this model comes with a full technical report that provides sufficient details for replication and, therefore, completely demystifies the process needed to create a powerful reasoning model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jOEt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jOEt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 424w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 848w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 1272w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jOEt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png" width="1224" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:162084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jOEt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 424w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 848w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 1272w, https://substackcdn.com/image/fetch/$s_!jOEt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F728166d1-a874-48ab-a2a4-ea81e0636228_1224x730.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>The core idea behind DeepSeek-R1 aligns well with what we have learned for far. The model is trained with RL on verifiable tasks, where it learns to leverage long CoT to solve complex reasoning problems. Interestingly, the RL training process is the key contributor to the model&#8217;s strong reasoning capabilities. Multiple versions of this model&#8212;<em>DeepSeek-R1-Zero and DeepSeek-R1</em>&#8212;are released that have comparable reasoning capabilities. As we will see, the first of these models completely forgoes any supervised training, demonstrating that complex reasoning capabilities naturally emerge from large-scale training with RL. </p><blockquote><p><em>&#8220;DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.&#8221;</em> - from [1]</p></blockquote><p><strong>DeepSeek-v3.</strong> The creation of both DeepSeek-R1-Zero and DeepSeek-R1 begins with a powerful base model, called DeepSeek-v3 [2]. In addition to having open weights and a detailed technical report [2], this model surpasses the performance of prior open LLMs and even matches the quality of closed models; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a08q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a08q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 424w, https://substackcdn.com/image/fetch/$s_!a08q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 848w, https://substackcdn.com/image/fetch/$s_!a08q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!a08q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a08q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png" width="1456" height="935" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:935,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a08q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 424w, https://substackcdn.com/image/fetch/$s_!a08q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 848w, https://substackcdn.com/image/fetch/$s_!a08q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!a08q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc26d7720-a597-49c3-82b7-5ee830132411_1846x1186.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [2])</figcaption></figure></div><p>DeepSeek-v3 is a 671 billion parameter Mixture-of-Experts (MoE) model. If you are unfamiliar with MoEs, please check out the post below, which explains the concept and provides several practical examples, including DeepSeek-v3. </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;236eeec4-83df-43d7-94b0-19ecf7fbab2a&quot;,&quot;caption&quot;:&quot;In an area of study that is rapidly changing, the decoder-only transformer architecture has remained one of the few enduring staples in large language model (LLM) research. This architecture has been used since the proposal of the original GPT model and has remained largely unchanged, aside from minor tweaks to improve efficiency. One o&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Mixture-of-Experts (MoE) LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:29736521,&quot;name&quot;:&quot;Cameron R. Wolfe, Ph.D.&quot;,&quot;bio&quot;:&quot;ML @ Netflix &#8226; Rice University PhD &#8226; I make AI understandable&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/69aba7df-b571-4609-aa47-fc2d031c11b8_1242x1595.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-27T10:33:48.037Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fdf1382-38dc-45fc-a741-b62babfd99c5_2258x1268.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://cameronrwolfe.substack.com/p/moe-llms&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154340424,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:168,&quot;comment_count&quot;:10,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep (Learning) Focus&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9b43fb-52d5-40da-995d-5b7cd3f91064_896x896.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>To improve inference and training efficiency, DeepSeek-v3 makes the following design choices (see <a href="https://cameronrwolfe.substack.com/i/154340424/deepseek-v-and-deepseek-v">here</a> for more details):</p><ul><li><p>Uses Multi-Headed Latent Attention (MLA). </p></li><li><p>Adopts an optimized MoE structure (e.g., fine-grained and shared experts). </p></li><li><p>Uses a multi-token prediction objective during pretraining.</p></li><li><p>Forgoes load balancing losses typically used to train MoE models. </p></li><li><p>Decreases precision to FP8 throughout training by adopting a novel quantized training strategy that is proposed in [2]. </p></li></ul><p>For these reasons, the training of DeepSeek-v3 is very economical compared to other models&#8212;<em>the model is impressive in terms of both performance and efficiency</em>. Several prior versions of this model were released that inspire some of the design decisions made by DeepSeek-v3; e.g., see <a href="https://arxiv.org/abs/2405.04434">DeepSeek-v2</a> and <a href="https://api-docs.deepseek.com/news/news1210">DeepSeek-v2.5</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. </p><h4>DeepSeek-R1-Zero</h4><blockquote><p><em>&#8220;We explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process.&#8221; </em>- from [1]</p></blockquote><p>The first reasoning model proposed by DeepSeek was DeepSeek-R1-Zero. This model adopts an interesting training strategy that teaches the model to reason purely via large-scale RL&#8212;<em>without any SFT</em>. The model naturally explores and learns to leverage long CoT to solve complex reasoning problems through RL. DeepSeek-R1-Zero is the first open research effort to show that reasoning capabilities can be developed without supervised training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Old!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Old!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 424w, https://substackcdn.com/image/fetch/$s_!_Old!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 848w, https://substackcdn.com/image/fetch/$s_!_Old!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 1272w, https://substackcdn.com/image/fetch/$s_!_Old!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_Old!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png" width="1456" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Old!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 424w, https://substackcdn.com/image/fetch/$s_!_Old!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 848w, https://substackcdn.com/image/fetch/$s_!_Old!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 1272w, https://substackcdn.com/image/fetch/$s_!_Old!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c284b27-d0f4-4699-b4a0-24c37e8eef88_1840x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [22])</figcaption></figure></div><p><strong>RL with GRPO.</strong> The training of DeepSeek-R1-Zero begins with the DeepSeek-v3 [2] base model. We directly finetune this base model via RL. In particular, authors in [1] select <a href="https://huggingface.co/docs/trl/main/en/grpo_trainer">Group Relative Policy Optimization (GRPO)</a> [3], which is depicted in the figure above, as their RL algorithm. The selection of RL algorithms for LLM training is an open and active research topic. Traditionally, researchers have used <a href="https://cameronrwolfe.substack.com/p/proximal-policy-optimization-ppo">PPO</a> for training LLMs, but there is a recent trend towards adopting simpler RL algorithms&#8212;<em>such as <a href="https://arxiv.org/abs/2402.14740">REINFORCE</a> or <a href="https://arxiv.org/abs/2501.12599">GRPO</a></em>&#8212;for LLM training. The main reasons provided for the selection of GRPO in [1] are:</p><ul><li><p>A reduction in the cost of RL training.</p></li><li><p>The elimination of the critic model, which is (usually) the same size as the policy model (i.e., the LLM itself). </p></li></ul><p><strong>Defining rewards.</strong> Unlike most traditional work on RL with LLMs, no neural reward models&#8212;<em>meaning LLM-based reward models that are trained over preference data</em>&#8212;are used to train DeepSeek-R1-Zero. Rather, the authors use a rules-based reward system, which <em>i)</em> avoids reward hacking, <em>ii)</em> saves on compute costs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>, and <em>iii)</em> is simpler to implement. There are two types of rewards used in particular:</p><ol><li><p><em>Accuracy reward</em>: evaluates whether the model&#8217;s response is correct.</p></li><li><p><em>Format reward</em>: enforces a desired format on the model&#8217;s output.</p></li></ol><p>DeepSeek-R1-Zero is trained purely on automatically verifiable tasks, such as math and coding problems. For math problems with deterministic results, the model can provide its answer in a specified format, allowing us to verify via basic string matching. Similarly, coding problems can be verified by executing the code produced by the LLM in a sandbox over predefined test cases.</p><blockquote><p><em>&#8220;The neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.&#8221;</em> - from [1]</p></blockquote><p>As mentioned above, the format reward provides a positive training signal when the model produces an output that uses the correct format or template. The format used in [1] simply places the model&#8217;s long CoT&#8212;<em>or the thinking / reasoning process</em>&#8212;between two special tokens: <code>&lt;think&gt;</code> and <code>&lt;/think&gt;</code>. The model then produces its answer separately&#8212;<em>between the </em><code>&lt;answer&gt;</code><em> and </em><code>&lt;/answer&gt;</code><em> tags</em>&#8212;after the completion of the reasoning process; see below for an illustration.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZD6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png" width="1840" height="454" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:1840,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lZD6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 424w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 848w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1272w, https://substackcdn.com/image/fetch/$s_!lZD6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdc9fc1-4032-41ba-9d7a-946f4826f826_1840x454.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Learning via RL.</strong> Despite using no SFT, DeepSeek-R1-Zero shows clear progress in its reasoning capabilities throughout the RL training process. The model&#8217;s performance on AIME 2024 is plotted below as training progresses. Here, the model&#8217;s performance gradually improves, eventually reaching parity with o1-preview<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. After training completes, DeepSeek-R1-Zero has improved from an initial performance of 15.6% to 71.0%&#8212;<em>or 86.7% when using majority voting with 16 votes</em>&#8212;on AIME 2024! Such results mirror the trends in performance we see with closed reasoning models&#8212;<em>DeepSeek-R1-Zero achieves impressive performance after RL training and can further improve its performance via parallel decoding strategies</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8rFM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:770207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8rFM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 424w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 848w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!8rFM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe19787e1-df29-413b-8ab3-7ed137eca9d9_1844x1028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>A full performance comparison between DeepSeek-R1-Zero and o1 models is provided in the table below. DeepSeek-R1 matches or exceeds the performance of o1-mini in most cases and performs comparably to o1-preview on several tasks. However, reasoning models from OpenAI perform much better in the coding domain&#8212;<em>DeepSeek-R1-Zero is clearly a less powerful coding model</em>. As we will soon see, this problem is fixed in DeepSeek-R1 (the follow-up model).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Xef!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Xef!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 424w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 848w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1272w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png" width="1456" height="367" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:855771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Xef!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 424w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 848w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1272w, https://substackcdn.com/image/fetch/$s_!5Xef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba93d001-c99e-4b80-a371-b97d92ea1adc_2008x506.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>What is happening here?</strong> Clearly, DeepSeek-R1-Zero gains impressive reasoning capabilities from the RL training process outlined in [1]. However, <em>the dynamics of the model&#8217;s learning process are also quite observable</em>! Because we perform no SFT-style training, we can closely monitor the progression of the model&#8217;s reasoning strategy throughout the RL training process. As shown below, DeepSeek-R1-Zero learns to leverage more &#8220;thinking time&#8221;&#8212;<em>or just generate progressively longer chains of thought</em>&#8212;to improve its reasoning process as training progresses. The model naturally learns to leverage more test-time compute to solve harder problems!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1809109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 424w, https://substackcdn.com/image/fetch/$s_!COPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 848w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!COPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36e006bb-5959-485b-bb4a-d45b235a8a9d_1800x1004.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>Authors in [1] also observe several interesting tendencies that emerge naturally during training with RL. For example, the model develops an ability to reflect upon its own solutions by revisiting and evaluating prior components of its reasoning process. Similarly, the model begins to explicitly test out and explore alternative solutions or approaches during the problem solving process. This behavior is not explicitly programmed&#8212;<em>it arises naturally during training with RL</em>! </p><blockquote><p><em>&#8220;The self-evolution of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously.&#8221;</em> - from [1]</p></blockquote><p>At the most basic level, the RL environment constructed in [1] allows the model to explore different strategies for arriving at a correct&#8212;<em>as determined by verification</em>&#8212;final solution. During exploration, we reward the model for:</p><ol><li><p>Using the correct reasoning template or structure.</p></li><li><p>Producing a correct final solution.</p></li></ol><p>From these rewards alone, the model learns how to solve complex reasoning problems. We do not explicitly need to teach the model how to decompose problems, search for a solution, perform backtracking, or evaluate its own line of thought. Instead, we just provide the correct incentives (or rewards) to the model during the training process. Then, the LLM can autonomously learn necessary behaviors for solving problems via an RL-based &#8220;self-evolution&#8221; process. </p><h4>DeepSeek-R1</h4><p>DeepSeek-R1-Zero shows us that LLMs can develop impressive reasoning capabilities from pure RL with no SFT, but this model has some minor bugs. For example, its readability is poor<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> and it incorrectly mixes languages together. Put simply, DeepSeek-R1-Zero is very good at reasoning, <em>but it lacks some of the desirable properties of a well-<a href="https://cameronrwolfe.substack.com/p/the-history-of-open-source-llms-imitation">aligned</a> LLM</em>. As a solution, authors in [1] propose a new, multi-stage training process that integrates some &#8220;cold start&#8221; SFT data into training along with some other tricks. This training pipeline is used to create DeepSeek-R1, an LLM that is both aligned and capable of complex reasoning.</p><p>Similarly to DeepSeek-R1-Zero, we begin with DeepSeek-v3 as a base model. Then, DeepSeek-R1 undergoes four stages of training, including two SFT phases and two RL phases. The purpose of the SFT phases is to provide a better starting point for exploration during each of the RL phases. This training pipeline is one of the key contributions of [1]&#8212;<em>it provides an effective recipe for combining reasoning-style training with the standard post training recipe for LLMs. </em>Let&#8217;s take a deeper look at each stage of the training recipe used for DeepSeek-R1. </p><blockquote><p><em>&#8220;To prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor.&#8221;</em> - from [1]</p></blockquote><p><strong>Phase One: Cold Start (or Reasoning-Oriented SFT).</strong> Prior to RL training, R1 is trained via SFT over a small dataset of long CoT examples, which is referred to in [1] as &#8220;cold start&#8221; data. There are a few different approaches that we can use to collect this cold start data:</p><ol><li><p>Prompt a model (e.g., DeepSeek-v3) to produce long CoT data, either with few-shot examples or by instructing the model to generate detailed answers with accompanied reflection and verification.</p></li><li><p>Use the R1-Zero model to generate a large number of long CoT outputs, then ask humans to post-process and select the model&#8217;s best outputs.</p></li></ol><p>Authors in [1] combine these approaches to collect &#8220;thousands of cold-start data&#8221; on which DeepSeek-v3 is finetuned directly via SFT. Because we are using long CoT data, <em>this is a reasoning-oriented finetuning process</em>. From this cold start data, the model learns a viable (initial) template for solving reasoning problems. </p><p>The data used for reasoning-oriented SFT introduces a human prior into DeepSeek-R1&#8217;s training process. We can explicitly select the style and pattern of data from which the model learns during this stage. For example, authors in [1] mention that they structure this data to include summaries of each long CoT, thus teaching the model to summarize its entire reasoning process prior to providing its final answer. This data serves as a seed for the RL training process&#8212;<em>the model begins its self-exploration by matching the style of the SFT training data.</em></p><p><strong>Stage Two: Reasoning-Oriented RL.</strong> After SFT, we just repeat the large-scale RL training process proposed by R1-Zero to enhance the underlying model&#8217;s ability to handle reasoning-intensive tasks. The only change made for DeepSeek-R1 is the addition of a language consistency reward, calculated as the portion of the model&#8217;s output written in the desired target language. This language consistency reward is found in [1] to slightly deteriorate the model&#8217;s reasoning capabilities. However, language consistency improves the overall alignment of the resulting model with human preferences&#8212;<em>the model&#8217;s output is more fluent and readable</em>.</p><p><strong>Stage Three: Rejection sampling.</strong> After the convergence of reasoning-oriented RL, we use the resulting model to collect a large and diverse SFT dataset. Unlike the initial cold start SFT phase, however, we collect more than just reasoning-oriented data. Namely, we augment the reasoning data with general purpose data so that the model can learn from a broader set of problems and domains. </p><p>To collect more reasoning data, authors in [1]:</p><ol><li><p>Curate a diverse set of reasoning-based prompts.</p></li><li><p>Generate candidate trajectories<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> using the model from phase two.</p></li><li><p>Perform rejection sampling&#8212;<em>or filter and select the top trajectories based on the quality and correctness or each trajectory</em>. </p></li></ol><p>This is the same training-time rejection sampling process that we learned about earlier in this post! Interestingly, we rely upon more than rules-based techniques for verification in this phase. We also incorporate additional data from non-verifiable domains by using DeepSeek-v3 as a <a href="https://arxiv.org/abs/2408.15240">generative reward model</a> or weak verifier. After applying heuristic filtering (e.g., removing outputs with language mixing or long paragraphs), we arrive at a final set of 600K reasoning trajectories. </p><blockquote><p><em>&#8220;We reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.&#8221;</em> - from [1]</p></blockquote><p>The SFT dataset from this stage includes a substantial ratio of non-reasoning data (e.g., writing or translation examples). We source this data from the same post training dataset used for DeepSeek-v3. However, the data is augmented by asking DeepSeek-v3 to generate a long CoT to explain the outputs of complex queries&#8212;<em>simpler queries, however, are not given any CoT</em>. A total of 200K non-reasoning examples are collected, forming an SFT dataset of 800K examples. </p><p><strong>Stage Four: General-purpose RLHF.</strong> The final training stage of DeepSeek-R1 aligns the model with human preferences while continuing to hone its reasoning abilities. Similarly to the prior stage, we train the model over a combination of reasoning-based and general purpose data. In particular, we train the model using RL with a combination of different rewards for each type of data:</p><ul><li><p>Rules-based rewards (same as R1-Zero) for reasoning-based problems. </p></li><li><p>Neural reward models&#8212;<em>trained over human preference pairs, just as in standard RLHF</em>&#8212;for general purpose data.</p></li></ul><p>DeepSeek-R1 is aligned to be more helpful and harmless on general purpose data. These are two <a href="https://arxiv.org/abs/2204.05862">very common alignment criteria</a> used in LLM research. Each of these criteria are modeled with a separate neural reward model that is trained over a (supervised) dataset of human preferences. Helpfulness rewards are only measured over the final answer of the model (i.e., excluding the long CoT), while harmless rewards consider the model&#8217;s entire output trajectory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. By combining rules and preference-based rewards, DeepSeek-R1 can be aligned to human preferences while maintaining strong reasoning performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Wcf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 424w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 848w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png" width="724" height="412.7197802197802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:573212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Wcf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 424w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 848w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!0Wcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d42ce87-35e7-4af2-8a45-cf348df75132_1918x1094.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>How does it perform?</strong> As shown above, R1 matches or surpasses the performance of o1 on most reasoning tasks. Unlike R1-Zero, R1 also has reasonably strong coding abilities. On general purpose tasks, R1 continues to perform well as a result of its hybrid training pipeline. In general, R1 is a very capable model that seems to be on par with OpenAI&#8217;s o1 and can solve a wide variety of tasks&#8212;<em>including both traditional and reasoning-oriented tasks</em>&#8212;with high accuracy.</p><p>One interesting observation about this model (and other reasoning models) is that it performs poorly on instruction following benchmarks (e.g., <a href="https://arxiv.org/abs/2311.07911">IF-Eval</a>) compared to standard LLMs. Currently, <em>reasoning models seem to be worse than standard LLMs at following instructions</em>. In the future, I personally believe this trend is likely to reverse. In theory, reasoning models should be capable of leveraging their thought process to better interpret and adhere to a prompt provided by a human user. For example, <a href="https://arxiv.org/abs/2412.16339">deliberative alignment</a> follows a somewhat similar approach.</p><p><strong>Is SFT necessary?</strong> R1-Zero emphasizes the ability to train strong reasoning models without SFT, while the full R1 model uses several SFT phases to obtain a stronger, final model. So, we might begin to wonder: <em>Should we use SFT of not? </em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vw21!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vw21!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vw21!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png" width="1456" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:664432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vw21!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Vw21!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6b1fbd1-3f9b-4983-8914-1a93d2d2fa87_2388x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Is SFT necessary for reasoning models?</figcaption></figure></div><p>For a standard LLM, SFT provides a high-quality starting point for RLHF. If we applied RLHF directly to the base model, the learning process would be much less efficient. Data for SFT is either synthetically generated or manually created by humans. Generally, collecting data for SFT is expensive (both in terms of time and money). <em>We have to manually write a good response from scratch for the LLM</em>!</p><p>Collecting such SFT data for reasoning models is more difficult due to their long CoT. Asking humans to manually create long CoT data would be time consuming and expensive! Our only option is to generate this data synthetically, but:</p><ol><li><p>Generating this particular style of output with a model may still be hard.</p></li><li><p>Correctly verifying such long outputs is difficult.</p></li></ol><p>Given the additional complexity of collecting SFT data for reasoning models, authors in [1] first try to avoid SFT altogether! From these experiments, we see that such reasoning abilities naturally emerge from pure RL&#8212;<em>this is an incredible discovery</em>! However, the resulting model has several shortcomings (e.g., language mixing). When we train over some SFT prior to RL (i.e., a &#8220;cold start&#8221;), we provide a better prior to RL, which <em>i)</em> eliminates instability during the initial phases of RL training, <em>ii)</em> speeds up up training and <em>iii)</em> improves model quality. So, SFT is not completely necessary, <em>but it is still practically useful if we have the data</em>!</p><h4>Distilled Models</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9nuA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9nuA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 424w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 848w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 1272w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9nuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png" width="1248" height="534" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:534,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9nuA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 424w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 848w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 1272w, https://substackcdn.com/image/fetch/$s_!9nuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e1abb7a-4035-421b-bcbe-35ccfdb71e47_1248x534.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Illustration of the knowledge distillation process (<a href="https://arxiv.org/abs/2006.05525">source</a>)</figcaption></figure></div><p>Beyond DeepSeek-R1, authors in [1] release a series of dense models that are distilled from R1. The <a href="https://arxiv.org/abs/2402.13116">distillation process</a> is found to significantly enhance the reasoning capabilities of smaller and more efficient models. The full DeepSeek-R1 model is large (i.e., a 671 billion parameter <a href="https://cameronrwolfe.substack.com/i/154340424/deepseek-v-and-deepseek-v">Mixture-of-Experts model</a>), so these distilled models are practically useful&#8212;<em>they are</em> <em>comparable to R1 but more cost sensitive and easier to use</em>. Additionally, the release of these distilled models matches recent trends in closed reasoning models (e.g., o1-mini and o3-mini). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iwuY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iwuY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 424w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 848w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 1272w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iwuY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png" width="1222" height="574" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1222,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:151199,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iwuY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 424w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 848w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 1272w, https://substackcdn.com/image/fetch/$s_!iwuY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aa60aba-ec97-40c9-b10a-1b1a262ff251_1222x574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p><strong>Distilling R1.</strong> To create these models, we begin with several sizes of two base models<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>&#8212;<em>Qwen-2.5 [20] and LLaMA-3 [21]</em>. We then train the base models via SFT over the 800,000 supervised training examples curated in the third stage of the training pipeline for DeepSeek-R1&#8212;<em>that&#8217;s it</em>!</p><p>This is a simple knowledge distillation pipeline, <em>but the results are impressive</em>. As shown above, the distilled Qwen2.5-14B model outperforms <a href="https://qwenlm.github.io/blog/qwq-32b-preview/">QwQ-32B-Preview</a>, which was the best open reasoning model prior to the release of R1. Additionally, even the smallest distilled models outperform standard closed LLMs that are not optimized for reasoning (e.g., GPT-4o), while the 32 and 70 billion parameter distilled models exceed the performance of o1-mini on most benchmarks.</p><blockquote><p><em>&#8220;Distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL require enormous computational power and may not even achieve the performance of distillation.&#8221;</em> - from [1]</p></blockquote><p><strong>Distillation versus RL.</strong> Although we see that distillation is effective in the discussion above, we might wonder whether we could get better results by just directly applying the large-scale RL training process used by DeepSeek-R1 to these smaller models. Interestingly, authors in [1] observe that distilling the Qwen2.5-32B base model from R1&#8212;<em>using the distillation approach described above</em>&#8212;outperforms directly training this model via large-scale RL; see below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IhEm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IhEm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 424w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 848w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1272w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png" width="1456" height="240" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:240,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IhEm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 424w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 848w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1272w, https://substackcdn.com/image/fetch/$s_!IhEm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc4ed3b-81bd-44a2-b8b7-5c0ec792f3cd_2464x406.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">(from [1])</figcaption></figure></div><p>In other words, the reasoning patterns discovered by large models are crucial for improving the reasoning capabilities of these smaller, dense models. However, authors in [1] do make the following additional points:</p><ul><li><p>It is possible that the performance of distilled models could be further improved via added training with RL.</p></li><li><p>&#8220;Advancing beyond the boundaries of intelligence&#8221;&#8212;<em>or creating new reasoning models that even exceed the performance of models like DeepSeek-R1</em>&#8212;will still require powerful base models and large-scale training with RL.</p></li></ul><p><strong>Other distilled reasoning models.</strong> Given the simplicity of training high-quality reasoning models via distillation, a wide variety of reasoning models were released by the research community following the proposal of R1. Some of the most notable releases are:</p><ul><li><p><a href="https://novasky-ai.github.io/posts/sky-t1/">Sky-T1</a> and <a href="https://novasky-ai.github.io/posts/reduce-overthinking/">Sky-T1-Flash</a></p></li><li><p><a href="https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation">Bespoke Stratos</a></p></li><li><p><a href="https://arxiv.org/abs/2502.03387">LIMO</a></p></li><li><p><a href="https://arxiv.org/abs/2501.19393">S1</a></p></li><li><p><a href="https://arxiv.org/abs/2501.11284">RedStar</a></p></li></ul><p>There are many more models that have been released as well! The current pace of reasoning model releases is reminiscent of the post-LLaMA era of LLM research. After the release of a powerful open base model (i.e., <a href="https://cameronrwolfe.substack.com/p/llama-llms-for-everyone">LLaMA</a>), we saw a wide variety of model variants released that were based on this model (e.g., <a href="https://crfm.stanford.edu/2023/03/13/alpaca.html">Alpaca</a>, <a href="https://lmsys.org/blog/2023-03-30-vicuna/">Vicuna</a>, <a href="https://bair.berkeley.edu/blog/2023/04/03/koala/">Koala</a> and many more). Now, we have access to a strong open reasoning model, as we are seeing a very similar trend! The research in this area is very interesting and deserving of its own post&#8212;<em>stay tuned</em>!</p><h2>Key Emerging Trends</h2><p>We have now learned about a variety of reasoning models, beginning with closed models like o1 or o3 and ending with a fully-outlined replication of these models in DeepSeek-R1. As we have learned about this research, there are a few common trends that begin to emerge. These trends, outlined below, make some important distinctions between research on reasoning models and standard LLMs. </p><p><strong>Long CoT (and inference-time scaling).</strong> The key distinction between reasoning models and standard LLMs is their output structure. Instead of just directly generating a final answer (with an optional concise explanation), reasoning models generate a long CoT that describes their reasoning process in great detail. This long CoT can be of variable length, enabling controllable compute costs at inference time: <em>longer CoT = more tokens = more compute</em>. In this way, using more compute at inference time&#8212;<em>by generating a longer CoT</em>&#8212;has become a tool that can allow users to dynamically improve a model&#8217;s reasoning capabilities. </p><p><strong>Self-evolution through RL.</strong> Obviously, the ability of LLMs to execute complex reasoning strategies within their long CoT is new and exciting. From recent research, we learn that the key contributor to the development of these special abilities is large-scale RL training. We see in [1] that such reasoning capabilities naturally emerge during RL if the model is correctly incentivized, usually via rules-based rewards that are deterministic and reliable. Additionally, we can further improve a model&#8217;s reasoning capabilities by using more compute for training via RL&#8212;<em>this is yet another scaling law that we can leverage</em>!</p><p><strong>Less supervision.</strong> The dependence of reasoning models upon human supervision is less pronounced relative to standard LLMs. In particular, rewards during RL training are derived primarily from rules-based systems, instead of relying upon human preferences. Of course, reasoning models still have several areas of dependence upon human supervision; e.g., the base model is trained with human-curated data and verification relies upon human-provided ground truth labels. However, there is still a big push by reasoning models like R1 (and especially R1-Zero) to demonstrate that reasoning capabilities can develop autonomously. </p><p><strong>Distillation is effective.</strong> Now that we have access to large and powerful reasoning models, we can distill the capabilities of these models into smaller, dense models using simple strategies! This finding has led to an explosion of research in this area, and we are likely to see many more efficient and distilled reasoning models released in the near future. One key question in this area is whether smaller models will generalize or <a href="https://arxiv.org/abs/2305.15717">struggle to fully match</a> the breadth of their teachers.</p><blockquote><p><em>&#8220;When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance.&#8221;</em> - from [1]</p></blockquote><p><strong>New problems to solve.</strong> Above all else, the advent of reasoning models has raised a variety of new (and interesting!) questions that we need to solve:</p><ul><li><p>How do we handle safety training for long CoT?</p></li><li><p>What is the best balance between general / reasoning capabilities?</p></li><li><p>What is the optimal role of SFT in training reasoning models?</p></li><li><p>How do we minimize &#8220;overthinking&#8221; in long CoT?</p></li><li><p>How do we handle efficient hosting of reasoning models?</p></li></ul><p>As mentioned at the beginning of this post, reasoning models are a truly new type of LLM that will force us to rethink existing frameworks. Solidified techniques that have been used for years (e.g., few-shot prompting) are becoming obsolete for these new models. <em>The field of LLM research is re-inventing itself once again</em>.</p><h4>New to the newsletter?</h4><p>Hi! I&#8217;m <a href="https://cameronrwolfe.me/">Cameron R. Wolfe</a>, Deep Learning Ph.D. and Machine Learning Scientist at <a href="https://research.netflix.com/research-area/nlp-and-conversations">Netflix</a>. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. If you like the newsletter, please subscribe, share it, or follow me on <a href="https://twitter.com/cwolferesearch">X</a> and <a href="https://www.linkedin.com/in/cameron-r-wolfe-ph-d-04744a238/">LinkedIn</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://cameronrwolfe.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://cameronrwolfe.substack.com/subscribe?"><span>Subscribe now</span></a></p><h4>Bibliography </h4><p>[1] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." <em>arXiv preprint arXiv:2501.12948</em> (2025).</p><p>[2] Liu, Aixin, et al. "Deepseek-v3 technical report." <em>arXiv preprint arXiv:2412.19437</em> (2024).</p><p>[3] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><p>[4] OpenAI. &#8220;Introducing OpenAI o1-preview&#8221; <em><a href="https://openai.com/index/introducing-openai-o1-preview/">https://openai.com/index/introducing-openai-o1-preview/</a> </em>(2024).</p><p>[5] OpenAI. &#8220;Learning to Reason with LLMs&#8221; <em><a href="https://openai.com/index/learning-to-reason-with-llms/">https://openai.com/index/learning-to-reason-with-llms/</a></em> (2024).</p><p>[6] OpenAI. &#8220;OpenAI o3-mini&#8221; <em><a href="https://openai.com/index/openai-o3-mini/">https://openai.com/index/openai-o3-mini/</a> </em>(2025).</p><p>[7] Rein, David, et al. "Gpqa: A graduate-level google-proof q&amp;a benchmark." arXiv preprint arXiv:2311.12022 (2023).</p><p>[8] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837.</p><p>[9] Zelikman, Eric, et al. "Star: Bootstrapping reasoning with reasoning." Advances in Neural Information Processing Systems 35 (2022): 15476-15488.</p><p>[10] Gulcehre, Caglar, et al. "Reinforced self-training (rest) for language modeling." arXiv preprint arXiv:2308.08998 (2023).</p><p>[11] Nakano, Reiichiro, et al. "Webgpt: Browser-assisted question-answering with human feedback." arXiv preprint arXiv:2112.09332 (2021).</p><p>[12] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).</p><p>[13] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training." arXiv preprint arXiv:2411.15124 (2024).</p><p>[14] Bespoke Labs. &#8220;Bespoke-Stratos: The unreasonable effectiveness of reasoning distillation&#8221; <em><a href="https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation">https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation</a> </em>(2025).</p><p>[15] Welleck, Sean, et al. "From decoding to meta-generation: Inference-time algorithms for large language models." <em>arXiv preprint arXiv:2406.16838</em> (2024).</p><p>[16] Aggarwal, Pranjal, Bryan Parno, and Sean Welleck. "AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement." <em>arXiv preprint arXiv:2412.06176</em> (2024).</p><p>[17] Chen, Xinyun, et al. "Teaching large language models to self-debug." <em>arXiv preprint arXiv:2304.05128</em> (2023).</p><p>[18] Wang, Yifei, et al. "A Theoretical Understanding of Self-Correction through In-context Alignment." <em>arXiv preprint arXiv:2405.18634</em> (2024).</p><p>[19] Huang, Jie, et al. "Large language models cannot self-correct reasoning yet." <em>arXiv preprint arXiv:2310.01798</em> (2023).</p><p>[20] Yang, An, et al. "Qwen2. 5 technical report." <em>arXiv preprint arXiv:2412.15115</em> (2024).</p><p>[21] Dubey, Abhimanyu, et al. "The llama 3 herd of models." <em>arXiv preprint arXiv:2407.21783</em> (2024).</p><p>[22] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." <em>arXiv preprint arXiv:2402.03300</em> (2024).</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For example, o1-preview did not have the ability to upload files, could not understand other modalities of data (e.g., images), and had no web search capabilities.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Although the details of how OpenAI controls the amount of inference-time compute used by o1-style models are not clear, it seems from <a href="https://openai.com/index/learning-to-reason-with-llms/">their blog post</a> that these models have multiple &#8220;settings&#8221; for the amount of compute that they can use at inference time. These settings are likely related to the length of the model&#8217;s long CoT, so high inference-time compute settings would simply generate very long chains of thought. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Technically, this benchmark is still unbeaten because o3 exceeded the maximum computational budget when achieving &gt;85% accuracy. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This benchmark was described by <a href="https://en.wikipedia.org/wiki/Terence_Tao">Terence Tao</a> as likely to be unsolved by AI systems for &#8220;several years at least&#8221;. There has been some recent questioning of OpenAI&#8217;s performance on this benchmark due to <a href="https://techcrunch.com/2025/01/19/ai-benchmarking-organization-criticized-for-waiting-to-disclose-funding-from-openai/">conflict of interest</a> between OpenAI and the organization that created this benchmark (<a href="https://epoch.ai/">EpochAI</a>). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Notably, o3-mini does NOT have vision support, unlike o1. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In contrast, RLHF trains the reward model over various kinds of human preferences, usually via a <a href="https://gombru.github.io/2019/04/03/ranking_loss/">ranking loss</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In addition to these two techniques, we could also perform some sort of search (e.g., <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">monte carlo tree search</a>)&#8212;see <a href="https://arxiv.org/abs/2405.00451">here</a> for an example. However, we can also categorize search-based methods as generating more tokens at inference time. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The length of a long CoT may vary depending on model settings (e.g., OpenAI provides several settings for &#8220;reasoning effort&#8221;) or problem difficulty. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>There is also a <a href="https://arxiv.org/abs/2401.02954">DeepSeek-v1 model</a>, but this model is dense (i.e., not an MoE) and much different from the model family that is used for DeepSeek-R1. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>The compute savings come from the fact that we do not have to train (or run inference on) any reward models. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>See <a href="https://platform.openai.com/docs/models#o1">here</a> for a full list of OpenAI&#8217;s o1 models. For clarity, the <code>o1-0912</code> model mentioned in [1] is the same as the <code>o1-preview</code> model.  </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>For example, the model lacks markdown formatting and highlighting within its answers, which is a common feature for modern LLMs. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>In [1], authors refer to the long CoT outputs generated by the DeepSeek-R1 model variants as &#8220;trajectories&#8221;. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Notably, this is in direct contrast to the (original) approach adopted by OpenAI. o1-style models have their long CoT hidden from the end user, and these reasoning traces do not undergo any safety training. The rationale for this strategy is to allow the model to be more transparent in its trajectory, which improves interpretability. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>The exact models used are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. Notably, we do not always start with the base model&#8212;<em>many of these models have undergone post training</em>!</p></div></div>]]></content:encoded></item></channel></rss>