Deep (Learning) Focus

RL Scaling Laws for LLMs

Cameron R. Wolfe, Ph.D. — Mon, 20 Apr 2026 09:33:44 GMT

(from [1, 2, 3])

Scaling is one of the most impactful concepts in the history of AI research. For large language models (LLMs), scaling has mostly been studied in the context of pretraining, where rigorous scaling laws have allowed us to clearly define the relationship between compute and performance. Inspired by these predictable trends, the LLM research community has empirically validated pretraining scaling laws across several orders of magnitude. Through this process, we have discovered that meaningful improvements in model capabilities can be consistently achieved by investing more data and compute into pretraining.

“The way ML used to work is that people would just tinker with stuff and try to get interesting results. That’s what’s been going on in the past. Then the scaling insight arrived. Scaling laws, GPT-3, and suddenly everyone realized we should scale. This is an example of how language affects thought. Scaling is just one word, but it’s such a powerful word because it informs people what to do.” - Ilya Sutskever

The success of scaling laws in the context of pretraining has inspired the same concept of scaling to be applied in other areas of the LLM training process. Most notably, scaling now plays a key role in reinforcement learning (RL), where researchers have demonstrated smooth and predictable model capability improvements with larger-scale training. In this overview, we will study scaling laws in the context of RL. Rather than studying this topic in isolation, however, we will first build a deep understanding of scaling laws for pretraining and aim to outline how scaling laws have evolved in their application to RL. As we will see, the exact definition of scaling laws is completely different between these two domains, but the fundamental concept of scale remains powerful in both.

Scaling Law Fundamentals

Many early advancements in LLMs were driven by scaling up the pretraining process. Put simply, investing more compute into pretraining—by training a larger model on more data—yields better performance. We can rigorously define the relationship between compute and performance via a scaling law [13], or an equation that models the decrease in an LLM’s test loss as compute increases. As we will see, the pretraining process for an LLM follows smooth trends that can be accurately predicted via a scaling law, allowing the performance of larger models to be estimated before they are even trained. This ability to granularly forecast the expected result of a certain training configuration has many benefits:

Significant compute investments are less daunting, as we know what the result of this invested compute will be.
Iteration speed for experiments can be increased by running smaller scale experiments and extrapolating their results.

We will now build an understanding of scaling laws for LLM pretraining from the ground up. This knowledge of the mechanics and practical application of scaling laws for pretraining is needed to form a contrast with the scaling laws used for RL training. Pretraining scaling laws are highly standardized and follow a well-defined approach to estimate very particular training metrics. On the other hand, RL scaling laws—while still informative—tend to be much messier and bespoke, both in terms of their structure and the quantities that we measure.

Further learning. Although we cover the key details of pretraining scaling laws in this section, this is a popular and complex topic with a long history of study. For more details and links to further reading, please see the overview above.

What is a power law?

We can model the LLM pretraining process with a power law. At the simplest level, a power law describes a relationship between two quantities. A basic power law can be expressed as y = a × x^p. The two quantities being studied are x and y, while a and p are constants that describe their relationship—a controls the vertical position of the curve, while p controls the steepness or direction of the curve. Plotting this simple power law function gives us the figure shown below.

Plot of a basic power law between x and y

We provide the power law plot in both normal and log scale because most papers that study LLM scaling laws tend to plot their results in log scale. However, the plots provided for LLM scaling do not look like the plot shown above—they are usually flipped upside down; see below for an example. This is just an inverse power law, which can be formulated as y = a × (1 / x)^p. This is nearly identical to a standard power law, but we just use a negative exponent for p. As we can see below, using a negative exponent for p flips the power law plot upside down.

(from [13])

LLM power laws. This inverse power law, when plotted with a log scale, yields the signature linear relationship that characterizes LLM scaling laws. The two quantities we model via this inverse power law in LLM pretraining are:

The LLM’s test loss L—the next token prediction or cross entropy loss in particular (or another entropy-based metric like bits-per-byte or perplexity)—measured over an in-distribution, held-out validation set.
The compute C spent during pretraining that is estimated via the number of training FLOPs C = 6 × N × D, where N is the number of model parameters and D is the number of tokens observed during pretraining.

The factor of six used when estimating training compute comes from the fact that the LLM performs a single forward and backward pass during each training step. A single forward pass costs about 2N FLOPs per token, and the backward pass is 2× the cost of the forward pass. Therefore, a training step costs about 6N FLOPs per token, and we multiply this quantity by the total number of tokens observed during training to yield the C = 6 × N × D approximation. This approximation of pretraining compute was used in one of the first papers to study pretraining scaling laws [13], leading to its adoption in other work on the topic.

Neural Scaling Laws [13] and Chinchilla [14]

To develop a more concrete understanding of scaling laws for pretraining, we will overview two seminal papers [13, 14] that established the foundational principles of scaling. In [13], authors study the impact of several settings on the pretraining process, discovering that performance improves smoothly as we increase:

Model parameters.
Data volume.
Training compute.

More specifically, a power law relationship is observed between each of these factors and the LLM’s test loss when performance is not bottlenecked by either of the other two factors. To observe these power laws, LLMs with sizes up to 1.5B parameters are trained on several subsets of the WebText2 corpus. As shown below, the performance of these models steadily improves as we increase model size, data volume, or compute. These trends span eight orders of magnitude in compute, six orders of magnitude in model size, and two orders of magnitude in dataset size. The exact power law relationships and equations are provided below.

(from [13])

Each of these equations is very similar to the inverse power law equation that we saw before, but we set a = 1 and have an additional multiplicative constant (i.e., C_c, D_c, or N_c) inside of the parenthesis. To fit these power laws, we train a collection of models with different sizes while varying the amount of compute and data used for training. We can then measure the test loss for each of these models, forming a dataset of training configurations with a corresponding test loss. We can then fit the parameters of our power law to this data. Although there are many ways to fit a power law, one common approach for simple power law relationships is to fit a linear model on the observed data in log-log space.

What do power laws tell us? Although the power law plots provided above look promising, we should notice that these plots are generated using a log scale. If we generate normal plots (i.e., without log scale), we get the figures below, where we see that the shape of the power law resembles an exponential decay. In this way, increasing the quality of an LLM becomes exponentially more difficult with scale.

Power law plots without log scale

Compute-optimal allocation. The key scaling trends for LLM pretraining were established in [13], where we see that the LLM’s test loss follows smooth power law trends with compute, model parameters, and data volume. One important takeaway from this analysis is that, given a fixed compute budget, we get the best results by training a larger model over less data—usually ending the training process before the model fully converges. Chinchilla [14] builds upon this analysis with an extensive study of optimal compute allocations for pretraining. In particular, the analysis in [14] studies how to optimally allocate a fixed compute budget between model parameters and the number of training tokens to minimize the test loss.

(from [14])

By training over 400 LLMs of varying sizes over different amounts of data, we learn that the scaling recommendations provided in [13] lead most LLMs to be undertrained—training these models on more data would yield better results. More specifically, Chinchilla finds that model and data scale should be increased proportionally for pretraining to be compute optimal; see above. This study is conducted using the same scaling law formulations, but authors explicitly sweep various model and data size combinations under a fixed compute budget to find the optimal balance of data and parameters for minimizing test loss.

Scaling Laws beyond Pretraining

Until recently, most of the compute used for training an LLM was invested into pretraining. We mostly focused on scaling up the pretraining process, while post-training was a less expensive endeavor used to optimize a model’s style and behavior. The advent of reasoning models drastically changed these standards.

“Scaling RL compute is emerging as a critical paradigm for advancing LLMs. While pre-training establishes the foundations of a model; the subsequent phase of RL training unlocks many of today’s most important LLM capabilities, from test-time thinking to agentic capabilities… Deepseek-R1-Zero used 100,000 H800 GPU hours for RL training – 3.75% of its pre-training compute. This dramatic increase in RL compute is amplified across frontier LLM generations, with more than 10× increase from o1 to o3 and a similar leap from Grok-3 to Grok-4.” - from [1]

Compared to a standard LLM, reasoning models output a long reasoning trace or chain of thought—typically encapsulated by … tokens—before providing a final answer. This idea was popularized by OpenAI’s o-series models, which demonstrated drastic improvements in reasoning capabilities by training models to generate reasoning tokens prior to their final answer. The initial release of o1 highlighted two important new axes of scaling:

RL training compute.
Inference-time compute.

As shown in the figure below, we observe a smooth increase in performance—resembling a scaling law—by increasing RL training and inference-time compute.

(source)

This breakthrough in reasoning was followed by the release of the open-weight DeepSeek-R1 [15] reasoning model, which performed on par with o1 and provided a technical report describing how the model was trained. Today, such reasoning capabilities have become the new standard for both open and closed models.

RL for reasoning. Reasoning models are trained via RL with verifiable rewards. As shown above, model performance improves as we scale up the RL training process. As a result, recent LLM research has heavily focused on scaling RL training for verifiable tasks (e.g., math or coding). Large-scale RL training has unlocked huge improvements in general reasoning capabilities and the quality of coding agents. However, a non-negligible fraction of total training compute is now being spent on RL, and optimally allocating compute for RL is difficult.

For pretraining, we use known scaling laws to reason about how to properly invest available compute—these laws provide a standardized understanding of how performance changes with model size, data, and compute. Given that compute is the primary bottleneck to AI progress, we need analogous scaling laws that enable us to better understand and predict the results of RL training at scale.

Background on Reinforcement Learning

We will soon build upon our understanding of pretraining scaling laws to study RL scaling laws. However, we cannot properly interpret the scaling properties of RL without first understanding the basics of the RL training process. In this section, we will briefly outline the key concepts needed for this discussion, focusing on the GRPO algorithm and its many variants. We focus on GRPO in particular because it is the most common algorithm to use for large-scale RL training with reasoning models—at least in publicly disclosed research. For example, the popular DeepSeek-R1 reasoning model [15] uses GRPO for RL training.

Group Relative Policy Optimization (GRPO) [4]

“We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.” - from [4]

Proposed in [4], GRPO is an RL optimization algorithm that builds upon prior algorithms like Proximal Policy Optimization (PPO). Whereas PPO was the most popular RL optimizer for LLMs in the RLHF era, GRPO is now almost universally used for large-scale RL with reasoning models1. GRPO is a simpler and lighter weight optimizer compared to PPO, which has aided its adoption by the LLM research community (especially for open research). The main change made by GRPO relative to PPO is in the advantage estimation technique.

(from [4])

Instead of using a value model and GAE to estimate advantage as in PPO, GRPO estimates the advantage by sampling multiple completions or rollouts (i.e., a “group” of completions) for each prompt in a batch and using their rewards to form a baseline. This group-derived baseline replaces the value function and allows GRPO to not train a value model, thus drastically reducing the memory and compute overhead of RL training. Concretely, the advantage for completion i is computed by normalizing the reward for this completion r_i with the mean and standard deviation of rewards in the group; see below. The same advantage value is assigned to every token t in the completion.

(from [4])

Intuitively, GRPO looks at the relative difference in rewards between multiple completions to the same prompt. The advantage is defined as the delta of one completion’s reward relative to the average reward observed in a group. This approach teaches the model to emphasize completions with higher-than-average reward.

Loss function. Once the advantage has been computed, the loss function used for GRPO is quite similar to that of PPO. The center point of the loss function for both PPO and GRPO is the token-level policy (or importance) ratio. Specifically, this is the ratio of the probability assigned to a token by the current policy and the policy used to generate the rollout (i.e., the “old” policy); see below.

The policy (or importance) ratio

Using this policy ratio and the advantage, we can compute the loss function for GRPO as shown below. This loss function uses the same clipping mechanism proposed by PPO; see here for more details. Similarly to PPO, GRPO takes the minimum of a clipped and unclipped objective in its loss formulation, where the objective is just the product of the policy ratio and advantage for token t.

GRPO loss function

The inner term of the loss function for GRPO is computed on the token level. By default, we aggregate this loss over our batch by:

Averaging the token-level losses within each completion.
Averaging completion-level losses over the group.

The exact manner in which we aggregate the loss in GRPO can change, and we will soon see that the manner in which we aggregate loss over a batch can impact performance. Given that GRPO computes advantage based on group-level reward statistics, we must sample a large number of completions per prompt to obtain a reliable advantage estimate. As a result, GRPO usually needs relatively large batch sizes in order for training to be stable2; see here for more details.

GRPO & reward models. GRPO is mostly used in verifiable reward settings without a neural reward model. A common misconception about GRPO is that it eliminates the need for a reward model, but GRPO can be used with or without a reward model. In fact, the original GRPO paper [4] used a reward model instead of verifiable rewards! Removing the reward model is a benefit of verifiable rewards, not an intrinsic benefit of GRPO itself—the primary advantage of GRPO is the elimination of the value model. For more details on GRPO, including example code and a discussion of prior work that led to GRPO, please see the overview below.

Recent GRPO Variants

The GRPO algorithm exploded in popularity after the release of DeepSeek-R1 [15] as many researchers began to replicate or extend results from the paper. Despite details of the model being openly published, fully replicating the training pipeline for DeepSeek-R1 proved non-trivial, leading many subsequent works to propose tweaks to the GRPO algorithm. In this section, we will overview the most successful modifications that are now commonly adopted for better RL training.

Token-level importance and sequence-level advantage in GRPO

Group Sequence Policy Optimization (GSPO) [5] modifies the GRPO objective by computing the policy ratio on a sequence level rather than at the token level. The GRPO loss (shown above) introduces a misalignment between how the model is optimized and how rewards (or advantages) are assigned:

Advantage is computed at the sequence level (in an outcome reward setting).
Policy ratios—and the loss in general—are computed at the token level.

As shown in [5], per-token policy ratios tend to have high variance during RL training, which increases the variance of policy gradients and, in turn, leads to training instability. Specifically, the high variance of policy ratios can lead a single token to dominate the loss expression or even cause numerical instability during the RL training process. This problem is particularly acute when training LLMs on long sequences or using large, sparse Mixture-of-Experts models.

To protect against this variance, token-level importance ratios are clipped in the range [1 - ε, 1 + ε]. This clipping operation is formulated such that tokens have zero contribution to the gradient update if they are clipped within the objective. The importance ratio captures the change in a token’s probability after multiple policy updates over the same data—we clip tokens for which we observe a sufficiently large change in their probability. However, simply removing the contribution of these tokens to the policy gradient can be problematic. These can be rare (low probability) tokens that are identified as important by the policy update. Such tokens may capture key reasoning steps that the model needs to learn, but we suppress the learning process via the clipping operation.

The key idea of GSPO is to compute importance ratios for the sequence rather than each token. Once we have derived the sequence-level importance ratio, the GSPO training objective is almost identical to that of GRPO; see above. We apply clipping to the sequence-level importance ratio, use the same advantage, and take a minimum of clipped and unclipped objectives at the sequence level.

The sequence-level importance ratio can be derived by factorizing the probability of a sequence into a product of individual token probabilities. However, authors in [5] choose to define the sequence-level importance ratio using the logarithmic form of a geometric mean, which is defined as shown below. This geometric mean is taken over token-level probabilities, which normalizes the sequence-level policy ratio by the length of the sequence. By using this approach, we ensure that importance ratios for sequences of different lengths are comparable, as well as improve numerical stability—especially for long sequences—by formulating the ratio as a sum over logprobs instead of a product over raw probabilities.

Geometric mean definition

We see in [5] that GSPO improves training stability, sample efficiency, and overall performance. The stability of GSPO is found to be especially useful when training large MoE models, such as Qwen3-235B-A22B. For these reasons, GSPO was adopted in the training process for the popular Qwen 3 model series.

Dynamic Sampling Policy Optimization (DAPO) [6] is not a single algorithm, but rather a modified recipe that proposes several useful tweaks to the vanilla GRPO optimizer. We see in [6] that the vanilla GRPO optimizer suffers from notable issues like:

Entropy collapse: the entropy of the model’s next token distribution collapses during the training process. Probability mass is primarily assigned to a single token and outputs are more deterministic.
Reward noise: the training reward is very noisy and does not steadily increase during the RL training process.
Training instability: the training process is unstable and may diverge.

To solve these issues, authors in [6] propose a suite of tricks that can be used in tandem. First, the entropy collapse problem in GRPO is shown to be caused by the fact that clipping emphasizes high probability tokens and punishes low probability (exploratory) tokens. The “clip higher” approach is proposed in [6] to solve this issue by decoupling lower and upper clipping bounds. Specifically, we clip in the range [1-ε_low, 1+ε_high], where ε_low=0.2 (default setting in GRPO) and ε_high=0.28 in [6]. Increasing ε_high prevents entropy collapse and improves overall GRPO performance; see below.

(from [6])

As RL training progresses, the number of samples for which all completions in a group are accurate increases. Such groups have zero advantage and, in turn, no impact on the policy gradient. As a result, these groups effectively reduce the batch size in GRPO, leading to noisier gradient estimates and degraded sample efficiency. Dynamic sampling is proposed in [6] to solve this problem by:

Filtering all prompts with perfect accuracy from a batch.
Continuing to sample prompts until we have a full batch.

This approach can increase the cost of constructing a batch, as we dynamically continue sampling prompts until the batch is full. However, we see in [6] that this cost is offset by the improved sample efficiency of RL training; see below.

(from [6])

Finally, DAPO also proposes a modified loss aggregation strategy and a new approach for handling completions that exceed the maximum sequence length. Vanilla GRPO aggregates token-level losses by i) computing the average loss in each sequence and ii) averaging sequence-level losses in the batch. However, this approach introduces a subtle bias—tokens within longer sequences have relatively less contribution to the overall batch gradient. To solve this, DAPO computes a token-level loss that is simply averaged over all tokens in the batch; see below.

(from [6])

Additionally, a length-based penalty term is introduced to the reward to apply a “soft” punishment to completions that are too long. Instead of assigning a hard negative reward to any completion that exceeds the maximum sequence length, authors in [6] argue that we should slowly increase the overlong penalty to its maximum value as we approach the maximum sequence length. This approach provides a smooth length penalty from which the model can effectively learn.

(from [7])

GRPO Done Right (Dr. GRPO) [7] outlined two key sources of bias that exist in the vanilla GRPO algorithm (depicted above):

Response-level length bias: GRPO normalizes the summed loss of tokens in each sequence by the total number of tokens in that sequence, leading to biased gradient updates based on the length of each response.
Question-level difficulty biases: the standard deviation term in the denominator of the advantage formulation in GRPO causes the advantage to become very large for questions that are either too easy (i.e., most responses have a reward of one) or too hard (i.e., most responses have a reward of zero).

To solve the first bias, Dr. GRPO aggregates the loss by summing token-level losses in a sequence and dividing this sum by a fixed constant MAX_TOKENS, thus removing response length from the aggregation process. The difference between this loss aggregation strategy and that of DAPO is nuanced. In DAPO, each token in the batch has an equal contribution to the gradient. As a result, DAPO still places more emphasis upon longer sequences in a batch, as these sequences have a larger ratio of total tokens (even if all tokens are weighted equally across the batch). On the other hand, replacing the sequence-level average with division by a fixed constant in Dr. GRPO effectively decouples aggregation from response lengths and, in turn, protects against length-based optimization bias.

The question-level difficulty bias is handled by removing the standard deviation term from the advantage estimator; see below. By making these two changes, Dr. GRPO improves training stability and efficiency, while making the resulting model more token efficient (i.e., responses are not artificially long).

(from [7])

Truncated Importance Sampling (TIS) [9] attempts to address mismatches in token probabilities introduced by efficient RL training frameworks. As we know, there are two main operations that occur during RL training: i) sampling rollouts and ii) computing policy updates. In modern RL frameworks, these operations are usually handled via separate engines:

Optimized inference engines like vLLM or SGLang—often with lower precision inference (e.g., int8 or fp8) for extra efficiency—are used to generate rollouts.
Distributed training frameworks like FSDP or DeepSpeed are used to compute policy updates.

Given that generating rollouts consumes the majority of compute during RL3, this approach is usually necessary—we want the inference process to be as efficient as possible. However, the use of separate engines can also introduce non-negligible differences in the token probabilities produced by each engine; see below.

(from [9])

Additionally, this difference in token probabilities is not easy to fix by simply standardizing implementations across engines. Authors in [9] investigate several code interventions to decrease the gap in token probabilities with little success, and this process would have to be repeated for every combination of engines used for RL training. Instead, a more flexible approach is proposed in [9] that uses an importance sampling term to automatically correct for this engine mismatch during RL within the policy gradient expression. The exact expression is shown below and is formulated as a REINFORCE-style policy update.

Policy gradient with different engines and TIS

The above expression explicitly uses a different engine for sampling rollouts (sampler) and computing policy updates (learner). The importance ratio between these two engines is simply the quotient of probabilities from learner and sampler engines. In [9], authors truncate this importance ratio by capping it at a maximum size of ρ. Compared to the clipping operation used in PPO or GRPO, this truncation operation has a few differences:

We are directly truncating the importance ratio. Clipping is also applied to the importance ratio, but it is followed by a minimum of clipped and unclipped objectives, making the operation two-sided.
We truncate the importance ratio with a maximum value of ρ, a one-sided truncation that simply prevents extreme up-weighting.

However, the practical application of TIS is quite simple—we just compute the truncated importance ratio and multiply our policy gradient expression by this ratio. As shown below, including this importance ratio in the policy gradient has a huge impact on RL training stability and model performance, leading to quick adoption of TIS in popular training frameworks (e.g., verl and OpenInstruct).

(from [9])

We formulated TIS above using a sequence-level importance ratio with REINFORCE. However, we can also create a token-level formulation with PPO or GRPO; see below. As we can see, the truncated importance ratio is computed in addition to the other components of the PPO-style policy gradient expression. We then multiply the existing expression by this correction term, and this can be done either at the sequence level—similarly to the gradient expression used by GSPO—or at a token level—as in the normal expression for PPO or GRPO.

(from [9])

Token-level TIS is commonly used in practice due to aligning well with PPO and GRPO objectives and is, therefore, relatively simple to integrate into existing training frameworks in addition to offering stability benefits. In recent work [10], however, authors have argued from an analytical perspective that sequence-level TIS is less biased compared to token-level TIS. Currently, there is no clear consensus on which of these approaches is superior. The best implementation in practice may differ depending upon the setup or domain being considered.

Clipped Importance Sampling-Weight Policy Optimization (CISPO) [8] is another recent RL variant that, similarly to TIS, builds upon a REINFORCE-style objective with an added importance ratio. When using a PPO-style clipping approach, we know any token that is clipped from the objective has no contribution to the policy gradient. In [10], authors observe empirically that the important “fork” tokens in the model’s reasoning trace (e.g., “aha” or “wait”) are rare and are initially assigned low probabilities in the base model. Due to the importance of these tokens, their probability usually increases drastically after the first policy update, leading these tokens to have a very large importance ratio—that is then clipped by the PPO objective—for subsequent policy updates.

“We found that tokens associated with reflective behaviors… were typically rare and assigned low probabilities by our base model. During policy updates, these tokens were likely to exhibit high [importance ratio] values. As a result, these tokens were clipped out after the first on-policy update, preventing them from contributing to subsequent off-policy gradient updates… These low-probability tokens are often crucial for stabilizing entropy and facilitating scalable RL.” - from [10]

As a result, important fork tokens are usually masked from the PPO-style loss after the first policy update for a batch of data. Although this masking may not always be an issue (i.e., most standard RL setups perform only ~2-4 updates on each batch of sampled data), MiniMax-M1 performs 16 policy updates for each batch of data. Therefore, important tokens being masked out of the loss after only one or a few updates can significantly damage training efficiency. To solve this issue, authors adopt the modified REINFORCE-style loss shown below. As we can see, CISPO adopts some of the recommendations proposed by DAPO [6] as well, including the token-level loss formulation to correct for length biases.

CISPO loss (from [10])

This loss formulation applies a stop gradient to the clipped importance ratio, ensuring that each token contributes to the loss even when it is clipped. Put differently, the importance ratio is used as a weight that controls the contribution of a token to the policy gradient. Clipping in CISPO puts a cap on this weight, ensuring no single token is over-amplified due to a large importance ratio.

Clipping in the GRPO loss

When we look at the GRPO objective, the clipping mechanics are quite different; see above. In particular, token probabilities are only present in the importance ratio, and the gradient flows through the token probability terms inside the importance ratio. When the importance ratio is clipped, the gradient for that token is zero and there is no contribution to the policy gradient. The modified clipping approach used by CISPO ensures that all tokens contribute to the policy gradient, improving the stability and efficiency of RL; see below.

(from [10])

The loss formulation of CISPO and TIS look quite similar, but these algorithms—despite both using an importance ratio—aim to solve different issues. CISPO uses the same definition of the importance ratio adopted in PPO and GRPO. This importance ratio is clipped to ensure that token probabilities do not change too much over a single batch of data, thus enforcing a trust region. CISPO simply modifies the manner in which the importance ratio is clipped to ensure that all tokens continue contributing to the policy gradient (with a capped weight) even if they are clipped. On the other hand, TIS uses an importance ratio to capture the difference in token probabilities between training and inference engines, thus correcting for the mismatch between engines during RL training.

Further reading. For more details on each of these algorithms, please see the overview linked below, which covers many GRPO variants and modifications.

Regularization for RL

There are two primary regularization terms commonly added to RL training:

Entropy bonus: rewards the LLM for remaining uncertain and helps to avoid overly-confident token distributions.
KL divergence: anchors the policy to a reference policy throughout training to prevent the LLM from changing too much.

Regularization terms are less commonly used in recent RL training pipelines, but we will see examples of both strategies being applied later in the overview. To avoid future confusion, we will briefly explain each regularization strategy now.

KL divergence. During RL training, we can compute the Kullback-Leibler (KL) Divergence between the current policy and a reference policy—usually the policy from before RL training begins (i.e., the base model). There are several techniques that can be used to approximate the KL divergence between two models; see here. The easiest—and most common—approximation of KL divergence [7] is the difference in token-level log probabilities between the current policy and the reference policy. This approximation and another common variant used in the original GRPO paper [12]4 are outlined below. Both estimators are usually supported in open RL implementations; e.g., see their implementation in TRL.

Common approximations of KL divergence

After the KL divergence has been computed, there are two common ways that it can be incorporated into the RL training process:

By directly subtracting the KL divergence from the reward.
By adding the KL divergence to the loss function as a penalty term.

Both of these approaches can be found in practice depending on the RL optimizer—or exact implementation—being used. PPO incorporates KL divergence into the reward, while GRPO adds it as a penalty to the objective function; see below.

Two ways of incorporating the KL divergence into RL

Due to the popularity of GRPO, recent RL implementations more often include the KL divergence in the loss, but completely omitting the KL divergence—and not using any regularization—is becoming increasingly common. During training, the KL divergence term penalizes our policy from becoming too different from the reference policy, but drifting away from the reference policy may not be negative if we are performing large-scale, reasoning-oriented RL training.

Entropy bonus. From an information theory perspective, entropy captures the level of uncertainty associated with the possible states for a variable:

High entropy: probability mass is spread across many outcomes.
Low entropy: probability mass is concentrated on a few outcomes.

In the LLM domain, we can measure the entropy of a model’s token distribution—low entropy means that the LLM places most of its probability into a small set of tokens and vice versa. Specifically, we can compute entropy using the equation below.

Entropy of an LLM token distribution

Usually, entropy is computed for each token (i.e., at each decoding step) and then averaged across the generated trajectory. After computing the entropy, we can turn it into an entropy bonus and use it as a regularization term by simply scaling it with a coefficient β and incorporating it into either the reward—this is done in the original PPO paper—or the objective function. The purpose of the entropy bonus is to prevent the LLM from becoming overly confident in its token distribution and, in turn, avoid entropy collapse that prevents the policy from exploring during training. Similarly to the KL divergence, entropy bonuses are now more commonly incorporated into the loss function. In fact, we will soon study a paper that adds an entropy bonus to the GRPO loss [3].

Scaling the RL Training Process

“While RL compute for LLMs has scaled massively, our understanding of how to scale RL has not kept pace; the methodology remains more art than science.” - from [1]

Scaling laws provide researchers with the ability to extrapolate the performance of expensive training runs from those that require less compute. Despite the expanding role of RL in training frontier models, however, our understanding its fundamental scaling properties remain somewhat rudimentary, especially relative to pretraining. In this section, we will take a look at several notable papers that are trying to solve this issue. As we will see, RL scaling laws are very different from those used for pretraining, and many of these differences arise from the massive design space of RL training. Put simply, RL is complicated, and we are far from a single standardized approach for handling RL “correctly”. However, there are still useful scaling insights that can be gleaned from this work that will help us to allocate available compute for RL experiments more effectively.

The Art of Scaling Reinforcement Learning Compute for LLMs [1]

Unlike pretraining, RL has no established predictive scaling laws for reliably estimating performance trends. Best practices for RL are found in new algorithm proposals, but these findings may not generalize at scale. Model reports also frequently provide practical recommendations for RL training, but these methods are often anecdotal and dependent upon training settings. As a result, we must test RL design choices the hard way—by running large-scale experiments and seeing what works. Given the computation cost of modern RL, this approach is a major bottleneck that limits iteration speed and hinders technical progress. We need a standardized approach to identify strong RL candidates at smaller scales.

RL scaling. In [1], authors model the RL training process with sigmoidal compute-performance curves. We fit such a curve separately for each RL training run to model the relationship between expected reward—calculated over a validation set at regular intervals during training—and compute (in units of GPU hours); see below.

Saturating S-curve for RL scaling (from [1])

This curve models the relationship between two quantities:

Reward gain: the difference between the reward after RL training with compute C and the initial reward before RL training.
Asymptotic reward ceiling: the maximum possible gain in reward we can achieve by spending unlimited compute on RL training.

The relationship between these quantities is controlled by the term 1 / [1 + (C_mid/C)^B]. This term includes i) the compute level at which we reach the midpoint of the curve C_mid, ii) an efficiency exponent B for the steepness of the curve, and iii) the current compute level C. Intuitively, this term captures how much of the total possible performance gain has been unlocked by running RL with compute C. The shape of this curve is visualized in the figure below.

(from [1])

According to this structure, RL training is flat in terms of reward during the early phase of training, then undergoes a phase of fast improvement before reaching a plateau. Authors find in [1] that these saturating curves between compute and reward robustly model the RL training process in practice. As we will see, this structure has also been validated and adopted by other work on RL scaling.

We fit this curve to the results of each RL training run, allowing us to compare results obtained from multiple runs with different training setups. There are two ways that these scaling curves may differ:

Their value of A may be different, indicating that one training setting achieves better asymptotic performance.
Their value of B (or C_mid) may be different, meaning that one training setting is more compute efficient than the other.

However, not all training settings yield a benefit in both A and B. In such cases, authors prioritize asymptotic improvements over efficiency improvements, arguing that improving the model’s asymptotic performance is more valuable because a degradation in efficiency can be offset by just training for longer.

Applying RL scaling laws. The RL scaling laws proposed in [1] allow us to extrapolate the performance of a training run without incurring the full cost. We can use the early phase of training to predict what would be the final performance after training with more compute. The scaling law is fit using validation performance measured at regular intervals during training5, allowing us to efficiently assess the scalability of different changes to the RL training process.

“We propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.” - from [1]

Authors in [1] use this approach to derive an optimal training recipe, called ScaleRL. Beginning with a baseline setup, authors test interventions to the RL training process in multiple phases of increasing scale—4K, 8K, 16K, and 100K GPU hours. In each phase, scaling laws are fit to extrapolate the performance of each setting, allowing authors to both i) verify the accuracy of their scaling law formulation and ii) efficiently discover scalable design choices for RL.

“We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs.” - from [1]

Baseline RL setup. RL experiments in [1] primarily use the Polaris-53K math-focused reasoning dataset. Analysis begins with a baseline RL recipe that uses the GRPO loss with no KL divergence and the clip higher approach from DAPO [6]. All models produce a reasoning trace before their final output. A context length of 16K tokens is used—12K reasoning tokens, 2K input tokens, and 2K output tokens—as well as a batch size of 768—a total of 48 prompts with 16 rollouts each.

To enforce the 12K-token reasoning budget, interruptions are used during training. When a reasoning trace reaches 12K tokens, we append a static end-of-reasoning phrase “Okay, time is up. Let me stop thinking and formulate a final answer. " to the model’s output so that the model can stop reasoning and begin generating a final answer.

(from [1])

From an engineering perspective, authors adopt a split generator-trainer approach, where a subset of GPUs run optimized inference engines (e.g., vLLM) for generating rollouts and remaining GPUs run a training backend (e.g., FSDP) to update policy parameters. To improve efficiency, an asynchronous RL training approach is adopted. Two algorithms are considered: asynchronous PPO and PipelineRL [11]. As shown above, both approaches achieve similar asymptotic performance A, but PipelineRL has significantly improved compute efficiency B due to its design principles that minimize GPU idle time (e.g., in-flight weight updates). Authors also find in [1] that it is important to bound the degree of asynchrony by ensuring policy updates are not more than K steps ahead of the generated rollouts. K = 8 is found to be optimal for this setting; see above.

Ablating RL modifications. To build upon the baseline RL recipe, authors run small-scale RL experiments (i.e., ~4-8K GPU hours) to test the impact of various RL design choices. The baseline loss is first compared with both GSPO and CISPO loss formulations; see below. We see that both GSPO and CISPO noticeably outperform the baseline formulation in terms of asymptotic performance. CISPO also has marginally improved compute efficiency relative to GSPO, leading authors to use CISPO for remaining experiments in [1].

(from [1])

As we learned from the explanation of TIS, using different engines for generating rollouts and computing policy updates can lead to a non-negligible mismatch in log probabilities between training and inference. One change that we can make to minimize this mismatch is using full (float32) precision in the LLM’s language modeling head—the final linear layer that predicts token probabilities. As shown above, using a full precision head in training and inference engines significantly improves both asymptotic performance and compute efficiency. We should note, however, that authors do not adopt any approach for correcting the trainer-generator mismatch (e.g., TIS), which could also help to solve this issue.

Authors also test different loss aggregation strategies, including vanilla GRPO aggregation versus DAPO-style aggregation, finding that the loss aggregation proposed by DAPO [6] tends to perform the best; see below. In a similar vein, several advantage normalization techniques are tested. Specifically, authors test dividing mean-centered rewards by the standard deviation of rewards in a group—as in vanilla GRPO—or the standard deviation of rewards in the entire batch, as well as not dividing the mean-centered reward by anything—as in Dr. GRPO [7]. All techniques perform comparably, indicating that advantage normalization does not significantly impact asymptotic performance; see below. Remaining experiments normalize the advantage using the standard deviation of rewards across the batch due to the slight boost observed in asymptotic performance.

(from [1])

Authors also discover data curation and filtering strategies that benefit the asymptotic performance of RL. Prompts with zero variance in rewards across a group have zero advantage and, therefore, no contribution to the policy gradient. Filtering these zero variance prompts from the batch benefits asymptotic performance; see below. Notably, this approach is different from the dynamic sampling method proposed in DAPO [6], as we do not continue sampling prompts until the batch is full. Rather, we just filter zero-variance prompts from the batch, forming a smaller effective batch. By doing this, we avoid dampening the policy gradient signal, as we are averaging the policy gradient over a smaller effective batch instead of the full batch that includes prompts with no gradient.

(from [1])

Many data curriculum strategies have been proposed for RL training, but we learn in [1] that simple approaches can be quite effective. During training, the number of prompts that are solved easily by the current policy increases, and these prompts usually remain easy for the model throughout the rest of training. As shown above, dynamically removing these prompts from the training process improves asymptotic performance. To do this, authors maintain a history of pass rates for each prompt and permanently remove prompts that exceed a pass rate of 90%. This approach, called no positive resampling in [1], avoids wasting compute on prompts that the model already knows how to correctly solve.

(from [1])

The ScaleRL recipe, which combines all best practices identified in the smaller-scale experiments outlined above, uses the loss formulation shown above. As mentioned before, a PipelineRL setup, forced interruptions for reasoning, and a full precision language modeling head are also used for ScaleRL.

(from [1])

To validate this recipe, larger-scale experiments are performed with up to 16K GPU hours. Authors perform leave-one-out ablations by removing individual components of the ScaleRL recipe to determine if they still have an impact when used in tandem with other components. When fitting sigmoidal scaling curves up to 8K GPU hours, we see that extrapolated results accurately predict performance up to the end of the 16K GPU hour run. In these experiments, the full ScaleRL recipe is found to yield the best performance; see above. Not all components significantly benefit performance in the leave-one-out analysis, but authors argue that these design choices still tend to benefit training stability.

“Even when individual design choices appear redundant within the combined recipe, they often enhance training stability, robustness, or efficiency in ways that generalize across models and setups. ScaleRL retains such components not just for marginal gains in a specific configuration, but because they address recurring sources of instability and variance that arise across RL regimes.” - from [1]

Scaling up. Based on the analysis in [1], authors perform a final training run of ScaleRL up to 100K GPU hours, finding that the extrapolated performance continues to match actual performance in extended RL training runs; see below. Prior to this large-scale experiment, different methods for scaling up the RL training process (e.g., longer context, larger batch size, larger models, etc.) are considered in [1]. By analyzing these options in this extended run of ScaleRL, we learn the following:

The scaling laws proposed in [1] are also found to accurately extrapolate the performance of Mixture-of-Experts models, indicating generalizability to larger models with different architectures.
Using a longer context window during RL slows down training progress initially but yields higher asymptotic performance in the long run.
Increasing the batch size improves the asymptotic performance of RL and prevents stagnation on downstream benchmarks.
How we allocate the batch in terms of number of prompts and number of rollouts per prompt is less impactful—the total batch size matters most.

(from [1])

Key takeaways. The empirical analysis in [1] is extensive and contains a wide variety of practical details that are incredibly useful for those working on RL. For this reason, those who are interested in gaining a practical grasp of RL training should definitely read the full paper. However, the comprehensive empirical analysis presented in [1] can be largely summarized as follows:

Asynchronous RL (i.e., PipelineRL) with a split generator-trainer setup is highly efficient and yields models that perform well, so long as we bound the level of asynchronicity during training.
The proposed ScaleRL training recipe combines all of the practical GRPO modifications that were found to be useful across experiments in [1].
The performance ceiling of RL A can be impacted by changes to the RL setup (e.g., loss type or batch size). However, many common RL interventions (e.g., loss aggregation, data curriculum, or advantage normalization) impact compute efficiency B rather than asymptotic performance.
The methods that appear superior in smaller-scale RL runs do not always generalize to the high-compute regime. However, we can still identify the recipes that are most scalable by fitting a sigmoidal scaling curve and estimating scaling parameters A and B from early training dynamics. This approach is used constantly throughout the analysis in [1] to judge the scalability of RL recipes without performing full training runs (i.e., 16K-100K GPU hours).

Scaling Behaviors of LLM RL Post-Training [2]

(from [2])

In [2], authors investigate scaling behaviors of RL post-training using the full Qwen-2.5 model suite—both base and instruct models—ranging from 0.5B to 72B parameters. As in [1], this paper studies the impact of factors like model size, data volume, and compute on the performance of models trained with RL. However, this analysis focuses specifically on the mathematical reasoning domain, uses only the vanilla GRPO algorithm, and adopts a different scaling formulation. From the analysis in [2], we learn that RL follows a predictive power-law relationship between test loss and compute or data; see above.

Scaling formulation. The scaling law formulation in [2] fits a relationship between test loss—defined as the error rate (i.e., error rate = 1 - accuracy) on an in-domain validation set—and compute or data. As shown below, RL scaling behavior is modeled using a log-linear power law between the test loss L, model size N, and a resource budget X. Here, the resource budget can either be the amount of compute C or the amount of data D used during RL training.

Core scaling formula (from [2])

In the figure below, we plot this power law—using both log-log scale and linear scale to make interpreting the plots easier—for different values of the learning efficiency. We use a fixed value of E(N) = 1.0 in this plot for simplicity. As we can see, performance improves log-linearly as the resource budget X increases, and higher learning efficiency K(N) leads to a steeper decrease in test loss.

Plotting the scaling law with varying learning efficiencies

Performance extrapolation. As we might have inferred, the scaling formulation used in [1] is quite different from the pretraining scaling laws that we learned about before. More specifically, the scaling trends in [1] can only extrapolate the results of a specific training run to a higher compute regime—we are predicting what will happen if we continue the RL training process for longer. In contrast, the power law in [2] enables multiple extrapolation regimes:

Inter-model: fit the scaling law using data from training runs with smaller models (i.e., 0.5B to 32B Qwen-2.5 models) and predict the performance of a larger model (i.e., Qwen-2.5-72B).
Intra-model: fit the scaling law using the early training trajectory of a model and predict its performance for the remainder of training.

Both kinds of extrapolation are validated in [2] across base and instruct model variants of several sizes, demonstrating that RL training follows predictable scaling trends across model size N, compute C, and data volume D. Scaling plots shown in [2] always provide both inter and intra-model extrapolation results.

More on learning efficiency. In our above scaling law expression, we should notice that the learning efficiency term has a dependence upon N—learning efficiency follows a saturating trend with model size. Put simply, this means that i) larger models have higher learning efficiency, but ii) marginal efficiency gains begin to diminish with increasing model size. As shown in the plot below, this formulation matches empirical observations. In practice, learning efficiency follows a saturating S-curve—similar in structure to the scaling law formulation proposed in [1]6—that plateaus at a maximum learning efficiency of K_max.

(from [2])

Experimental setup. All experiments in [2] use vanilla GRPO—with a KL divergence term—and the verl training framework. Scaling laws are empirically fit and validated on results from over 60 models, including base and instruct variants from the Qwen-2.5 series with sizes ranging from 0.5B to 72B parameters. The model family is fixed to ensure that only parameter count N and data volume D are changing. RL training is conducted over 50K samples taken from the mathematics subset of guru-RL-92K, which performs extensive deduplication and difficulty filtering. Additionally, authors in [2] sort problems by increasing difficulty—as assessed by the pass rate of Qwen2.5-7B-Instruct—to form a data curriculum of increasing problem difficulty as RL training progresses. Following standard practice for fitting scaling laws, we compute test loss on an in-domain dataset of 500 held-out problems sampled from the training distribution.

Compute-constrained regime. The power law formulation in [2] can be used to characterize scaling behavior under a fixed compute budget. Given a compute budget C, we are interested in the optimal model size N that minimizes the test loss. In [2], the compute budget is estimated via cumulative training FLOPs C = 6 × N × T, where T is the number of tokens processed during training. T is related to the data volume D, but whereas D counts data samples, T measures the total token volume. T is inferred from fixed values of C and N. We can study the relationship between compute C and model size N by running RL training with various model sizes and compute budgets, then fitting scaling laws on the results. We use compute as our resource budget (i.e., X = C) for these scaling laws; see below.

(from [2])

In these plots, we can observe the results of inter-model (top plot) and intra-model (bottom plot) extrapolation using the scaling law formulation proposed in [2]. When studying scaling trends for smaller models (i.e., 0.5B to 32B parameters), we see that the best performance under a fixed compute budget is usually achieved by using the largest model. Larger models (i.e., 32B and 72B parameters) violate this trend: the 32B model performs best at lower compute budgets, but a crossover occurs at higher compute budgets after which the 72B model performs better.

“In contrast to the immediate dominance of larger models in smaller parameter regimes, the 32B model outperforms the 72B counterpart initially under equivalent compute budgets, as the smaller model size inherently enables more training steps. We believe this observation reveals a latent trade-off between model scale and training steps in compute constrained scenarios.” - from [2]

This crossover arises from the fact that learning efficiency k(N) saturates for larger models. Given a fixed compute budget C, a smaller model can train for a larger number of steps relative to a larger model. Therefore, the larger model must have significantly improved learning efficiency in order to outperform the smaller model. As we see in the scaling analysis above, this is true until we reach 72B scale, at which point efficiency gains saturate and the 32B model is able to exceed the performance of the larger model under tight compute constraints.

Data-optimal scaling. Given that LLM training is usually bottlenecked by the availability of high-quality data, we also want to understand the optimal model size for RL training given a fixed data budget D7. To do this, we can train models with various model sizes N and data budgets D, then fit the scaling laws—where data is our resource budget (i.e., X = D)—to these results; see below. The conclusion from this analysis is simple: for a fixed amount of data, larger models demonstrate superior sample efficiency and consistently achieve lower test loss. We also see that scaling laws accurately extrapolate performance in all regimes considered.

(from [2])

If we remove data and compute constraints (i.e., train models to convergence on sufficiently large datasets), test loss monotonically decreases with model size—bigger models are better given enough data and compute. However, this trend does not follow a power law; see below. Smaller models show weaker gains, indicating diminishing returns for training smaller models to convergence.

(from [2])

Data reuse. In addition to the scaling analysis, authors in [2] test if repeating data during training is problematic. These experiments fix the total data budget D_total but vary the number of unique data samples such that D_total = τ × D_unique, where τ is a data reuse factor. As shown below, we learn in [2] that performance is primarily determined by D_total rather than D_unique. In fact, test loss is relatively insensitive to τ, and we see that there is no significant degradation in performance until larger reuse factors (i.e., τ = 25).

However, unique data is not sampled randomly in these experiments. To ensure that data subsets are sufficiently diverse, authors partition the training set into difficulty subsets and preserve the data difficulty distribution across subsets of different sizes. Additionally, the same data curriculum is maintained by ordering data based upon difficulty. The robustness of RL training to data reuse is likely dependent upon the diversity, quality, and difficulty of the unique samples.

Optimally Scaling Sampling Compute for LLM RL [3]

(from [3])

Scaling laws can be applied to RL training in many ways. The work we have seen so far shows that performance during RL follows a sigmoidal trajectory [1] and scales in a predictable manner with model size under fixed compute budgets [2]. However, these results, while informative, do not directly recommend how we can practically allocate a fixed compute budget for RL training in a similar manner to pretraining scaling laws. Inspired by this, authors in [3] perform a prescriptive analysis of optimal compute allocations for RL. Specifically, the analysis in [3] focuses on understanding how to optimally allocate sampling compute—or the amount of compute spent generating completions for on-policy RL.

“We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps.” - from [3]

Sampling compute. The relationship between compute and performance is less straightforward in RL relative to pretraining. For both pretraining and RL, the training process involves a sequence of training—or model update—steps. At each pretraining step, a single forward and backward pass is performed. On the other hand, an RL training step includes multiple components:

Data collection: sampling completions from the current policy.
Optimization: updating the policy over collected data.

With this in mind, we can model the total compute cost of an RL training run as C = B_p × n × M, where B_p is the number of unique prompts per batch, n is the number of rollouts generated per prompt, and M is the number of steps taken during RL training. The analysis in [3] primarily focuses on compute spent on sampling completions (B_p and n) rather than sequential training steps (M).

Scaling laws for sampling. Given the compute footprint for RL outlined above, our goal is to better understand how varying the allocation of a fixed compute budget C_0 across the three factors B_p, n, and M impacts model performance. The scaling analysis is conducted in [3] by sweeping over settings of B_p ∈ {2^5, 2^6, …, 2^10} and n ∈ {2^3, 2^4, …, 2^11}, where both of these parameters are uniformly sampled via grid search on a log scale. Due to hardware constraints, a maximum effective batch size (B_p × n ≤ B_max) is also enforced.

From a given (B_p, n) setting, we can perform a single RL training run to capture increasing settings of M throughout training. Following scaling law best practices, model performance is evaluated during training by measuring reward on an in-domain validation set. The evaluation results in a training run are sub-sampled to only include record-breaking points—defined as points in the reward curve that exceed all prior rewards—along the learning trajectory; see below. By considering only record-breaking points when modeling an RL training run, we fit our scaling law to the frontier of the reward trajectory for this run.

(from [3])

To robustly identify record-breaking points in the reward trajectory, rewards are separated into discrete bins, and the first point at which the reward enters a new bin is selected as the record-breaking point. Once the cleaned reward trajectory is available, we model this single training run with a sigmoidal scaling law—similar to the scaling law formulation used in [1]. This gives us a collection of scaling laws for RL training curves with different settings of B_p and n, allowing the optimal configuration to be identified at each compute level from run-specific curves.

Building on these run-specific scaling laws, we can also fit a scaling law on the optimal settings identified for each compute level. Namely, we can use a similar sigmoidal scaling law to model how the optimal value of n varies according to our compute budget C, allowing us to extrapolate optimal training settings at higher compute budgets. In theory, the same approach can be used for B_p, but no clear pattern is observed in practice for the optimal value of B_p.

Experimental settings. The scaling analysis described above is conducted using Qwen2.5-7B-Instruct, Qwen3-4B-Instruct, and Llama3.1-8B-Instruct as base models. All RL training runs use binary outcome rewards and the vanilla GRPO optimizer. Guru-Math is used as the primary dataset and is split into easy and hard subsets by assessing the difficulty of each prompt—judged by the accuracy of the base model over 16 rollouts (i.e., Avg@16). The difficulty distribution is shown below with the easy and hard subsets shaded in blue and orange, respectively. Empirical scaling analysis is performed separately on both easy and hard data subsets in [3] to observe how difficulty distributions impact trends in scaling.

(from [3])

In order to make the RL training process stable, the correct regularization strategy is needed. Interestingly, we see in [3] that optimal regularization is difficulty-dependent. Authors consider adding both a KL divergence and entropy bonus to the RL training objective. On easy problems, the entropy bonus helps to prevent premature entropy collapse in the policy. However, using an entropy bonus on difficult problems can actually cause an entropy explosion by pushing the policy towards rare but successful reasoning trajectories, making it better to remove regularization entirely. As shown below, the following regularization strategy is found to yield the most stable results8:

Apply both the entropy bonus and KL divergence—which helps to delay entropy explosion—in tandem when training on the easy dataset.
Use no regularization when training on the hard dataset.

(from [3])

In addition to difficulty-dependent regularization, the learning rate must be increased with the batch size to ensure stable training. In particular, a square root scaling rule is used for the learning rate in [3], which increases the learning rate proportionally to the square root of the batch size; see below.

(from [3])

How should we allocate compute? The primary takeaway from the scaling analysis in [3] focuses upon the number of rollouts to sample (n) for each prompt in a batch. As the compute budget increases, the optimal setting of n increases as well, eventually saturating at higher compute budgets; see below. In other words, allocating increased compute towards sampling more rollouts per prompt yields better results compared to just training the model for longer. Interestingly, the exact scaling law also depends on the problem difficulty—smaller optimal values of n are observed when training on a harder dataset.

(from [3])

This trend holds for all base models across both easy and hard training datasets. Intuitively, scaling n has a different impact depending on the problem difficulty:

Sampling more rollouts on easy problems can sharpen performance (i.e., improve Avg@K) for problems that are already solvable and improve policy robustness by lowering the probability of an incorrect rollout.
Increasing the number of rollouts increases exploration and, in turn, aids in discovering rare solutions to difficult problems in order to improve the ratio of problems that can be solved (i.e., Pass@K) by the policy.

Interestingly, B_p has only a moderate performance impact when kept within a reasonable range and is found to primarily influence training stability.

Although the optimal setting of n scales with increasing compute, the exact shape of this scaling law—and the point at which it saturates—changes depending on the exact training setup. Therefore, while the scaling trends hold across different settings, the exact scaling parameters must be fit to the particular RL training setup being used. Practically, authors in [3] recommend the following approach for determining an optimal compute allocation in RL:

Execute RL training runs at lower compute budgets by varying the value of B_p and n but restricting the value of M.
Fit a scaling law, using the approach described above, from these results.
Infer the optimal value of n from the scaling law.
Choose the minimum value of B_p that yields stable training.
Invest the remaining compute budget into additional training steps M.

This approach provides us with a predictable process for extrapolating the optimal compute allocation for RL from lower-budget experiments.

Comparing RL and Pretraining Scaling Laws

We now have a detailed understanding of scaling laws for both pretraining and RL. However, one of the primary takeaways from this overview is the fact that a “scaling law” is completely different between these two domains. To close, we will briefly discuss the key ways that scaling laws differ between pretraining and RL, explain why these differences exist, and outline key takeaways from research on RL scaling that remain useful despite the overall messiness of RL.

Measuring performance. Pretraining scaling laws predict a particular metric: the cross entropy loss (or another related entropy metric) measured over an in-domain, held-out validation set. This performance metric is stable and is typically computed over a large, diverse dataset (i.e., a random sample of the pretraining corpus). Such a stable, diverse, and specific metric provides the perfect y-axis for fitting a scaling law and allows us to clearly define the impact of specific design decisions on the resulting model. RL scaling laws make an attempt to retain this robustness; e.g., performance is computed over an in-domain validation set. However, RL scaling laws typically use the reward (or accuracy) of the policy9 as the underlying metric to which scaling laws are fit. This is a downstream performance metric that can fluctuate substantially depending on the domain being studied, the benchmark being used, and the composition of data in that benchmark. As a result, scaling laws for RL tend to be more noisy and domain specific relative to those used for pretraining, which capture a more general trend in model performance.

Defining compute. Pretraining has a very clean compute footprint that is usually estimated with the number of training FLOPs C = 6 × N × D. This clean definition of compute provides an obvious x-axis for our scaling law. In contrast, RL compute is difficult to define due to the presence of both sampling and policy updates. The exact definition of compute used in RL scaling laws may change depending on the paper we are reading. For example, some papers derive a FLOP-like metric similarly to pretraining [3], while others rely on the number of GPU hours used [1]. Either way, the wall-clock time of RL training varies substantially depending on the framework being used, which means that the relationship between GPU hours and raw compute is not consistent. These factors must be considered when fitting a scaling law for RL because they cause the details of scaling laws to change depending on the exact setup being used.

Intra and inter-model extrapolation. Pretraining scaling laws are used to fit trends in performance across many model training runs with different settings to understand how model size, data volume, and compute impact the results of training. Such an approach allows us to cleanly extrapolate the results of costly training runs and use these predictions to reason about how compute should be optimally allocated. In RL, we actually fit two kinds of scaling laws that are used to extrapolate performance in different ways (i.e., inter-model and intra-model extrapolation). Inter-model extrapolation is the primary focus of pretraining scaling laws, whereas intra-model extrapolation is not usually addressed. The main reason intra-model extrapolation is necessary for RL is the sensitivity of the training process. In addition to understanding inter-model trends, we need to be able to predict whether a particular training configuration is viable or not.

Lack of standardization. The design space for RL algorithms is quite large: there are simply more “knobs” to tweak relative to pretraining. Additionally, we lack a comprehensive understanding of which design decisions meaningfully impact the scaling properties of RL. Although we have seen several papers that study the impact of design decisions on RL scaling, the findings from these papers—despite being informative—do not address the fact that scaling trends for RL are coupled to the exact training setup being used. Slight changes in the configuration for RL can completely change the scaling trends we observe. For this reason, most RL scaling laws are bespoke—the recommendations offered by one specific analysis may not hold in a different environment. As a result, findings can be difficult to replicate or extend, thus slowing scientific progress on the topic.

Practical takeaways. Despite the fact that RL scaling laws tend to be messy and bespoke, there are still several useful trends that we can learn from the papers in this overview:

The scaling behavior of RL is predictable within a given setup. Intra-model extrapolation works well and can be used to judge the viability of your setup during the early training phases. Inter-model extrapolation is also effective and can yield useful insights, though these insights may not always transfer across different training configurations.
The impact of design decisions on RL is not singular. Some decisions impact learning efficiency, while others impact asymptotic model performance. This distinction is important because degradations in efficiency can be solved by simply training for longer, while a degradation in asymptotic performance may not be trivially recoverable. Interestingly, many recent GRPO variants seem to primarily benefit learning efficiency and stability [1].
Using larger models yields consistently positive results in the RL scaling laws we have seen, though the presence of compute constraints can create interesting tradeoffs. When training with less data or compute, we may actually benefit from using a smaller model due to the fact that learning efficiency saturates with model size.
To invest more compute into RL, we can i) run training for more steps or ii) use more inference compute at each step. Interestingly, even though the compute cost of RL is dominated by inference, most scaling laws suggest that allocating more compute to sampling completions is helpful. RL training is surprisingly robust to data reuse, benefits from large batch sizes, and scales predictably as we sample more completions per prompt in a batch.

New to the newsletter?

Hi! I’m Cameron R. Wolfe, Deep Learning Ph.D. and Staff Research Scientist at Netflix. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on X and LinkedIn!

Subscribe now

Bibliography

[1] Khatri, Devvrit, et al. “The art of scaling reinforcement learning compute for llms.” arXiv preprint arXiv:2510.13786 (2025).

[2] Tan, Zelin, et al. “Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning.” arXiv preprint arXiv:2509.25300 (2025).

[3] Cheng, Zhoujun, et al. “IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL.” arXiv preprint arXiv:2603.12151 (2026).

[4] Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).

[5] Zheng, Chujie, et al. “Group sequence policy optimization.” arXiv preprint arXiv:2507.18071 (2025).

[6] Yu, Qiying, et al. “Dapo: An open-source llm reinforcement learning system at scale, 2025.” URL https://arxiv. org/abs/2503.14476 1 (2025): 2.

[7] Liu, Zichen, et al. “Understanding r1-zero-like training: A critical perspective.” arXiv preprint arXiv:2503.20783 (2025).

[8] Chen, Aili, et al. “Minimax-m1: Scaling test-time compute efficiently with lightning attention.” arXiv preprint arXiv:2506.13585 (2025).

[9] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL https://fengyao.notion.site/off-policy-rl.

[10] Chen, Aili, et al. “Minimax-m1: Scaling test-time compute efficiently with lightning attention.” arXiv preprint arXiv:2506.13585 (2025).

[11] Piché, Alexandre, et al. “Pipelinerl: Faster on-policy reinforcement learning for long sequence generation.” arXiv preprint arXiv:2509.19128 (2025).

[12] Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).

[13] Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).

[14] Hoffmann, Jordan, et al. “Training compute-optimal large language models.” arXiv preprint arXiv:2203.15556 10 (2022).

[15] Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).

Although it is true that GRPO is dominant in open research, it is probable that closed frontier labs are using different algorithm variants.

For example, Olmo 3 uses a total batch size of either 512 or 1,024 for RL training with 8 rollouts per prompt and either 64 or 128 prompts per batch.

For example, Olmo 3 mentions that models use 5-14× more compute for inference compared to policy updates during RL training.

More details on why this particular KL divergence term was adopted for GRPO can be found in this discussion of the DeepSeekMath paper.

Performance is measured as an average pass rate computed over 16 generations per prompt over a validation set of 1,000 prompts. Validation performance is measured after every 100 RL training steps.

The main difference between these formulations is the fact that the S-curve used for the learning efficiency in [2] has a fixed steepness exponent of B = 1.

Here, the value of D corresponds to the number of unique examples in the dataset.

Authors note in [3] that scaling law trends are robust to the regularization strategy—proper regularization only helps to keep training stable.

Several different variants of accuracy can be used as well; e.g., Pass@K or Avg@K.

The Anatomy of an LLM Benchmark

Cameron R. Wolfe, Ph.D. — Mon, 30 Mar 2026 09:33:10 GMT

(from [2, 3, 4, 10, 12])

Throughout the history of AI research, progress has been measured—and accelerated—by high-quality benchmarks. AI is an empirical field that is driven by discovering interventions that improve performance on key benchmarks. For large language models (LLMs) in particular, creating useful benchmarks is hard due to rapidly advancing model capabilities. Tough evaluations are regularly saturated as new models are released, creating the need for continual evolution toward harder problems and new dimensions of performance. Despite the pivotal role of benchmarking in driving progress, evaluation has traditionally received less attention compared to core modeling research. Additionally, creating high-quality benchmarks requires unique skills that are emphasized less heavily in the literature. This overview aims to solve these problems by providing an extensive survey of useful LLM benchmarks and the techniques—including both practical tricks and more recent directions of research—used to create them.

Disclaimer. Agent and coding benchmarks are notably absent from this overview. These domains are rapidly advancing and require unique evaluation techniques that have led to the creation of completely new areas of research in LLM evaluation. Due to their depth, these topics will require an overview of their own, and several useful resources on these topics are already available.

Dissecting Popular LLM Benchmarks

The best way to understand how LLM benchmarks are created—and how we can create a useful benchmark for our own task of interest—is to simply study details of the most popular and effective LLM benchmarks. In this section, we will select a wide variety of LLM benchmarks, including both recent benchmarks and those that have been around for a while, and outline the following characteristics:

How the data is sourced
How data quality is ensured
How model performance is measured
How each benchmark has evolved as models have improved

Admittedly, this section is far from comprehensive—a vast number of LLM benchmarks exist, and surveying them all would be impossible. Instead, this section optimizes for diversity and aims to provide a wide view of the different kinds of benchmarks that exist and the various strategies that are commonly used to create useful evaluation datasets across these many different domains.

Massive Multitask Language Understanding (MMLU) [1]

“To succeed at our test, future models should be well-rounded, possess extensive world knowledge, and develop expert-level problem solving ability. These properties make the test likely to be an enduring and informative goalpost.” - from [1]

MMLU is one of the most widely used general knowledge benchmarks for LLMs. The data curation strategy for MMLU is simple: questions are sourced from freely available online sources and manually curated by graduate and undergraduate students. The benchmark contains ~16K questions divided into 57 subjects1 that span various topics like STEM, humanities, social sciences, and more. The full MMLU benchmark contains a development set of five examples per subject (i.e., used for few-shot prompting), a validation set of 1.5K questions, and the main test set. For each task, we have a minimum of 100 questions in the test set.

Data format. The questions within the MMLU benchmark use a multiple choice format, and models are evaluated using a zero or few-shot prompting strategy. Authors of the benchmark specifically avoid open-ended generation due to the increased evaluation complexity. Multiple choice correctness can be validated with string matching, allowing MMLU to be evaluated using accuracy. Several example questions from MMLU are provided below for reference.

(from [1])

Difficulty. Some subjects are separated into sub-tasks based on their difficulty level. More specifically, MMLU defines subjects at elementary, high school, college, and professional levels, where difficulty is inferred from the source of the questions. For example, the professional subset of the Psychology domain pulls from the exam for professional practice in Psychology, whereas the high school subset pulls from advanced placement exams (i.e., tests for high school students). Notably, not all subjects have a task for each difficulty level.

“Human-level accuracy on this test varies. Unspecialized humans from Amazon Mechanical Turk obtain 34.5% accuracy on this test. Meanwhile, expert-level performance can be far higher. For example, real-world test-taker human accuracy at the 95th percentile is around 87% for US Medical Licensing Examinations… We estimate that expert-level accuracy is approximately 89.8%.” - from [1]

As we might expect, human-level accuracy on MMLU varies significantly based on the human, domain, and level of difficulty being considered. Given that MMLU is still popular even today, several extensions have been proposed (e.g., MMLU-Pro [2] and MMLU-Redux [3]) to diagnose quality issues and to keep the benchmark from becoming saturated by newly-released LLMs over time.

“[Benchmark performance] has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU.” - from [2]

MMLU-Pro. We learn in [2] that MMLU has a non-negligible ratio of easy (i.e., knowledge-only or low reasoning) questions, as well as some questions that are flawed or incorrect. To avoid saturation and reduce noise, MMLU-Pro [2] reconstructs the benchmark in order to make it more accurate, difficult, and discriminative. The 57 subjects from MMLU are consolidated into a set of 14 broader domains, and the majority of easy questions are removed from MMLU-Pro using model-based difficulty filtering. A pool of eight models is tested on each question, and any question that the majority of models answer correctly—5,886 questions in total—is removed. From here, the remaining MMLU questions are supplemented with harder questions from sources like TheoremQA and SciBench, yielding a final benchmark of ~12K questions; see below.

(from [2])

For new data sources, questions are converted into a multiple choice format by asking GPT-4-Turbo to extract a correct answer and generate distractor answers. The result of this process is manually verified by asking human annotators to compare extracted answers to the original solution for each question. To reduce the impact of random guessing, the number of choices for each question is also expanded from four to ten—this is referred to as “option augmentation” in [2].

After data filtering and curation, MMLU-Pro undergoes an extensive quality control phase with multiple stages of verification by humans and LLMs. The quality control process aims to identify bad questions, incorrect answers, and false positive distractors. Human validation is performed first, then Gemini-1.5-Pro flags any remaining issues for a second stage of human review.

(from [2])

The full curation pipeline for MMLU-Pro is depicted above. MMLU-Pro still uses accuracy as the main performance metric, though we can also separately examine accuracy within each specific domain. Most LLMs perform worse on MMLU-Pro relative to MMLU—the benchmark is more difficult and has headroom before saturation—and model capability gaps tend to be more noticeable. We also see in [2] that MMLU-Pro offers improved prompt stability and benefits from advanced reasoning techniques (e.g., chain of thought prompting).

(from [3])

MMLU-Redux. An in-depth quality audit of the MMLU benchmark is performed in [3] over a subset of 100 questions randomly sampled from each MMLU task (i.e., 5,700 questions in total). Quality issues are categorized using a hierarchical error taxonomy; see above. This taxonomy contains five error categories that are used to granularly categorize questions with poor quality or incorrect ground truth. When necessary, questions are re-annotated and verified according to the original source material or, when the original source is absent, a trusted source (e.g., government websites). We see in [3] that an estimated 6.49% of MMLU questions contain errors, but the ratio of errors varies between subjects; e.g., 57% of Virology questions were flagged due to quality issues; see below.

(from [3])

The result of this sampling and re-annotation procedure is MMLU-Redux, a subset of 5,700 manually inspected MMLU questions. For several high-error subjects, authors monitor agreement across three separate annotators using Cohen’s Kappa. Re-annotation agreement is found to be strong even on difficult subjects, providing confidence in the quality of the human-audited data. The aim of this effort is not to produce a harder version of MMLU but rather to audit (and fix or discard) existing questions for quality and accuracy—MMLU-Redux is an updated subset of MMLU that can be adopted for more reliable evaluation.

We see in [3] that removing incorrect evaluation data meaningfully impacts performance and model rankings; see below. For example, Llama-3.1-405B improves from 16th to first in rank for Virology and Qwen-2-72B-Instruct drops from first to eighth place for College Chemistry when only evaluating on correct instances from MMLU-Redux—these results suggest improved reliability.

(from [3])

GPQA: A Graduate-Level Google-Proof Q&A Benchmark [4]

GPQA is another popular LLM benchmark that takes a different approach from MMLU. Namely, GPQA is a much smaller dataset: the extended version contains 596 questions, while the main and diamond subsets contain 448 and 198 questions, respectively. Rather than providing broad coverage, GPQA focuses on curating a small number of expert-verified questions that are difficult to solve even with internet access (i.e., a “Google-proof” benchmark). Three primary domains are covered—Biology, Chemistry, and Physics—each of which is divided into several sub-domains2. Similarly to MMLU, however, GPQA does adopt a multiple choice question format with four answers per question.

“We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”).” - from [4]

Expert curation. The data from GPQA is manually curated by a group of 61 human experts that each have—or are pursuing—a PhD in a relevant field. The data curation pipeline for GPQA is depicted below. To begin, experts in each domain write a set of candidate questions. These questions are written from scratch, rather than being collected from existing exams or datasets. As a guiding principle, experts are specifically asked to write questions that are:

Difficult.
Answerable by experts in the same domain.
Not possible for non-experts to answer, even with internet access.

Questions are always written such that they can be answered with or without choices being presented, thus enabling GPQA to be easily extended to an open-ended generation format in the future. In addition to writing each question, a written explanation is provided for both the answer and all distractors.

(from [4])

After a question is written, two separate domain experts interact with it. The first expert solves and validates the question, then suggests possible revisions. After the writer revises the question based on suggestions, a second domain expert answers the revised question. Finally, three different non-expert validators—selected from the group of experts for other, non-overlapping domains—try to answer the question with unrestricted internet access, spending a minimum of 15 minutes and nearly 40 minutes on average answering each question.

“The process consists of four main stages: question writing, expert validation, question revision, and non-expert validation.” - from [4]

Verification principles. The GPQA curation process validates both correctness and difficulty. Correctness is handled via expert validation and revision, while difficulty is assessed based on the ability of non-experts to solve questions. The results of these two stages are used to define the different subsets of GPQA:

GPQA Extended: full dataset (546 questions).
GPQA Main: questions where at least one expert agrees with the answer and at most two non-experts answer the question correctly (448 questions).
GPQA Diamond: questions where both experts agree with the answer and at most one non-expert answers the question correctly (198 questions).

As shown below, the resulting subsets are quite difficult, with experts achieving around 70-80% accuracy and non-experts a much lower accuracy of 30-40%.

(from [4])

Beyond the Imitation Game Benchmark (BIG-Bench) [5]

“BIG-bench… includes a set of 204 or more language tasks. As reflected in the BIG-bench review criteria, benchmark tasks are novel, cover a diverse range of topics and languages, and are not fully solvable by current models.” - from [5]

BIG-Bench explores a community-based strategy for curating difficult LLM evaluation tasks. The benchmark was openly constructed on Github, where researchers were asked to contribute tasks by creating a pull request. Each task was then manually reviewed in a corresponding PR discussion according to detailed submission criteria; e.g., correctness, difficulty, decontamination, and justification (i.e., why is this an important task for LLMs to solve?). The version of BIG-Bench outlined in [5] contains 204 tasks that were curated by 405 authors. The set of included tasks is incredibly broad, covering topics like math, coding, reasoning, science, and more; see here for a summary of task domains.

Task interface. Unlike the benchmarks we have seen so far, BIG-Bench does not have any unified data format—tasks have varying formats ranging from multiple choice to open-ended generation and multi-turn (interactive) chat. In order to handle the diversity of tasks present in BIG-Bench, authors introduce a standard API structure that is used by all tasks. This API specifies two task types:

JSON: defined by a JSON file containing a list of input-output examples.
Programmatic: defined by a Python function that can interact directly with the model over multiple chat turns and compute custom metrics.

By using these standardized structures for all tasks, we can easily evaluate any public model or onboard new tasks with minimal implementation changes. The distribution of BIG-Bench tasks follows an 80-20 split between JSON and programmatic task types. In programmatic tasks, we interact with the model via two standard functions:

generate_text: generate a text continuation from the model.
cond_log_prob: compute log probabilities of a target given input.

The model can be queried multiple times within a programmatic task, enabling support for multi-turn chat or iterative tasks within BIG-Bench. Each task must have a minimum of 32 evaluation samples, though authors are encouraged to create much larger tasks; see below for a distribution of task sizes.

(from [5])

Performance metrics. Given that BIG-Bench tasks follow a variety of formats, we cannot evaluate all tasks with a unified performance metric like accuracy. Instead, a suite of standard metrics is provided for all tasks, and programmatic tasks are even allowed to define their own custom metrics. In [5], authors list the following performance metrics as being used in BIG-Bench:

Exact String Match.
Multiple Choice Accuracy.
Text Similarity Metrics (e.g., BLEU, BLEURT, or ROUGE).
Multi-Category Brier Score: evaluates the calibration—a measure of how well confidence3 aligns with observed correctness—of a model’s outputted probabilities on options for a multiple choice question.
Expected Calibration Error: another calibration metric that measures how well the model’s accuracy matches the probability assigned to a response in the multiple choice setting.

Interestingly, BIG-Bench even allows multiple evaluation metrics to be defined per task, but one metric must be defined as the primary metric. Additionally, each task must specify a high and low reference score on the primary metric. Using this information, we can normalize each task’s preferred metric using the high and low reference scores. Then, we can compute aggregate performance over the entire benchmark by averaging normalized metrics across tasks—this approach summarizes benchmark performance with a single score in the range [0, 100].

(from [5])

As shown above, all models at the time of BIG-Bench’s proposal performed well below human baseline performance; see above. Although performance improves with model scale, all models perform poorly in an absolute sense, indicating that the benchmark was quite difficult to solve for models at that time. Human performance metrics in the above plot—reported as both a max and mean score across multiple annotators—were collected using a team of expert annotators that were given full internet access. However, properly measuring human performance is difficult given the breadth of tasks present in BIG-Bench.

“While we report mean and max human rater scores for all tasks evaluated by raters, care must be taken when interpreting these metrics. We do not claim that these scores are the best possible achievable by a human, or even that these scores are the best achievable by these particular evaluators… For example, if a task requires knowledge of programming, how do we weight scores of evaluators who do not know how to program?” - from [5]

BIG-Bench Lite. The size and breadth of BIG-Bench makes it computationally expensive to run. To solve this, authors in [5] provide a smaller task subset, called BIG-Bench Lite, to use for faster evaluation. This subset is made up of 24 JSON-style tasks that are chosen via a manual selection process that considers task diversity and inclusion of specific task types (e.g. coding or non-English tasks).

BIG-Bench Hard (BBH). Less than a year after the release of BIG-Bench, LLMs had already begun to surpass average human performance on the majority of tasks. BIG-Bench Hard [6], a difficult subset of the BIG-Bench dataset, was created in response to these quick improvements in capabilities. The steps used to select the tasks within BIG-Bench Hard are outlined in the table below.

(from [6])

All tasks in BIG-Bench Hard are derived from BIG-Bench. Initially, tasks are filtered according to several heuristics; e.g., not containing too many subtasks, having too few evaluation examples, or using evaluation metrics beyond multiple choice or exact match accuracy. Any task without a human performance baseline is also removed, and the remaining task subset is further refined by only retaining tasks where models underperform humans. From here, tasks are then manually inspected to remove any tasks that are overly difficult or out of scope4, leaving us with the final set of 23 tasks in BIG-Bench Hard; see below.

(from [6])

Despite focusing on a much smaller set of difficult tasks—about 10% of the original benchmark—that have a standard format, BIG-Bench Hard is mostly able to maintain the breadth of BIG-Bench. The tasks present in BIG-Bench Hard can be roughly categorized into natural language (e.g., detecting translation errors or recommending movies) and algorithmic (e.g., evaluating boolean expressions or performing multi-step arithmetic) tasks. When examining model performance on BIG-Bench Hard, we see that the models considered in [6] usually surpass average human performance but fall short of the best performance of a human. However, the best LLMs today achieve almost perfect accuracy on BIG-Bench Hard.

Given that BIG-Bench is constructed as a community effort, benchmark tasks have a high level of variance—cleanliness and quality fluctuate, and each task may have different metadata. Tasks are selected based on both quality and difficulty by using a combination of heuristics and manual inspection. Additionally, BIG-Bench Hard restricts the benchmark to tasks that use an exact match or multiple choice format. This choice is made to simplify the analysis of chain of thought prompting by enabling the use of a unified prompt format across different tasks. In this way, BIG-Bench Hard does not solely maximize difficulty—it identifies a subset of hard tasks that also work well with chain of thought prompting.

(from [6])

As shown above, several top models at the time of release for BIG-Bench Hard noticeably underperform the average human baseline. This gap can be closed in many cases via chain of thought prompting, but benchmark performance still falls short of maximum human performance for even the largest models.

BIG-Bench Extra Hard (BBEH). The BIG-Bench family is one of the few reasoning-focused evaluation suites that prioritizes general reasoning rather than math and coding. However, both BIG-Bench and BIG-Bench Hard were saturated by early 2025, with top reasoning models achieving nearly perfect scores on both benchmarks. As a solution, BIG-Bench Extra Hard was created by replacing each of the BIG-Bench Hard tasks with a corresponding task that tests a similar category of reasoning capabilities but is significantly more difficult.

“BIG-Bench Extra Hard replaces each task in BIG-Bench Hard with a novel task that probes a similar reasoning capability [with] increased difficulty.” - from [5]

Examples of new reasoning skills tested by BIG-Bench Extra Hard include many-hop reasoning, long context reasoning, properly handling distractors, finding errors in reasoning traces, reasoning under constraints, and more. To perform well on BIG-Bench Extra Hard, models must command a breadth of different reasoning capabilities. An itemized list of the reasoning tasks present in BIG-Bench Extra Hard is provided in the figure below. Each task matches the general reasoning domain of some corresponding task in BIG-Bench Hard, ensuring that the diversity of BIG-Bench Hard is preserved while increasing task difficulty.

(from [7])

As seen in the middle column of the table, tasks in BIG-Bench Extra Hard are sourced from a variety of existing reasoning benchmarks and manually chosen according to their topic and difficulty. When curating the benchmark, authors aim to solve the following known issues with BIG-Bench Hard:

Many tasks have high random chance performance due to the presence of multiple choice questions with a small number of options (e.g., ~35% of tasks have binary output and ~20% of tasks use multiple choice with <5 options).
Some tasks permit shortcuts that allow the task to be “solved” without actually reasoning through a proper solution.
Task inputs tend to be very short—around 700 characters on average—across BIG-Bench Hard tasks, which is unrealistic compared to how LLMs are typically used in practice.
True multi-hop reasoning is rarely tested in BIG-Bench Hard due to limitations in LLM capabilities when the benchmark was created.

Ideally, we would like to solve all of these issues while expanding the set of reasoning capabilities being tested by the benchmark. BIG-Bench Extra Hard tasks contain 200 questions—except for DisambiguationQA, which has only 120. Although the task selection process was mostly manual, data was curated using a combination of manual human inspection with model assistance. Two models are used—a general purpose model and a reasoning model (both Gemini-based)—to iteratively evaluate data that is selected for each task. Tasks that were easily solved by the reference models were either i) discarded and replaced with more difficult tasks or ii) enhanced with harder reasoning examples. This process continued until both models achieved an accuracy below 70% on each task.

“In most cases, we tried to use the reference models only as a black box that provided feedback on the difficulty of our tasks. In some cases, however, making tasks more difficult required looking into the approach adopted by the model.” - from [7]

The combination of human and model oversight in BIG-Bench Extra Hard is interesting and provides motivation for unique ways in which humans can interact with LLMs to curate better evaluation data. For example, authors in [7] even mention manually inspecting reasoning traces from the models to help them think of more difficult examples that would actually challenge the model. Tasks in BIG-Bench Extra Hard have significantly expanded context compared to the prior benchmark, have negligible random chance performance, and provide a lot of headroom in performance even for top models (e.g., o3-mini); see below.

(from [7])

IFEval [8] and IFBench [9]

(from [8])

The IFEval [8] benchmark tests LLM instruction following capabilities, with an emphasis on instructions that are objectively verifiable (i.e., as opposed to instructions that are more subjective). For example, if we instruct an LLM to generate an output containing 100 to 200 words, we can easily verify whether this instruction was followed by using a basic script. However, verifying whether an LLM obeys a certain tone specification in its output is less straightforward.

“The task of precise instruction following evaluates a language model’s ability to perform a task t, such as summarization or creative writing, while adhering to one or more output constraints c, which can be automatically verified.” - from [9]

To start, 25 instructions—structured as verifiable constraint templates for the model’s output—are manually curated based on practicality and verifiability; see below.

(from [8])

From these instructions, evaluation samples are curated as follows:

Create a set of base prompts5.
Combine these base prompts with one to three randomly selected verifiable instructions by concatenating instructions to the end of the prompt.
Use few-shot prompting and manual inspection to identify instruction combinations that are illogical or contain conflicts.
Use few-shot prompting to rephrase each prompt and, in turn, improve the diversity of instructions in the benchmark.
Manually review all rephrased prompts.

Exact details of the data curation process are not fully outlined in [8]. However, we know from the information provided that a model-in-the-loop approach is used with manual human review to ensure quality. To measure performance, a binary verification check is created for each instruction that can be used to determine if a model followed an instruction or not. Instruction-level binary verification signals can be used to compute the following strict metrics:

Instruction-level strict accuracy: the percentage of all individual instructions that the model follows.
Prompt-level strict accuracy: the percentage of prompts for which the model follows all instructions.

Additionally, several loose metrics are considered in [8] that perform verification under a variety of transformations to the model output (e.g., removing markdown and removing the first or last lines). After applying a transformation, we can compute instruction and prompt-level accuracy similarly to before, resulting in a loose version of each metric. An instruction is considered solved if it passes verification after any of the possible transformations that are tested.

“The new constraints we introduce were created manually – sourced by collecting feedback from LM users beyond the authors on the types of constraints they have tried with models, or manually written to cover core instruction following skills. Then, we filtered constraints for the benchmark to those that can be easily paired with a verification function written in Python, making for reproducible evaluation and training tools.” - from [9]

The IFEval benchmark only tests 25 instructions and, therefore, risks overfitting to a small set of constraints. As a solution, IFBench [9] proposes an expanded set of 58 verifiable, manually-curated constraints. When deriving new constraints, authors i) inspect feedback from LLM users on instruction following issues, ii) focus on core areas of instruction following, iii) emphasize difficult constraints, and iv) only use constraints that can be verified with a Python function. Going further, an additional set of 29 constraints (IFTrain) are provided for training purposes. These training constraints can be used for RLVR training, enabling investigation into the generalization properties of instruction following.

The 58 constraints in IFBench are grouped into seven categories—count, ratio, words, sentence, format, custom, and copy—that cover a broad range of instruction following skills. To create prompts for these instructions, authors take unseen prompts from WildChat and combine them with either one or two constraints from the expanded set. Every test prompt is manually inspected by a human annotator to ensure constraint compatibility, and the final benchmark consists of 300 total prompts. As shown below, performance on IFBench is noticeably lower than on IFEval, indicating some level of overfitting to specific constraints.

(from [9])

Authors in [9] provide a potential reason for the overfitting to IFEval constraints. Many LLMs have curated training data that specifically targets instruction following capabilities. Most of this training data is synthetically generated because precise instruction following can be deterministically verified. Given the popularity of IFEval, model developers often adopt the same constraint taxonomy when generating synthetic instruction following data; see Nemotron-4 340B as an example. As a result, some models may be explicitly trained to follow the same constraints being tested by IFEval, leading to inflated performance metrics6.

AlpacaEval [13]

Judge prompt from AlpacaEval (source)

AlpacaEval is a pairwise instruction following benchmark that measures model performance by using an LLM judge to compare candidate model completions to those of a baseline model; see above. The most recent version of AlpacaEval uses GPT-4-Turbo as both the baseline and judge model. The data used in AlpacaEval is sourced from the earlier AlpacaFarm dataset, which contains a total of 805 prompts derived by combining the evaluation sets from:

Despite the variety of data sources, most of this data is curated using a similar approach. For example, Self-Instruct proposes a synthetic data generation strategy for instruction tuning, but prompts from the evaluation dataset for Self-Instruct are manually written by human experts. Similarly, Anthropic helpfulness is a human preference dataset, while the Vicuna and Koala test sets are manually curated by researchers working on the projects. The only outlier of these evaluation sets is Open Assistant, which is derived from crowdsourced human conversations with an LLM, rather than being curated by experts.

“AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, and reliable. It is based on the AlpacaFarm evaluation set, which tests the ability of models to follow general user instructions. Responses are compared to reference responses by the provided GPT-4 based auto-annotators [to compute a win rate]. AlpacaEval displays a high agreement rate with ground truth human annotations.” - AlpacaEval

After the initial release of AlpacaEval, several follow-up versions of the benchmark were published, but the underlying evaluation data did not change much. Instead, subsequent improvements to AlpacaEval focused on changing the reference and judge models to improve the benchmark’s correlation with human preferences. Full code and updates to AlpacaEval can be found here.

Math Evaluation

Many evaluation datasets exist in the math domain, and most of them are either i) expert-curated or ii) drawn from test banks for math competitions. For example, GSM-8K contains 8.5K human-written grade school math problems, while MATH contains 12.5K questions compiled from high school math tests. Additionally, the American Invitational Mathematics Examination (AIME), which is commonly used to evaluate LLMs, is released every year with a set of 15 new questions. Questions from the American Mathematics Competitions (AMC) are also commonly used for LLM evaluation. Solutions to questions in these benchmarks are usually graded with an automatic verifier or exact string matching.

The benchmarks outlined above have been saturated by modern LLMs, but many frontier-level math benchmarks have been recently proposed:

FrontierMath contains hundreds of expert-crafted problems at the cutting edge of mathematical research that require hours or days to be solved by an expert-level researcher.
RealMath is a continuously-evolving benchmark that automatically updates with new problems derived from research papers and discussion forums.
MathArena is an evolving benchmark that evaluates LLMs on math competition problems soon after their release to avoid contamination risk.
OmniMath contains 4.5K competition-level math problems that have been annotated by human experts, covering a diverse range of topics (i.e., over 30 sub-domains) and difficulty levels.

Solutions to questions in these benchmarks are still commonly evaluated with automatic verifiers, but this is not always the case. For example, proof-based questions in MathArena are manually checked by human experts. Despite the impressive math capabilities of modern LLMs, most of these frontier-level math benchmarks have not yet been fully saturated. However, LLMs are advancing rapidly in their capabilities, so several of these datasets are designed in a way that enables continual evolution in order to avoid contamination and saturation.

Iteratively Improving a Benchmark

When studying the benchmarks outlined above, we see several examples of iterative benchmark refinement. Benchmarks become saturated and less informative over time, which is usually addressed by releasing an improved benchmark. To create such an improved benchmark, there are several common techniques and directions that are usually followed, such as:

Difficulty-based refinement: curating more difficult tasks or data to use for evaluation within a benchmark.
Quality-based refinement: identifying and fixing issues in the benchmark (e.g., mislabeled data, vague or unrealistic questions, poor format, etc.).
Diversity-based refinement: expanding the scope of questions and topics covered by a particular benchmark.

Usually, these directions of improvement are handled via manual human review, a model-in-the-loop approach, or some combination of both. In some cases, we can even design a benchmark in a way that continually evolves over time without too much manual effort (e.g., RealMath and MathArena). However, the range of techniques that can be used for iterative benchmark improvement is vast—there is a lot to learn in this area. To provide pointers for future learning, a set of useful resources for benchmark improvement is listed below:

Do Large Language Model Benchmarks Test Reliability?: corrects labeling errors in common LLM benchmarks to better measure LLM reliability.
Improving Model Evaluation using SMART Filtering of Benchmark Datasets: a framework for systematically identifying and filtering evaluation data that is too easy, similar to other questions, or possibly contaminated.
From Crowdsourced Data to High-Quality Benchmarks: an LLM-based approach for post-processing crowdsourced data into high-quality evaluation samples.
Reliable and Efficient Amortized Model-based Evaluation: a model-based approach for difficulty filtering and difficult question generation.
Evidence-Centered Benchmark Design for NLP: an evidence-backed framework for properly designing evaluation benchmarks.
Evaluation Guidebook (from Hugging Face): a practical field guide for evaluating LLMs, assessing benchmark quality, and curating evaluation data.

There are also many papers that have been proposed for optimally selecting subsets of benchmark data to improve efficiency [14, 15, 16, 17].

Advanced Benchmarking for LLMs

Now that we understand practical details for constructing LLM benchmarks, we will take a deeper look at some advanced techniques for LLM evaluation that have been proposed in recent research. Specifically, we will focus on a set of papers that use Item Response Theory (IRT) to select the most informative data for evaluation. Coming from the field of psychometrics, IRT uses statistical modeling to dynamically measure how an individual’s latent abilities interact with the properties of an item (or question) to determine the probability of a correct response. Although IRT is commonly applied in standardized testing environments, the same concepts have been adopted by LLM researchers. We can directly apply techniques from IRT to LLM evaluations by considering the LLM as our individual and the evaluation dataset as our standardized test!

In the context of LLM evaluations, IRT considers a model l, dataset items i, and the probability p_il that model l gets item i correct. We can use a variety of different models—usually just different variants of logistic regression—to predict this probability. IRT models include parameters for both the model and the item being evaluated. Whereas model parameters capture the capabilities of a given model, item parameters capture the following properties:

Difficulty: whether the item is easy or difficult to answer correctly.
Discrimination: whether answer correctness has a strong relationship with the capability level of a model.

By capturing these properties within our IRT model, we gain a rich description of our evaluation data that can be directly applied to benchmark improvement. For example, items with low discrimination are often problematic (e.g., due to mislabeling), and we can consider filtering out items that are too easy from the evaluation process. Within this section, we will see several IRT formulations that demonstrate a broad set of potential applications to the evaluation process.

tinyBenchmarks: Evaluating LLMs with Fewer Examples [11]

“Evaluating the performance of a single LLM on HELM costs over 4K GPU hours (or over $10K for APIs). Benchmarks like AlpacaEval also require a commercial LLM as a judge to perform evaluation, further increasing the costs… evaluation of a single model is often performed many times to… explore different prompting strategies or a wider range of hyperparameters.” - from [11]

To mitigate excessive inference costs during evaluation, an IRT-based approach called tinyBenchmarks is proposed in [11] that intelligently samples evaluation data in a way that maintains the accuracy of a model’s performance metrics. We assume access to a dataset of historical evaluation results that can be used for selection and performance estimation. More specifically, this dataset contains items i and models l, where each item and model combination has a binary score Y_il ∈ {0, 1}. We can also handle continuous evaluation results in the range [0, 1]—nearly any evaluation setting can be converted into this format by normalizing scores—by simply binarizing scores according to a fixed threshold.

Baselines. There are a few simple and effective approaches that can be adopted to sample a subset of data from an evaluation dataset:

Stratified random sampling: ensure proportional representation across benchmark sub-domains by randomly sampling a subset of evaluation samples separately within each subdomain.
Correctness-based clustering: sample evaluation data based on patterns in correctness by representing each item i as a vector of correctness scores for each model l, performing K-means clustering on these vectors, and selecting the evaluation samples closest to each cluster centroid.

Despite their simplicity, these techniques have notable drawbacks. Stratified sampling leads to high variance and uncertainty when the number of samples is small, while correctness-based clustering tends to suffer from the curse of dimensionality if we have evaluation results from a large model pool.

IRT model. In [11], IRT is used to derive a much smaller representation of our evaluation data that can be more effectively used to both select samples and estimate performance. We define item i using two parameters:

α_i: captures the skills required to solve item i.
β_i: captures the overall difficulty of item i.

Similarly, we describe model l with the parameter θ_l, which captures model capabilities. From here, we define a multidimensional IRT model, which predicts the probability p_il that item i will be answered correctly by model l; see below. We can fit the IRT model—or learn the correct values for all of the model and item parameters—by using our historical evaluation dataset as training data.

Two parameter multidimensional IRT model (from [11])

As we can see, the center point of this equation is the inner product of the item and model parameter, which captures how well the capabilities of a model match those needed for an item. Intuitively, a model is more likely to answer an item correctly if it has strong capabilities in the same directions required to solve an item and vice versa. Additionally, we add an extra bias term to this inner product to account for overall item difficulty before passing the full expression through a sigmoid (or logistic) function to yield a probability in the range [0, 1].

“The IRT model creates a meaningful representation for each example i based on their difficulty and the abilities required to respond to those examples correctly. This approach immediately solves the dimensionality problem, since E_i is low-dimensional… IRT should represent which examples have similar difficulty and require similar abilities.” - from [11]

Once fitted, parameters of the IRT model naturally provide a d + 1 dimensional vector E_i = (α_i, β_i) that can be used to represent items in our evaluation dataset. This representation is low dimension (d < 16 in [11]) compared to vectors used for correctness-based clustering, thus solving issues related to the curse of dimensionality. The IRT model is used in two ways in [11]:

To perform cluster-based sampling, similarly to correctness-based clustering (but with embeddings from the IRT model E_i).
To predict model performance over items—this is more efficient than actually running the evaluation itself.

p-IRT estimator. In [11], the two approaches described above are used in tandem to efficiently estimate model performance on an evaluation set. Assume we want to evaluate a new model l’ on an existing evaluation set for which we already have an IRT model fitted. We can use clustering to identify “anchor points”—or high-signal evaluation samples—in the data and evaluate our model only on these samples. The number of anchor points is a hyperparameter that can change with our evaluation budget. We can then keep our existing item parameters fixed in the IRT model and only train the parameter for our new model θ_l’, using real evaluation results on our anchor points as training data. After obtaining θ_l’, we can predict performance on the remaining items by using our IRT model.

Efficiently estimating evaluation metrics with p-IRT (from [11])

A formal description of this approach, called the p-IRT estimator in [11], is outlined above. Put simply, we are interested in measuring the model’s actual performance on the full evaluation set, but running an entire benchmark is expensive. Instead, we use IRT model parameters to obtain K anchor points via clustering—where K is much smaller than the full dataset size—and only evaluate our model on these anchor points. Then, we can estimate performance on the rest of the evaluation dataset using the IRT model and derive an overall performance estimate by averaging real and predicted evaluation results; see above.

Beyond the p-IRT estimator, we can estimate performance with a sample average of the model’s performance on the anchor points only. This sample average has low bias because we are using correctness values obtained from our model on the actual evaluation data. However, the variance of the sample average is high when the number of anchor points K is small. On the other hand, the p-IRT estimator is biased—due to the fact that our IRT model is not perfectly accurate—but has low variance. Therefore, we can create an estimator that combines the strengths of both approaches by taking a convex combination of each estimate; see below.

IRT++ estimator (from [11])

This revised estimator is referred to as IRT++ in [11]. The per-item weight in this expression is optional but can be used to assign non-uniform weights to anchor points. For example, this weight can correspond to the ratio of evaluation samples present in the cluster used to derive a given anchor point. In [11], λ lies in the range [0, 1], and the optimal value of λ depends upon several factors (e.g., the number of anchor points and the variance of our performance estimate). The value of λ is derived in [11] by using a heuristic proposed in prior work.

Efficient evaluation. To test the efficacy of IRT-based performance estimation, four benchmarks are considered—Open LLM Leaderboard, MMLU [1], HELM, AlpacaEval 2.0 [13]—and we compare the estimated and actual performance on each benchmark. Training data for the IRT model is collected from a large number of LLMs—395 models for Open LLM leaderboard and MMLU, 30 models for HELM, and 100 models for AlpacaEval 2.0—to ensure the quality of the IRT model. The LLMs are split into training and test sets using two approaches:

Random: randomly sample a subset of LLMs to use for testing.
Date-based: use the most recent LLMs for testing.

As shown below, the proposed IRT-based estimators perform well across all scenarios considered. With as few as 100 anchor points per sub-domain of the evaluation set—a reduction of 140× for MMLU and 160× for the Open LLM Leaderboard—we can estimate performance with less than 2% error.

(from [11])

Fluid Language Model Benchmarking [12]

Most LLMs are evaluated in a static fashion (i.e., by computing accuracy on a fixed dataset). Whereas raw accuracy treats every evaluation sample equally, IRT estimates a model’s underlying capabilities, taking into account factors like the difficulty and discrimination of each question. Leveraging this insight, authors in [12] propose an approach called Fluid Benchmarking that uses an IRT model to dynamically select evaluation data for a particular model. The key idea behind this approach is that the value of an evaluation sample depends upon a model’s capabilities. Instead of assuming there is a single best subset of examples on which to evaluate an LLM, Fluid Benchmarking dynamically selects the most informative evaluation examples for a particular model and, in turn, provides a more accurate estimate of that model’s performance.

“Fluid benchmarking is based on the insight that the relative value of benchmark items depends on an LM’s capability level… a hard question might be too difficult for a weak LM, but informative for a strong LM.” - from [12]

Unidimensional IRT. Similarly to before, the approach in [12] fits an IRT model using a dataset of historical evaluation data derived from evaluating a large set of models on a benchmark of interest. However, a different IRT model structure is used in [12]. As shown below, this is again a two-parameter IRT model that is used to predict binary evaluation outcomes, but we use unidimensional—as opposed to the multidimensional approach used in [11]—model and item parameters. Authors mention testing a multidimensional IRT approach in [12] but found that this formulation performs poorly compared to a unidimensional IRT model.

Two-parameter unidimensional IRT model (from [12])

Despite the different IRT model structure used in [12], the purpose of these parameters remains the same:

θ_l: a scalar parameter that represents the capability of model l.
α_i: a scalar parameter that captures the discrimination7 of item i.
β_i: a scalar bias that represents the difficulty of item i.

(from [12])

The Fluid Benchmarking approach proposed in [12] is depicted above. There are two main phases for obtaining a benchmark result:

An offline (or historical) phase, where we fit item and model parameters in the IRT model from leaderboard-style results on our benchmark.
An online phase, where we learn the parameter of a new model given a subset of evaluation results for this model on our benchmark.

The IRT model is initially fit using an offline dataset of evaluation results. Given a new model l’, we first evaluate this model on a subset of our evaluation set to obtain some training data for the new model parameter θ_l’. Similarly to [11], we then leave item parameters fixed and fit only the new model parameter θ_l’ by using the actual evaluation data collected with our new model for training.

By examining the structure of our IRT model, we can intuitively understand how item parameter values can influence the value of θ_l’. Easy questions have a small (or negative) difficulty parameter β_i, so answering them correctly has minimal impact on θ_l’. On the other hand, correct answers to a difficult question will meaningfully impact the value of θ_l’. The same arguments hold in reverse for incorrectly-answered questions: answering a difficult question incorrectly is not a big deal, but easy questions will impact θ_l’ when answered incorrectly. The value of the discrimination parameter α_i impacts the magnitude of updates to θ_l’. Highly-discriminative items have large values of α_i, leading them to meaningfully impact the value of θ_l’ and vice versa.

Estimating performance. Instead of measuring performance with accuracy metrics, Fluid Benchmarking directly uses the value of θ_l’ as the performance metric for a model. While accuracy simply captures the ratio of items answered correctly in a benchmark, Fluid Benchmarking asks an inverse question: What capability level of our model is most likely to produce the pattern of incorrect and correct answers we observed? By answering this question, we can estimate performance in a way that meaningfully considers the difficulty and discrimination of each item in our evaluation set. Raw accuracy on a discrete evaluation dataset is a common proxy for measuring model capabilities. However, Fluid Benchmarking [12] forgoes this proxy, instead using IRT to directly estimate model capabilities.

“IRT draws upon existing LM evaluation results to enrich benchmarks with information about item difficulty and discrimination, which is leveraged to dynamically select items that match an LLM’s capability level… This contrasts with… static benchmarking, which assumes a globally optimal set of evaluation items for all LMs.” - from [12]

Dynamic sampling. The final detail necessary to understand Fluid Benchmarking is the data selection process. As mentioned previously, we use a subset of real evaluation results to estimate the parameter for a new model θ_l’. These items that are evaluated could be taken from a static evaluation set—this is a common approach in practice. However, Fluid Benchmarking argues that the set of items used for evaluation should be dynamically selected based on the model. For a weaker model, easier items will be more informative and vice versa.

(from [12])

Evaluation items are selected in [12] by computing the Fisher information of each item in the dataset. This metric prioritizes items that are most informative for a particular model by considering i) item discrimination and ii) item difficulty with respect to the capability level of the model being evaluated. Notably, the Fisher information changes depending on the capability level of a model. The figure above illustrates changes in the Fisher information during the training process. As the model continues training, it becomes more capable, leading to changes in the Fisher information that prioritize the selection of more difficult examples.

To select evaluation data based on the Fisher information, authors in [12] propose the following set of steps:

Start with an empty evaluation set.
Compute the Fisher information of all items.
Select the item with the highest Fisher information.
Compute the true evaluation score of this item.
Re-fit the model parameter using this new data.
Repeat the above steps until your evaluation budget is reached.

While most LLM evaluations are static, Fluid Benchmarking is dynamic—the data used for evaluation is adapted based on each model being evaluated. Such an approach demonstrates the incredible potential of IRT for both selecting data and measuring performance, as well as its overall versatility as a tool. Notably, a very similar data selection approach is adopted by the more recent ATLAS paper.

Does this work? In [12], authors focus on evaluating model checkpoints during the pretraining process. Six different open LLMs in the 7B parameter range are selected, and checkpoints are taken from these models evenly throughout their training process to arrive at a set of 102 LLMs for fitting the IRT model. All evaluation experiments are performed on the Open LLM leaderboard, which is a composite leaderboard comprised of six different benchmarks. A separate IRT model is fit for each benchmark in the leaderboard. As shown below, Fluid Benchmarking provides a stable and accurate estimate of model capabilities and is found to be effective for a wide range of different evaluation budgets.

(from [12])

DatBench: Discriminative, Faithful, and Efficient VLM Evaluations [10]

“We identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, do not represent downstream use-cases, and saturate early as models improve; (ii) blindly-solvable questions which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets.” - from [10]

Most popular Vision-Language Model (VLM) benchmarks have limitations that make research and progress difficult. Problems with these benchmarks include:

Data quality issues (e.g., incorrect labels or low-resolution images) that make solving certain questions overly difficult or impossible.
Blindly-solvable questions that can be solved purely based upon text priors without using the actual image.
Multiple choice questions that are easily reward hacked via guessing and do not much the generative style in which most VLMs are deployed.

Beyond these issues, the evaluation process alone is beginning to consume non-negligible compute for most models. LLM research is empirical, and as much as 20% (or even more) of total model development costs can be spent running evaluations. Based on this trend, we want to avoid wasted compute and ensure that the data in these benchmarks is actually useful for discerning model capabilities. Authors in [10] aim to solve these issues by developing and applying a targeted data curation approach over a wide set of VLM benchmarks to create DatBench, a composite benchmark that prioritizes high signal evaluation examples for VLMs.

(from [10])

Source data. The curation process in [10] begins with a large set of 33 evaluation datasets for VLMs that span the capability groups depicted above. A set of 27 state-of-the-art models ranging from 1-10B parameters is evaluated over these datasets, yielding a dataset of model evaluation results to use for data curation. From here, DatBench is constructed via a multi-step filtering process:

Converting multi-choice questions into a generative format.
Removing blind-solvable questions.
Filtering examples with incorrect or ambiguous ground truth.
(Optional) Identifying examples that yield maximum discrimination.

The last step of the pipeline is optional but can be used to sample a smaller amount of data that retains the ability to detect differences in model capabilities. Two different evaluation suites are created in [10]—DatBench and DatBench-Full—that cover distinct evaluation modes:

High-efficiency evaluation over a subset of data for rapid iteration.
High-quality evaluation over all data for cases with relaxed computational constraints and a need for better coverage.

For example, DatBench is most useful for ablation experiments, as we can lower inference costs and run faster experiments while still providing a useful capability signal. On the other hand, DatBench-Full can be used for final model reporting, which is run less often but requires comprehensively capturing the performance of a model. We will now outline each of the above curation steps in more detail.

Multiple choice to generative conversion. Practically, most VLMs are used in a generative fashion, where users ask questions to a model and the model generates a response for the user. However, many benchmarks used to evaluate VLMs ask question in a multiple choice format. Such a format can artificially inflate VLM performance due to the random guessing and the fact that selecting an answer is generally easier than generating that same answer from scratch.

(from [10])

DatBench reformulates multiple choice questions into a generative format where the VLM generates an answer that is verified against a ground truth answer using an LLM judge. In cases where multiple choice is structurally necessary, authors in [10] rely upon a circular evaluation approach. As shown in the figure above, converting multiple choice questions into a generative format leads to a noticeable drop in model performance, indicating that generative evaluation is harder for current VLMs and better reflects the current state of model capabilities.

Removing blind-solvable questions. One key insight from [10] is the fact that a surprising number of VLM evaluation samples can be solved without using any visual data; see below. Models can rely upon language priors to solve questions (or provide a high-probability guess), thus inflating the performance of VLMs with strong language backbones. To identify these cases, we can re-run evaluation while removing image inputs to identify those that are blind-solvable. In [10], the entire suite of 27 models is run in a blind fashion, and any questions that can be solved by at least one model are removed. Though this filtering approach is aggressive, the likelihood of a correct blind answer in a generative setup is relatively low, and the curation process begins with a large source dataset.

(from [10])

“In the first stage, we flag examples that all evaluated models answer incorrectly. Unanimous failure across a diverse suite of models typically indicates either a data quality issue or a genuinely difficult frontier case, both of which warrant closer inspection. In the second stage, a strong VLM judge (GPT-5.2) verifies each flagged sample with access to the ground-truth answer as privileged information.” - from [10]

Quality filtering. A two-stage pipeline is used in [10] to identify incorrect, low quality, and ambiguous evaluation data; see below. In the first stage, we flag any evaluation examples that are not solved by any model in the suite. These samples are usually either i) a data quality issue or ii) a valid frontier evaluation case.

(from [10])

To differentiate between these cases, we perform a second stage of filtering based upon a frontier-level VLM judge. In this stage, every flagged example is passed through the judge to determine whether it is correct an unambiguous. Such an approach is reliant upon the asymmetry of verification (i.e., verifying a provided solution to a problem should be easier than generating a valid solution).

(from [10])

In an effort to prioritize quality over quantity, any data identified as ambiguous, incorrectly labeled, or unsolvable due to insufficient image resolution is removed. As shown above, this stringent filtering policy results in relatively high ratios of discarded data in certain domains. For example, over 42% of the spatial reasoning data is removed from DatBench due ambiguity or issues with data quality

Discriminative selection. Given increasing costs of evaluation, we would like to sample an evaluation subset to reduce costs without degrading discriminability—or the ability to identify differences in performance. One common approach is to sub-select evaluation samples while optimizing for rank correlation to find a smaller evaluation dataset that ranks models in the same way. However, this approach is prone to overfitting on a particular evaluation suite. An evaluation subset can preserve model rankings while still having noisy data that does not genuinely capture difference in model capabilities—we prioritize model rankings without deeply capturing the kind of data that is actually being selected.

“The core optimization problem is not merely to maintain ranking stability, but to maximize total discrimination. By ensuring every sampled example possesses high discriminative power, we can implicitly guarantee robust ranking while maximizing the information content per inference token.” - from [10]

Authors in [10] propose a solution to these problems that is based upon IRT. Directly applying IRT to VLM evaluation would work poorly, as we do not have enough data. Specifically, each data point would need to be evaluated with hundreds of different models in order to fit a stable IRT model. We do not have anywhere near this amount of data—only 27 models are used in [10] and getting access to hundreds of state-of-the-art VLMs would be very difficult (if not impossible).

Point-biserial correlation

Instead of directly using IRT, data in [10] is selected based on information density, as captured by the point-biserial correlation (r_pb); see above. Computed per evaluation example, r_pb captures the relationship between scores on a single data point and global performance. As explained in [10]: “An item with high r_pb is one that strong models consistently answer correctly and weak models consistently miss; conversely, a low or negative r_pb indicates a noisy item.” The left term in the above equation captures the relative difference in global performance of models that get a given data point correct or incorrect, while the right term captures the ratio of models that get the data point correct or incorrect.

(from [10])

We select evaluation data in [10] by prioritizing examples with high r_pb per domain. To measure the total discriminative power of an evaluation subset, we can divide the total sum of selected r_pb score by the sum of r_pb scores across all data. As shown above, selecting data based upon r_pb allows us to preserve 90% of total discriminability with only 40% of the data, whereas rank correlation metrics saturate almost immediately. Interestingly, we also see that selecting all data is not optimal from the perspective of discriminative power. Noisy data (i.e., with low or negative r_pb) is left until the end of the selection process in [10].

The IRT-inspired approach is used to select 80% of evaluation data in [10], while the final 20% is manually reserved for frontier examples with low discriminative power. Namely, there exists a subset of evaluation data that has been validated by the LLM judge but is not answered correctly by any model. Any example in this subset will receive a low r_pb score because of the low ratio of correct model responses. However, such data captures legitimate frontier evaluation scenarios that should not be completely ignored within our evaluation dataset.

(from [10])

Key findings. Evaluation results on both DatBench and the original benchmarks are plotted above. Results on DatBench have a larger performance spread relative to those of the original benchmarks. For example, scores on general benchmarks range from 10-65% for DatBench versus 65-80% for original benchmarks, showing that DatBench mitigates benchmark saturation. In fact, just converting multiple choice questions to a generative format causes as much as a 35% performance drop. DatBench is found to yield a 13× speedup in the evaluation process while roughly matching the discriminative power of the original benchmarks.

(from [10])

We can also repurpose the evaluation artifacts created by the DatBench pipeline to diagnose common failure modes of VLMs. Specifically, authors in [10] make the following observations:

A tradeoff between perception and reasoning exists in VLMs. Models that perform well on higher-level semantic processing tasks have degraded low-level perceptual fidelity. Finding a model that balances performance on both semantic and perceptual tasks is rare.
An “overthinking” problem exists within current VLMs, meaning that significantly fewer tokens are used when answering questions correctly versus incorrectly; see above. This problem is especially pronounced in reasoning models, where we see that an average length of a correct and incorrect response is 425.2 and 1,196.9 tokens, respectively.
The dependence of VLMs upon language priors, which can be measured via the performance difference between normal and blind evaluation, varies per capability; see below. For example, counting and grounding rely heavily upon vision information, but math and spatial reasoning are found to rely more upon language priors to guess a correct answer.

Although many VLM benchmarks are shown to be noisy and inflated in [10], we can learn a lot about current state-of-the-art by addressing these problems and selecting evaluation data that accurately captures model performance. Once we can identify shortcomings in performance (e.g., overthinking and perceptual gaps), improving model capabilities in these specific areas is much easier.

Keys to Creating a Useful Benchmark

We have studied a wide variety of LLM benchmarks and evaluation techniques in this overview. Given the many practical details peppered throughout the papers we have seen, we can gain a lot by considering the common concepts that continually arise across disparate benchmarks. By identifying these trends, we can (hopefully) identify key design principles for making a useful benchmark.

Domain Taxonomy. Most popular LLM benchmarks categorize their data into a fixed set of domains and sub-domains. By doing this, we make it easier to debug an LLM’s performance, as we can compute domain-level metrics within the benchmark. Additionally, organizing a benchmark into such a taxonomy naturally ensures that data is diverse and covers a decent breadth of topics. Leveraging a taxonomy can also make the evolution of a benchmark simpler over time by granularly measuring saturation at a domain level and enabling researchers to individually evolve each domain (e.g., as in BIG-Bench Extra Hard).

Human annotation. Despite the prevalence of synthetic data within LLM research, nearly all successful evaluation benchmarks rely on human experts to annotate data in some way. Some benchmarks begin with questions written by human experts (e.g., FrontierMath), while others leverage human opinions to measure question difficulty or accuracy (e.g., GPQA). Even when synthetic data is being used, human verification of data quality is usually helpful (e.g., IFEval and IFBench). In fact, review by human experts is even used in some cases to improve the quality of large-scale data obtained from noisy sources (e.g., crowdsourcing). Even today, manual inspection is one of the most effective tools for LLM evaluation.

Model-in-the-loop. Although humans play a massive role in the evaluation process, augmenting human efforts with an LLM can be beneficial. For example, LLMs are often used for difficulty filtering by simply identifying the questions that they get wrong. Additionally, trends in model performance allow us to fit IRT models and even identify less informative subsets of data (e.g., blind-answerable data in DatBench). Model-based approaches are helpful for identifying areas of a benchmark that may contain mistakes that can be routed to human review. We can also use LLMs to efficiently generate or reformat evaluation data that is later verified by a human annotator (e.g., MMLU-Pro adopts such a strategy).

Data quality. The best evaluation benchmarks tend to pull from high-quality data sources. For example, popular math benchmarks include questions that are taken directly from recognized math competitions, and reasoning benchmarks like BIG-Bench are sourced from vetted sources such as other proven datasets (as in BIG-Bench Extra Hard) or questions that have been extensively verified with human review (as in the original BIG-Bench). In fact, manually written questions from human experts are another commonly-used source of evaluation data, but we must implement measures to ensure data quality. The GPQA curation pipeline is a great example of an effective system for ensuring data quality and difficulty.

Realistic. Benchmarks are an imperfect proxy for measuring what we actually care about: the capabilities of an LLM. Depending on the questions that it tests, a benchmark may or may not accurately reflect the true performance of an LLM in the real world. Ideally, we want our benchmark to accurately capture an LLM’s true capabilities on a given task. To achieve this, we should make sure that evaluation data is as realistic as possible. One great example of how to achieve this goal is CursorBench, a coding benchmark that directly sources evaluation data from real coding agent sessions in Cursor and constantly releases new benchmark versions to better capture recent trends in agent usage.

Evolution. The capabilities of frontier-level LLMs are advancing rapidly, which can lead to benchmark saturation. In order to remain relevant, a good benchmark must evolve (and improve) over time. One of the best examples of this trend is BIG-Bench, which was already saturated less than a year after its initial release. Instead of simply allowing the benchmark to become irrelevant, improved versions were consistently released, such as BIG-Bench Hard and BIG-Bench Extra Hard. Many datasets can remain relevant and useful if we are willing to adjust the difficulty and scope of the benchmark as LLMs improve.

New to the newsletter?

Subscribe now

Bibliography

[1] Hendrycks, Dan, et al. “Measuring massive multitask language understanding.” arXiv preprint arXiv:2009.03300 (2020).

[2] Wang, Yubo, et al. “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.” Advances in Neural Information Processing Systems 37 (2024): 95266-95290.

[3] Gema, Aryo Pradipta, et al. “Are we done with mmlu?.” Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.

[4] Rein, David, et al. “Gpqa: A graduate-level google-proof q&a benchmark.” First conference on language modeling. 2024.

[5] Srivastava, Aarohi, et al. “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.” Transactions on machine learning research (2023).

[6] Suzgun, Mirac, et al. “Challenging big-bench tasks and whether chain-of-thought can solve them.” Findings of the Association for Computational Linguistics: ACL 2023. 2023.

[7] Kazemi, Mehran, et al. “Big-bench extra hard.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.

[8] Zhou, Jeffrey, et al. “Instruction-following evaluation for large language models.” arXiv preprint arXiv:2311.07911 (2023).

[9] Pyatkin, Valentina, et al. “Generalizing verifiable instruction following.” arXiv preprint arXiv:2507.02833 (2025).

[10] Joshi, Siddharth, et al. “DatBench: Discriminative, Faithful, and Efficient VLM Evaluations.” arXiv preprint arXiv:2601.02316 (2026).

[11] Polo, Felipe Maia, et al. “tinyBenchmarks: evaluating LLMs with fewer examples.” arXiv preprint arXiv:2402.14992 (2024).

[12] Hofmann, Valentin, et al. “Fluid language model benchmarking.” arXiv preprint arXiv:2509.11106 (2025).

[13] Dubois, Yann, et al. “Length-controlled alpacaeval: A simple way to debias automatic evaluators.” arXiv preprint arXiv:2404.04475 (2024).

[14] Vivek, Rajan, et al. “Anchor points: Benchmarking models with much fewer examples.” Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

[15] Xu, Cong, et al. “Data efficient evaluation of large language models and text-to-image models via adaptive sampling.” arXiv preprint arXiv:2406.15527 (2024).

[16] Perlitz, Yotam, et al. “Efficient benchmarking (of language models).” Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024.

[17] Kipnis, Alex, et al. “metabench--A Sparse Benchmark of Reasoning and Knowledge in Large Language Models.” arXiv preprint arXiv:2407.12844 (2024).

See page 15 of the MMLU paper [1] for an itemized list of all 57 tasks.

See pages 5-6 of the GPQA paper [5] for a list of all sub-domains.

Here, we interpret the probability score assigned by the model to a certain multiple choice answer option as the model’s confidence in that option.

A full list of filtering criteria and associated rationales can be found in Appendix D on Page 48 of the BIG-Bench Hard paper.

The paper does not explicitly state how the base prompts are sourced. Authors just mention that they “generate a set of base prompts”.

The level of overfitting on IFEval might also be caused by the simple fact that this benchmark is constantly tested by model developers as new models are being created. Therefore, new models are naturally selected based on their performance on this benchmark (and other popular benchmarks).

An item that is more discriminative creates a separation between stronger and weaker models. This is an item that, if answered correctly, indicates that a model is capable.

Applying Statistics to LLM Evaluations

Cameron R. Wolfe, Ph.D. — Mon, 09 Mar 2026 09:33:37 GMT

(from [1, 2, 3])

Research on large language models (LLMs) is empirically driven. For this reason, model evaluations play a pivotal role in the field’s progress. We improve models by making changes, evaluating them, and iterating. Despite their foundational role, however, evaluations are usually handled in a naive manner. In most cases, we just test a model’s performance over a finite evaluation dataset and directly compare performance metrics to those of other models with no consideration for whether these results are statistically significant or not. Such an approach leads to incorrect or misleading interpretations of evaluation results. As researchers, we want to avoid mistaking noise for progress and instead equip ourselves with the statistical tools needed to run informative model evaluations.

“Language models are measured in the literature by evaluations, or evals. Evals are commonly run and reported with a highest number is best mentality; industry practice is to highlight a state-of-the-art result in bold, but not necessarily to test that result for any kind of statistical significance.” - from [1]

In this overview, we will build a statistical foundation for LLM evaluations from the ground up. To begin, we will review basic statistical ideas with a practical focus on the topics that are most useful for model evaluations. We will then take a deeper look at how these ideas can be directly used to interpret LLM evaluation results in an uncertainty-aware manner. Specifically, we will cover a set of statistical best practices for model evaluation and implement each of them to show how they can be concretely applied. Although it may seem daunting, taking a statistically grounded approach to model evaluation is not especially difficult and can help us make faster progress by avoiding spurious results.

Basic Statistics for LLM Evaluations

In order to develop a statistical framework for LLM evaluations, we need to first learn about the fundamental tools from statistics that can be used to create such a framework. This section will cover a selection of topics related to the properties of random variables, such as computing the mean or variance and constructing a confidence interval. After covering the fundamentals, we learn how these ideas can be applied to properly analyze LLM evaluation results in the next section.

Random Variables and Estimators

A random variable X is defined as a quantity that has a value dependent upon chance. We can take several independent samples from the distribution {x_1, x_2, …, x_n}, and the values of these observations will be sampled from the distribution of X (i.e., x_i ~ X). We define the mean (or average) of this random variable via the expectation, which can be computed in a continuous or discrete fashion as shown in the figure below. Additionally, we can compute a sample mean by averaging the values of n observations sampled from the distribution.

Mean and sample mean

Formally, the lower case letters x_i represent concrete values sampled from a distribution, while upper case letter X_i denotes the i-th random variable in our sample—this is a notational detail, but it’s worth covering to avoid confusion. For example, if we evaluate our LLM on n questions, X_i is a random variable that represents the distribution of possible scores for question i1, while x_i is an actual evaluation score observed for a single evaluation run. We can also define the sample mean in terms of random variables as shown in the equation below. We use an uppercase X̄ in this case because we are defining a random variable.

Sample mean with random variables

The distribution of our random variable X also has variance Var(X), which describes how “spread out” the distribution is around the mean. In this overview, we will assume that this variance is finite (i.e., less than infinity). If we have a distribution with high variance, then samples taken from this distribution will be more spread out around the mean and vice versa; see below for an illustration.

The expression for Var(X) is provided below. Similarly to the sample mean, we can also estimate variance using a fixed set of samples from our distribution X—this is how the variance is usually computed in practical settings. We can also compute the standard deviation σ by taking the square root of the variance. The variance and standard deviation describe the variability of individual samples from X.

Variance and standard deviation

While variance measures the variability of a single random variable X, covariance measures how two random variables X and Y vary together. Intuitively, if these variables vary in the same direction (e.g., they are both above or below their means at the same time), then their covariance will be positive and vice versa. A covariance near zero indicates there is no clear relationship between X and Y. We can also compute a sample covariance similarly to the sample variance shown above. Expressions for covariance and sample covariance are provided below.

Covariance and sample covariance

The law of total variance is a useful identity that decomposes the variance of a random variable X with respect to another random variable Y; see below. For the purposes of this overview, this law is useful because it lets us separate multiple sources of randomness in an evaluation result. Later, we will use it to decompose the variance of an evaluation score into two key components:

Variability due to the question sampled for evaluation.
Within-question variability arising from stochastic generation by the LLM or an LLM judge.

The law of total variance

Standard Error and Sample Means

If we repeatedly draw samples from X and compute the sample mean, we will get a different result every time. The resulting sample means form a sampling distribution (i.e., basically a list of sample means we have drawn). The standard deviation of this sampling distribution is called the standard error of the sample mean. Put simply, the standard error is just the standard deviation over sample means. While the standard deviation captures variability in individual data points x_i sampled from X, the standard error captures variability in the sample mean estimator (i.e., the spread of sample means after computing it multiple times with different samples). A formal definition of the standard error is provided below, as well as an estimator for the standard error that uses sample standard deviation2 because the true value of σ is rarely known in practice and must be estimated.

Standard error

This standard error equation makes the assumption that samples drawn from X are independent and identically distributed (IID). Independence implies that Cov⁡(X_i,X_j) = 0 for i≠j, and identical distribution implies that each X_i has the same variance Var(X). From this assumption and a few other properties of the variance, we can derive the above expression for the standard error as shown below. The assumption of IID samples is not always satisfied—we should only use this expression when the samples being drawn are truly independent.

Full derivation of standard error (SE) expression

Within this derivation, we use the variance of a sum identity, which can be generally expressed as shown in the equation below. This identity allows us to capture the (non-zero) covariance terms within our variance expression.

Variance of a sum identity

Bernoulli variables. Let’s assume X is a Bernoulli random variable, meaning that our scores are binary x_i ∈ {0, 1}. In this case, our standard error expression can be simplified even further. To begin, we know that E[X] = 1×P(X=1) + 0×P(X=0) = P(X=1). Given that the values of X are either zero or one, it is also true that E[X^2] = E[X] because x^2 = x when x = 0 or x = 1.

We can easily plug these two identities into our prior expression for the variance Var(X) = E[X^2] - (E[X])^2 = Pr(X=1) - (Pr(X=1))^2 = μ(1 - μ), where μ is the mean of X. Practically, we can estimate μ by taking a sample mean X̄. Then, we can plug this simplified Var(X) into our previous formula for the standard error, yielding the simplified expression shown below. Therefore, we can use this simpler standard error expression when the values of X are binary.

Standard error of Bernoulli variable

Law of Large Numbers and the Central Limit Theorem (CLT)

The law of large numbers is a fundamental concept in statistics that builds upon our prior definition of the sample mean. Given a random variable X, we are often interested in its true mean μ. This mean can be estimated with the sample mean over n samples, but this is a random estimate that can differ from μ. The law of large numbers tells us that as the value of n increases, the sample mean will approach (i.e., converge in probability) the true mean μ; see below.

Expression for the law of large numbers

The law of large numbers only tells us that the sample mean will eventually settle around μ with sufficiently large n. It does not tell us how much the sample mean differs from the true mean at finite n or how quickly we converge to μ as n increases. We can express the intuition for the law of large numbers as follows: with enough data, our estimator (i.e., the sample mean) approaches the true mean.

Standardization and z-score. Given a random variable X (or a realized value x), we can standardize by subtracting the mean μ and dividing by the standard deviation σ; see below. This process produces a standardized random variable Z (or a realized value z). The z-score z3 indicates how many standard deviations—in units of σ—the value x lies above (z > 0) or below (z < 0) the mean μ.

Any variable or value can be standardized in this way. For example, we will next standardize the sample mean while formulating the Central Limit Theorem.

The Central Limit Theorem (CLT) goes beyond the law of large numbers by describing how our sample mean estimates will be distributed around the true mean μ. Our random variable X has a mean of μ, and we estimate this mean with a sample mean. We know from our prior derivation that this sample mean has a variance of σ^2 / n (assuming IID random variables and finite variance σ^24).

Using this mean and variance, we can standardize the sample mean to obtain Z_n by subtracting the mean and dividing by the standard error; see below. The denominator of Z_n is our previous equation for standard error—this is just the standard deviation of our sample mean! We rarely know the actual value of σ, so we can estimate the true value with the sample standard deviation s.

The Central Limit Theorem (CLT)

The CLT tells us that the distribution of Z_n will converge to a standard normal distribution 5—meaning a normal distribution with a mean of zero and variance of one—as the value of n increases. Stated differently, this means that the distribution of our sample mean becomes approximately normal with sufficiently large n, as shown in the orange distribution above. From this information, we know that the standard deviation of the sample mean distribution will decrease proportionally to 1 / sqrt(n) and the error of our sample mean estimate is on the order of σ / sqrt(n)—the standard deviation of the above distribution.

Confidence Intervals

Consider a random variable X with a true mean μ that we estimate with the sample mean X̄_n computed from n samples. To quantify the uncertainty of this estimate, we can next compute a 95% confidence interval that has the following form: x̄_n ± y. This confidence interval indicates that if we repeated the sampling procedure many times and recomputed this confidence interval each time, 95% of the resulting confidence intervals would contain the true mean μ. Our goal is to find the value of y that statistically yields such a 95% confidence interval. To find a formula that allows us to compute this confidence interval, we actually need to combine all of the ideas we have learned so far.

First, let’s consider our sample mean estimator X̄_n. Assuming IID samples with finite variance, we know from the CLT that this estimator has an approximately normal distribution N(μ, σ^2 / n) assuming that the value of n is sufficiently large, as well as a standard error given by SE(X̄_n) = σ / sqrt(n). When computing a 95% confidence interval, we consider a normal distribution N(0, 1) and try to find a bound that includes 95% of the probability mass for this distribution; see below for an illustration.

95% CI for a standard normal distribution

Given a standard normal distribution, we have P(|Z| < 1.96) = 0.95. This is a two-sided confidence interval, meaning 2.5% of the total 5% of probability mass outside our confidence interval is allocated to each side of the distribution. In most cases, however, we will want to compute a confidence interval for a non-standard normal distribution. To do this, we can just standardize the distribution as discussed previously. For example, given our distribution N(μ, σ^2 / n) from the CLT, we can derive a standardized variable Z that follows a standard normal distribution. From here, we can just transform the confidence interval with the same standardization process; see below.

Computing 95% CI for a normal distribution

This approach yields a formula—based upon our sample size and the standard error of our sample mean—that can be used to compute a 95% confidence interval.

A Statistical Approach to LLM Evaluations [1]

Now that we have built a solid statistical foundation, we can use these ideas to create a framework for LLM evaluations that better quantifies uncertainty. In doing this, we can be more confident in our model evaluations and understand whether certain evaluation results are legitimate or just caused by noise. Our discussions will be based on a seminal paper from Anthropic [1] that provides several key recommendations for performing LLM evaluations in a way that is grounded in statistics, rather than just comparing raw performance metrics.

“Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations.” - from [1]

Statistical framing for LLM evaluations. In theory, when evaluating an LLM, there exists a super-population of questions (illustrated below) that exhaustively covers all the ways in which the LLM can be evaluated. Practically speaking, any evaluation dataset represents only a finite subset of questions from this super-population, as represented by the red shaded region in the figure below.

Sampling from a super-population

This framing can be used to rethink our perspective on model evaluations. Instead of trying to maximize the performance of our model on a finite benchmark, we should be trying to improve an underlying skill of the model. Any evaluation dataset captures a corresponding skill imperfectly, as it is only a finite sample from the super-population that is associated with that skill.

Key recommendations. There are a set of concrete recommendations proposed in [1] that outline how one can approach LLM evaluations in a rigorous manner. We first outline these recommendations here, then spend the rest of this section explaining each of them in more depth:

When questions are IID, LLM evaluation results should be accompanied by standard errors that are computed using the CLT.
If questions are not IID (e.g., drawn from related clusters or groups), then our CLT standard error formula is no longer valid and we should instead compute a clustered standard error.
To reduce the variance of evaluation results, we can re-sample outputs from the LLM multiple times—or even analyze next token probabilities—to better account for the variance of each individual evaluation result.
When comparing two models, we can perform analysis of their paired difference (i.e., rather than just providing separate, aggregated evaluation scores over the dataset) to yield a more confident result.

Preliminaries. The evaluation dataset in [1] is assumed to contain n questions, and each question receives an evaluation score s_i; e.g., a binary correctness signal or an LLM-as-a-Judge score. A score can be decomposed as s_i = x_i + ϵ_i, where x_i is the expected score (i.e., E[s_i] = x_i) and ϵ_i adds randomness to the score. We assume zero-mean randomness (i.e., E[ϵ_i|i] = 0) that does not change the expected score. Put simply, this setup models a non-deterministic evaluation setting. Notably, LLM evaluation is fundamentally non-deterministic, as it involves sampling from the next token distribution of one or more LLMs (i.e., the model being evaluated and possibly an LLM judge).

Standard Errors and the CLT

The simplest case when analyzing evaluation results is when each question i is independent. Our goal in analyzing an evaluation result is to understand the true performance of our model, represented by the mean score μ = E[s] = E[x]6 from our super-population. We only have access to a finite set of scores from our evaluation dataset. However, we know from the law of large numbers that we can estimate the true mean by taking a sample mean s̅ over a finite set of evaluation scores. This estimator approaches μ as the value of n becomes larger.

Standard error and confidence interval for LLM evaluations (from [1])

In other words, taking an average score over a large number of independently-sampled questions generally provides a good estimate of a model’s true performance. However, “good” is difficult to quantify, and how do we know if n is sufficiently large? To quantify uncertainty, we can use the CLT to compute the standard error for our sample mean; see above. As we can see, this expression is identical—other than switching x with s—to our previously-derived standard error expression. We can also derive a confidence interval from the standard error similarly to before.

Standard error with a Bernoulli variable (from [1])

If we assume a Bernoulli distribution—meaning that for all i we have s_i ϵ {0, 1}—this expression can be simplified even further; see above. However, the Bernoulli formula requires that scores are truly binary (i.e., not fractional7).

“We suggest reporting the standard error of the mean alongside (beneath) the mean when reporting eval scores.” - from [1]

Now that we know how to compute these quantities for an LLM evaluation, the recommendation in [1] is simple: just report this standard error and the number of samples n alongside the actual evaluation result. Computing this standard error is not difficult—it requires forming a sample estimate of the standard deviation of s. A toy example of the proposed reporting structure for two models evaluated over three evaluation datasets is provided in the table below for reference.

(from [1])

From the standard error, we can compute a confidence interval for each model’s evaluation metric. These intervals summarize uncertainty in the estimated mean performance. When comparing models, non-overlapping confidence intervals suggest a real performance difference, but overlapping intervals do not by themselves rule one out. A precise comparison requires directly analyzing the difference between the models, which we will handle in a future section.

As an example, confidence intervals for the table above have been computed below for all model and dataset combinations. We see here that all models have overlapping confidence intervals. In future sections, we will learn methods that can be used to compare models with a greater level of precision.

Confidence intervals for model evaluation scores

Bootstrapping is another common approach to use for evaluating machine learning models (including LLMs) that proceeds as follows:

Sample n question scores with replacement.
Compute the sample mean s̅.
Repeat steps 1-2 multiple times.
Measure the standard deviation of these sample means.
Use this standard deviation as an estimate of the standard error.

While this approach is valid and commonly used in LLM evaluations, authors in [1] argue that bootstrapping is unnecessary when the CLT is valid. Therefore, we can just use the CLT when questions are sampled independently, n is sufficiently large, and the variance of our scores is finite. However, the CLT does fall short when n is small—the handling of this evaluation regime is discussed extensively in [2].

Clustered Errors

“We show how to use clustered standard errors, a technique developed in the social sciences, to account for the dependence and correlation structure present in question clusters.” - from [1]

If questions are not sampled independently, the standard error expression from the CLT is no longer valid. In this case, the CLT underestimates uncertainty—our confidence intervals are too narrow. We are evaluating on n questions, but some of the questions are actually related to each other. As a result, the “effective” number of evaluation questions is smaller than n, thus increasing the standard error. Some practical examples of non-independent questions include:

The same prompt in different languages.
Prompts that reference the same document or source.
Questions that are generally related in format or topic.

To avoid underestimating uncertainty, authors in [1] recommend using a clustered standard error. We use s_{i, c} to denote the score for question i in cluster c. The cluster-adjusted standard error assumes that clusters are independent: questions in a cluster can be correlated, but questions across clusters cannot.

To evaluate an LLM on these clusters, we still compute the sample mean across all question scores S̅, but we modify our standard error expression. Before, we assumed that scores S_i were IID, which implies that Cov(S_i, S_j) = 0 when i ≠ j. When questions are clustered, we no longer have zero covariance, so we need to adjust our derivation of the standard error; see below.

The above clustered standard error equation interpolates between two cases:

Scores within a cluster are perfectly correlated and each cluster is treated as if it were a single question i.
Scores within a cluster have no correlation, so our expression reduces to the original standard error expression from the CLT.

“The clustered standard error acts as a kind of sliding scale between cases where scores within a cluster are perfectly correlated (in which case each cluster acts as a single independent observation) and perfectly uncorrelated (in which case the clustered standard error is equivalent to the unclustered case). The intra-cluster correlations… are captured by the triple summation (over clusters and cross-terms within clusters).” - from [1]

When questions are not sampled independently, authors in [1] recommend reporting cluster-adjusted standard errors, as well as the number of questions n and the number of clusters C; see below. Similarly to before, the cluster-adjusted standard error can be used to compute a confidence interval. In practice, the clustered standard error may be drastically larger than the CLT standard error. For example, authors provide a concrete example in [1] where the standard error increases by 3× when accounting for clusters. Failing to consider whether questions are actually independent can drastically impact the interpretation of evaluation results.

(from [1])

We assume that questions are sampled independently in future sections unless stated otherwise. However, we can use similar steps as outlined above to derive most results in a cluster-adjusted fashion. Many of the derivations extend to the clustered setting once the covariance structure is accounted for appropriately.

Reducing Variance

We now understand how to compute standard errors and confidence intervals for our evaluation results. The next reasonable question to ask is: What can we do to reduce the standard error? First, recall that our evaluation score is defined as s_i = x_i + ϵ_i, where we have E[s_i] = x_i and Var(ϵ_i) = σ_i^2. To answer this question, we begin with our expression for the standard error and perform a decomposition with the law of total variance; see below.

To apply the law of total variance, we use the following two random variables:

A random variable over evaluation scores S.
A random variable over the question that gets sampled I.

We apply the law of total variance by conditioning S on I, where X_I = E[S|I] is the expected score for the sampled question I. We can then further simplify the equation using known properties of the mean and variance of a score.

(from [1])

This derivation yields the variance expression shown above, which provides some actionable insights. First, we see that the simplest method for reducing variance is simply increasing n—evaluating over a larger set of questions naturally improves reliability. Additionally, Var(x) captures the variability in the mean score across our evaluation dataset—this is a fundamental property of our super-population that cannot be easily changed. In simple terms, this quantity captures the spread in question difficulty across all possible evaluation questions. However, there are several approaches we can explore for decreasing the value of E[σ_i^2].

Resampling can be used to reduce score variance when evaluating any model. Instead of generating and scoring a single output per question, we generate and score K outputs for the same question i (i.e., by sampling multiple completions from the LLM). In [1], authors assume that resampled scores for a fixed question i are IID. After sampling K scores, we can take an average of the scores S̅_i, which decreases the score variance by a factor of K; see below for a full derivation.

Therefore, resampling—or producing K scores for question i to yield a mean score S̅_i—provides a linear reduction of the within-question variance σ_i^2 compared to using a single score. The variance for our sample mean has two key terms—Var(x) and E[σ_i^2]—that are summed together in the numerator. As mentioned before, Var(x) is not mutable, so to reduce variance we can—in addition to increasing n—increase the value of K until E[σ_i^2 / K] ≪ Var(X). By doing this, the within-question variance term shrinks toward zero and the variance of our sample mean approaches Var(x) / n; see below.

Token probabilities. If an evaluation metric can be computed from the model’s next token probabilities, we can replace a sampled score with its conditional expectation—basically just the probability of the correct response—and remove the within-question variance (i.e., meaning that σ_i^2 = 0). Using output token probabilities, we can easily compute the probability of a response from our LLM. For example, if our response is just a single token (e.g., a multiple choice answer), then we know that the probability for this score is the probability of that token within the LLM’s next token distribution. If our response is more complex (i.e., multiple tokens), then we can also compute the probability of the entire response via the product of probabilities for each individual token; see below.

Probability of a multi-token response

For a question i, we will refer to the probability of a response to this question as p_i. If we have access to this probability, then we can use s_i = x_i = p_i, and the variance term for our score goes away (i.e., σ_i^2 = 0). As a result, directly using token probabilities is an effective variance reduction technique. In [1], authors also recommend against changing the sampling temperature—both for resampling and with token probabilities—because this alters the underlying response distribution and, in turn, the evaluation target. For this reason, these results study a different model configuration that is not fully comparable to our original LLM.

“We recommend a two-pronged variance-reduction strategy. When next-token probabilities are available, and the LLM eval can be conducted using next-token probabilities (i.e. without token generation), compute the expected score for each question, and compute the standard error of expected scores across questions. When next-token probabilities are not available, or the answer requires a chain of thought or other complex interaction, choose a K such that E[σ_i^2] / K ≪ Var(x) and compute the standard error across question-level mean scores. In neither case should the sampling temperature be adjusted for the sake of reducing variance in the scores.” - from [1]

Going further, we should note that this approach cannot be used in all cases. First of all, many closed LLMs do not provide direct access to token probabilities. Even if these probabilities are available, using them to compute p_i can be complex depending on the evaluation setup. For example, long-form responses with many tokens—though their probability can be computed—will usually be evaluated with an LLM judge, which uses a sampling procedure of its own and, therefore, adds variability into the resulting score. Additionally, recent reasoning models output a reasoning trajectory alongside their final response, which makes computing the output probability more complicated. In these cases, correctly computing p_i is not straightforward, and we cannot assume zero variance by setting x_i = p_i. In these cases, the resampling strategy described above is a better approach.

Model Comparisons

Now that we deeply understand how to analyze an evaluation score for a single model, we need to focus more on properly comparing the evaluation results of multiple different models. Usually, the goal of evaluation is to understand the performance of a model with respect to other models; e.g., determining if a new model version is better than the current or creating a leaderboard of the best models for a certain evaluation task. Although the techniques we have learned about so far can be applied to comparing evaluation results, we can usually make comparisons more statistically efficient by performing a pairwise analysis.

Difference of means. As we saw when learning about standard errors and confidence intervals, a common comparison heuristic is to compute separate confidence intervals for multiple models and check whether they overlap. If two 95% confidence intervals do not overlap, then there is a statistically significant difference between the evaluation results. As we will see, however, this test is actually overly conservative for detecting performance differences—intervals can overlap even when there is a statistically significant difference in mean scores. Instead, we can analyze the difference in mean between two models; see below. We will refer to the two models being compared as model A and model B for simplicity.

We can compute the standard error of the estimated difference in mean scores; see below. The standard error of the estimated difference in means is the square root of the sum of the variances of the mean estimators for models A and B.

In this derivation, we use the variance of a difference identity, as expressed below. This identity is a special case of the variance of a sum identity we saw previously. In [1], authors consider an unpaired comparison where S̅_A and S̅_B are treated as estimates from independent evaluation runs (e.g., computed on independent question samples) such that Cov(S_A, S_B) = 0. This unpaired assumption could be violated (e.g., if models are evaluated over the same set of questions)—we should use the paired analysis from the next section in this case.

Variance of a difference identity

We can easily compute a 95% confidence interval using this standard error. To determine if one model is better than the other, we can check whether this confidence interval overlaps with a value of zero. If the 95% confidence interval does not include zero, then—assuming the true difference is zero—there is less than a 5% chance that we would observe a difference this extreme. Our expression for computing a 95% confidence interval has been copied below for convenience.

For model A to outperform model B according to this confidence interval, the difference in the mean score of models A and B must be greater than 1.96 × sqrt(SE_A^2 + SE_B^2). If we compute separate confidence intervals for each model, then this same difference must be greater than 1.96 × (SE_A + SE_B), which is stricter. In this way, checking overlap of separate confidence intervals is conservative, while constructing a confidence interval using the difference—and checking whether it excludes zero—is a better test.

Paired difference. If models A and B evaluate on the same set of questions, we can further reduce variance by analyzing the question-level differences in scores. To begin, we can define question-level paired score differences as shown below.

We can then estimate the standard error of question-level score differences by drawing upon our same standard error expression used previously; see below. We can then use this standard error to compute confidence intervals like before.

Standard error of the mean score difference (from [1])

We can compute this standard error as shown above, but we are mostly interested in understanding whether this expression provides a meaningful reduction in variance. Ideally, we want the above paired standard error to be smaller than that of the difference of means so that we can better detect statistically significant model differences. To determine if this is the case, we can expand the above variance expression using the variance of a difference identity; see below. Unlike the prior unpaired analysis, we no longer assume that this covariance is zero.

The variance reduction for the above expression depends upon whether the mean scores of models A and B are correlated or not. If so, then this covariance term will be positive and we will see a corresponding reduction in variance. Intuitively stated, a positive correlation indicates that models A and B agree on whether certain prompts are easy or hard (i.e., per-item scores are directionally similar).

“Because eval question scores are likely to be positively correlated, even across unrelated models, paired differences represent a “free” reduction in estimator variance when comparing two models. We therefore recommend using the paired version of the standard error estimate wherever practicable.” - from [1]

In practice, most LLMs tend to agree on per-prompt difficulty, so analyzing paired differences is a useful approach that can offer meaningful reductions in variance. In [1], authors recommend reporting pairwise differences, standard errors, confidence intervals, and score correlations between models; see below.

(from [1])

Practical Implementation

Although we have learned a lot of statistics throughout this discussion, actually implementing these ideas—once we understand them—does not add much extra complexity to the evaluation process. Computing standard errors and confidence intervals is straightforward and, once an implementation is available, can be readily adopted as a standard practice for model evaluations. However, we must be wary of the key assumptions being made when computing the standard error to avoid overconfidence; e.g., questions that are not independent require a cluster-adjusted standard error. A reference implementation of the techniques we have learned so far is provided below. This implementation outlines how all of the recommendations proposed in [1] can be applied when evaluating an LLM.

import math
import numpy as np


#################################
# Evaluation settings and scores
#################################

# model names
model_a_name = "Galleon"
model_b_name = "Dreadnought"

# example scores for the models (n = 10)
model_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=float)
model_b = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1], dtype=float)

# form toy clusters
# two clusters, assignment based on even / odd index
clusters = np.array([i % 2 for i in range(len(model_a))])

# z score for 95% confidence interval
Z_95 = 1.96


####################
# Utility functions
####################

def mean_score(scores):
    return float(np.mean(scores))

def sample_sd(scores):
    return float(np.std(scores, ddof=1))

def ci_95(mean, se):
    return (mean - Z_95 * se, mean + Z_95 * se)

def fmt_pct(x):
    return f"{100 * x:.1f}%"

def fmt_pct_paren(x):
    return f"({100 * x:.2f}%)"

def fmt_ci(ci):
    return f"({100 * ci[0]:.2f}%, {100 * ci[1]:.2f}%)"


########################################################
# CLT SE
#
# Standard CLT SE for the sample mean: SE = s / sqrt(n)
# where s is the sample standard deviation.
########################################################

def clt_standard_error(scores):
    n = len(scores)
    return sample_sd(scores) / math.sqrt(n)


#####################################################
# Clustered SE
#
# Cluster-adjusted standard error:
# \sqrt{
#     SE_{CLT}^{2}
#     +
#     \frac{1}{n^{2}}
#     \sum_{c}\sum_{i}\sum_{j\neq i}
#     (s_{i,c} - \overline{s})(s_{j,c}-\overline{s})
# }
#####################################################

def clustered_standard_error(scores, clusters):
    n = len(scores)
    s_bar = np.mean(scores)

    # clt variance
    se_clt_sq = clt_standard_error(scores) ** 2

    # within-cluster cross terms
    cross_terms = 0.0
    for c in np.unique(clusters):
        idx = (clusters == c)
        residuals = scores[idx] - s_bar
        cross_terms += residuals.sum()**2 - np.sum(residuals**2)

    var_hat = se_clt_sq + (cross_terms / (n**2))
    return math.sqrt(var_hat)


######################################
# Summary statistics for single model
######################################

def summarize_model(scores, clusters):
    mean = mean_score(scores)

    se_clt = clt_standard_error(scores)
    ci_clt = ci_95(mean, se_clt)

    se_cluster = clustered_standard_error(scores, clusters)
    ci_cluster = ci_95(mean, se_cluster)

    return {
        "mean": mean,
        "se_clt": se_clt,
        "ci_clt": ci_clt,
        "se_cluster": se_cluster,
        "ci_cluster": ci_cluster,
        "n": len(scores),
        "num_clusters": len(np.unique(clusters)),
    }


###########################################################
# Comparing models w/ difference in means and separate SEs
###########################################################

def difference_in_means(scores_a, scores_b, clusters_a=None, clusters_b=None):
    mean_a = mean_score(scores_a)
    mean_b = mean_score(scores_b)
    diff = mean_a - mean_b

    # CLT version
    se_a_clt = clt_standard_error(scores_a)
    se_b_clt = clt_standard_error(scores_b)
    se_diff_clt = math.sqrt(se_a_clt ** 2 + se_b_clt ** 2)
    ci_diff_clt = ci_95(diff, se_diff_clt)

    # Clustered version
    if clusters_a is not None and clusters_b is not None:
        se_a_cluster = clustered_standard_error(scores_a, clusters_a)
        se_b_cluster = clustered_standard_error(scores_b, clusters_b)
        se_diff_cluster = math.sqrt(se_a_cluster ** 2 + se_b_cluster ** 2)
        ci_diff_cluster = ci_95(diff, se_diff_cluster)
    else:
        se_diff_cluster = None
        ci_diff_cluster = None

    return {
        "diff": diff,
        "se_diff_clt": se_diff_clt,
        "ci_diff_clt": ci_diff_clt,
        "se_diff_cluster": se_diff_cluster,
        "ci_diff_cluster": ci_diff_cluster,
    }


##########################################################
# Comparing models with paired instance-level differences
##########################################################

def paired_differences(scores_a, scores_b, clusters=None):
    diffs = scores_a - scores_b
    mean_diff = mean_score(diffs)

    se_paired_clt = clt_standard_error(diffs)
    ci_paired_clt = ci_95(mean_diff, se_paired_clt)

    if clusters is not None:
        se_paired_cluster = clustered_standard_error(diffs, clusters)
        ci_paired_cluster = ci_95(mean_diff, se_paired_cluster)
    else:
        se_paired_cluster = None
        ci_paired_cluster = None

    # Correlation between model scores across questions
    corr = float(np.corrcoef(scores_a, scores_b)[0, 1])

    return {
        "diffs": diffs,
        "mean_diff": mean_diff,
        "se_paired_clt": se_paired_clt,
        "ci_paired_clt": ci_paired_clt,
        "se_paired_cluster": se_paired_cluster,
        "ci_paired_cluster": ci_paired_cluster,
        "correlation": corr,
    }


############################################
# Functions to recreate tables from paper
#
# Numbers will not match exactly because we
# hard-code model scores at top of script
############################################

def print_table_2_style(model_name, summary):
    print(f"| {model_name:12s} | "
          f"mean = {fmt_pct(summary['mean']):>6s}  "
          f"SE = {fmt_pct_paren(summary['se_clt']):>8s}  "
          f"95% CI = {fmt_ci(summary['ci_clt'])}  "
          f"n = {summary['n']}")

def print_table_3_style(model_name, summary):
    print(f"| {model_name:12s} | "
          f"mean = {fmt_pct(summary['mean']):>6s}  "
          f"clustered SE = {fmt_pct_paren(summary['se_cluster']):>8s}  "
          f"95% CI = {fmt_ci(summary['ci_cluster'])}  "
          f"n = {summary['n']}, clusters = {summary['num_clusters']}")

def print_table_5_style(model_name, baseline_name, paired_results, clustered=False):
    if clustered:
        se = paired_results["se_paired_cluster"]
        ci = paired_results["ci_paired_cluster"]
        label = "paired clustered"
    else:
        se = paired_results["se_paired_clt"]
        ci = paired_results["ci_paired_clt"]
        label = "paired CLT"

    print(f"| {model_name:12s} | baseline = {baseline_name:12s} | "
          f"diff = {fmt_pct(paired_results['mean_diff']):>6s}  "
          f"SE = {fmt_pct_paren(se):>8s}  "
          f"95% CI = {fmt_ci(ci):>18s}  "
          f"corr = {paired_results['correlation']:.3f}  "
          f"[{label}]")


################################
# Run all evaluation statistics
################################

def main():
    summary_a = summarize_model(model_a, clusters)
    summary_b = summarize_model(model_b, clusters)

    print("=" * 90)
    print("Raw toy data")
    print("=" * 90)
    print(f"{model_a_name}: {model_a}")
    print(f"{model_b_name}: {model_b}")
    print(f"clusters:  {clusters}")
    print()

    print("=" * 90)
    print("Table 2 style: CLT standard errors / confidence intervals")
    print("=" * 90)
    print_table_2_style(model_a_name, summary_a)
    print_table_2_style(model_b_name, summary_b)
    print()

    print("=" * 90)
    print("Table 3 style: Clustered standard errors / confidence intervals")
    print("=" * 90)
    print_table_3_style(model_a_name, summary_a)
    print_table_3_style(model_b_name, summary_b)
    print()

    diff_means = difference_in_means(model_a, model_b, clusters, clusters)
    paired = paired_differences(model_a, model_b, clusters)

    print("=" * 90)
    print("Model comparison method 1: difference in means")
    print("=" * 90)
    print(f"Naive / CLT version:")
    print(f"  diff = {fmt_pct(diff_means['diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(diff_means['se_diff_clt'])}")
    print(f"  95% CI = {fmt_ci(diff_means['ci_diff_clt'])}")
    print()

    print(f"Clustered version:")
    print(f"  diff = {fmt_pct(diff_means['diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(diff_means['se_diff_cluster'])}")
    print(f"  95% CI = {fmt_ci(diff_means['ci_diff_cluster'])}")
    print()

    print("=" * 90)
    print("Model comparison method 2: paired instance-level differences")
    print("=" * 90)
    print(f"Question-level differences (A - B): {paired['diffs']}")
    print(f"Correlation(A, B) = {paired['correlation']:.3f}")
    print()

    print(f"Paired CLT version:")
    print(f"  mean diff = {fmt_pct(paired['mean_diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(paired['se_paired_clt'])}")
    print(f"  95% CI = {fmt_ci(paired['ci_paired_clt'])}")
    print()

    print(f"Paired clustered version:")
    print(f"  mean diff = {fmt_pct(paired['mean_diff'])}")
    print(f"  SE(diff) = {fmt_pct_paren(paired['se_paired_cluster'])}")
    print(f"  95% CI = {fmt_ci(paired['ci_paired_cluster'])}")
    print()

    print("=" * 90)
    print("Table 5 style: pairwise reporting")
    print("=" * 90)
    print_table_5_style(model_a_name, model_b_name, paired, clustered=False)
    print_table_5_style(model_a_name, model_b_name, paired, clustered=True)

main()

Key Takeaways

In this overview, we have learned a wide variety of tools for evaluating LLMs in an uncertainty-aware manner. To close, we will summarize what we’ve learned by outlining how each of these tools can be used when evaluating an LLM. In the simplest case, we can draw upon the CLT to derive a standard error and confidence interval along with our evaluation results. However, there are a few cases in which this approach will not yield valid results:

If the value of n is small, then the CLT-based standard error expression is overly confident. We can solve this issue by evaluating over a larger dataset or using another approach (e.g., the Bayesian methods outlined in [2]) that is better equipped to deal with smaller n.
If evaluation questions are not independent, then we can derive a cluster-adjusted standard error to account for the relationship between questions in our evaluation dataset.

When comparing models that are evaluated on the same questions (i.e., a paired setup), we can apply the same approaches over their question-level differences to provide a more statistically efficient estimate of which model performs better.

To reduce evaluation variance, we can use resampling, where K is selected such that E[σ_i^2 / K] ≪ Var(X). In some settings, token probabilities can be used to compute the expected score—or the probability of the ground truth answer— directly, thus reducing within-question variance. Such an approach has been shown in several concurrent works [1, 3, 4] to improve the stability of evaluation results. When creating an evaluation dataset, we can use power analysis—or just adopt the sample size formula from [1]—to determine the number of samples needed. We can also rearrange the sample size formula to find the minimum detectable effect δ that can be measured with a given dataset, which helps us to determine whether certain evaluations are even worth running at all.

New to the newsletter?

Hi! I’m Cameron R. Wolfe, Deep Learning Ph.D. and Senior Research Scientist at Netflix. This is the Deep (Learning) Focus newsletter, where I help readers better understand important topics in AI research. The newsletter will always be free and open to read. If you like the newsletter, please subscribe, consider a paid subscription, share it, or follow me on X and LinkedIn!

Subscribe now

Bibliography

[1] Miller, Evan. “Adding error bars to evals: A statistical approach to language model evaluations.” arXiv preprint arXiv:2411.00640 (2024).

[2] Bowyer, Sam, Laurence Aitchison, and Desi R. Ivanova. “Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints.” arXiv preprint arXiv:2503.01747 (2025).

[3] Madaan, Lovish, et al. “Quantifying variance in evaluation benchmarks, 2024.” URL https://arxiv. org/abs/2406.10229 (2024).

[4] Heineman, David, et al. “Signal and noise: A framework for reducing uncertainty in language model evaluation.” arXiv preprint arXiv:2508.13144 (2025).

The evaluation process is stochastic, so if we re-run the evaluation on this question multiple times we can observe a different result!

Previously, we introduced the sample variance, denoted as s^2. The sample standard deviation, denoted as s, is simply the square root of this expression.

The z-score refers to the realized value z of the random variable Z.

The reason we must assume variance σ^2 is finite is so that this expression is well-defined and exists. The standard deviation and standard error are not finite or meaningful when variance is infinite.

We write the normal distribution as N(x, y), where x is the mean of the normal distribution and y is the variance.

Here, the unconditional versions of s and x (i.e., without the i subscript) is used, so we are taking this expectation over the entire super-population.

More specifically, if we are reporting a metric in the range [0, 1] (e.g., an f1 score), then this formula cannot be used. These are fractional scores rather than binary scores with a value of either 0 or 1.

Practically, monotonicity is computed by taking the sequence of scores throughout training and measuring the Kendall rank correlation between this sequence and a perfectly monotonic sequence (i.e., a sequence in which the model’s performance increases at every checkpoint throughout training).

In the context of LLMs, a Cloze task refers to a fill-in-the-blank test where the LLM is given context (e.g., a paragraph or sentence) with missing tokens and expected to predict the missing information.

Rubric-Based Rewards for RL

Mon, 16 Feb 2026 10:33:41 GMT

(from [1, 2, 3, 5, 16])

Many of the recent capability gains in large language models (LLMs) have been a product of advancements in reinforcement learning (RL). In particular, RL with verifiable rewards (RLVR) has drastically improved LLM capabilities by using rules-based, deterministic correctness checks (e.g., passing the test cases for a coding problem) as a reward signal. Deterministic verifiers allow RLVR to provide a reliable reward signal that is more difficult to exploit compared to the neural reward models that were traditionally used for RL with LLMs. Such improved reliability has made stable RL training possible at scale, enabling the creation of powerful reasoning models with extensive RL training. Despite these benefits, verifiable rewards also have limitations—the same properties that make RLVR reliable confine it to domains with clean, automatically-checkable outcomes.

“While lots of efforts have been paid on RLVR, many high-value applications of LLMs, such as long-form question answering, general helpfulness, operate in inherently subjective domains where correctness cannot be sufficiently captured by binary signals.” - from [3]

Many important applications (e.g., creative writing or scientific reasoning) are not verifiable, making RLVR difficult to apply directly. To address this gap, we need reward signals that preserve RLVR’s scalability and reliability while still working in non-verifiable settings. Rubric-based rewards are a promising step in this direction: they decompose desired model behavior into structured, interpretable criteria that an LLM judge can evaluate and aggregate into a multi-dimensional reward. By creating prompt-specific rubrics that specify the evaluation process in detail, we can derive a more reliable reward signal from LLM judges and, therefore, use RL training to improve model capabilities even in highly subjective domains. For this reason, rubric-based RL training, which we will cover extensively in this overview, has become one of the most popular topics in current AI research.

From LLM-as-a-Judge to Rubrics

Before learning about how rubrics can be used for RL training, we need to build a background understanding of LLM-as-a-Judge and the different setups that can be used to evaluate open-ended problems with an LLM. At the end of the section, we will connect these ideas to rubrics and RL training by overviewing existing RL training techniques and how they are being extended to non-verifiable domains.

LLM-as-a-Judge

Prior to the LLM era, many evaluation metrics used for generative tasks (e.g., BLEU or ROUGE) were quite brittle. These metrics use n-gram matching (or embedding-based matching as in BERTScore) to compare a model’s output to a golden reference answer. Though this approach works relatively well, there are some fundamental problems that arise with reference-based metrics:

We always require a reference answer in order to perform evaluation.
Our output must be similar to this reference answer to perform well.

As we know, LLMs are capable of solving many different tasks, and most of these tasks are open-ended in nature. For example, we can use the same LLM to do creative writing or to answer medical questions. Although these problems are quite different, they do have a fundamental similarity: there are many ways to answer a question correctly. Traditional reference-based metrics struggle to handle such nuanced scenarios where divergence from a chosen reference answer does not imply that an output is bad. As a result, we have seen from several papers that reference-based metrics tend to correlate poorly with human preferences.

“LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.” - from [7]

LLM-as-a-Judge is a reference-free metric that prompts a foundation model to perform evaluation based upon specified criteria. Although it has limitations, this technique shows high agreement in many settings with human preferences and is capable of evaluating open-ended tasks in a scalable manner (i.e., minimal implementation changes are required). To evaluate a new task, we simply need to create a new prompt that outlines the evaluation criteria for this task. LLM-as-a-Judge was originally proposed after the release of GPT-4. This metric quickly gained popularity due to its utility and simplicity, culminating in the publication of an in-depth technical report [7]. Today, LLM-as-a-Judge is a widely-used technique in LLM evaluation; e.g., AlpacaEval, Chatbot Arena, Arena-Hard, and more.

LLM-as-a-Judge prompt formats (from [7])

Scoring setups. When performing evaluation with an LLM, there are a few different scoring setups that are commonly used (shown above):

Pairwise (preference) scoring: the judge is presented with a prompt and two model responses and asked to identify the better response.
Direct assessment (pointwise) scoring: the judge is given a single response to a prompt and asked to assign a score; e.g., using a 1-5 Likert scale.
Reference-guided scoring: the judge is given a golden reference response in addition to the prompt and candidate response(s) to help with scoring.

This list of scoring setups is not exhaustive, but most scoring setups for LLM-as-a-Judge use some variant or combination of the above techniques. For example, we can derive a pairwise score by scoring two responses independently and comparing their scores. In most cases, we also pair LLM-as-a-Judge with chain-of-thought prompting by asking the model to explain its evaluation process before providing a final score. Not only do such explanations make the evaluation process more interpretable, but they also improve the scoring accuracy of the LLM. Practically, implementing this change can be as simple as adding “Please provide a step-by-step explanation prior to your final score” to your prompt.

“We identify biases and limitations of LLM judges. However, we… show the agreement between LLM judges and humans is high despite these limitations.” - from [7]

Biases of LLM-as-a-Judge. Despite the effectiveness of LLM-as-a-Judge, this technique has several limitations of which we need to be aware. Fundamentally, the LLM judge is an imperfect proxy for human evaluation. By using a model for evaluation, we introduce several sources of bias into the evaluation process:

Position bias: the judge may favor outputs based upon their position within the prompt (e.g., the first response in a pairwise prompt).
Verbosity bias: the judge may assign better scores to outputs based upon their length (i.e., longer responses receive higher scores).
Self-enhancement bias: the judge tends to favor responses that are generated by itself (e.g., GPT-5 can assign higher scores to its own outputs).
Capability bias: the judge struggles with evaluating responses to prompts that it cannot itself solve.
Distribution bias: the judge may be biased towards certain scores in its scoring range (e.g., on a 1-5 Likert scale the judge may output mostly 3’s).

In addition to these biases, LLM judges are generally sensitive to the details of their prompt. Therefore, we should not simply write a prompt and assume proper evaluation. We must calibrate our evaluation process, collect high-quality human labels, and tune our prompt to align well with human judgment; see here.

There are several techniques we can adopt to combat scoring bias; e.g., in-context learning to better calibrate the judge’s score distribution, randomizing position and sampling multiple scores (i.e., position switching), providing high-quality reference answers, or using a jury of multiple LLM judges. For further details on LLM-as-a-Judge, a full overview of the topic is available at the link below.

LLM Evaluation with Rubrics

(from [15])

The prompts used for LLM-as-a-Judge in the above section are quite simple. We just describe the evaluation task at a high level and let the LLM judge output a score. However, scoring with a single, general prompt is not always the best approach. Prior work [15] has shown that we can significantly improve the reliability of LLM evaluation by:

Creating several per-criterion scoring prompts.
Providing a step-by-step description of the evaluation process.

Put simply, providing a granular scoring prompt is beneficial, and we need not stop here. We can create judge prompts targeted to each domain, task, or instance. Increasing the granularity of LLM-as-a-Judge in this way is where the idea of a rubric arises. A rubric is just a scoring prompt that provides a detailed set of criteria by which a response is evaluated; see below. In many cases, rubrics are prompt (or instance)-specific, meaning that a tailored rubric is created for each prompt-response pair being evaluated. These prompt-specific rubrics are often synthetically generated with an LLM—potentially with human intervention.

(from [1])

As we can see above, rubrics are usually checklist-style and separated into a list of distinct criteria. Each of these criteria captures a single quality dimension that can be evaluated with an LLM judge. Additionally, in many setups, weights are defined for each criterion to simplify the aggregation of criteria-level scores. Given the similarity of rubrics and vanilla LLM-as-a-Judge, the emergence of rubrics is hard to attribute to a single paper. Rather, the use of rubrics was a slow transition that occurred over time as LLM-as-a-Judge prompts became more granular.

“HealthBench is a rubric evaluation. To grade open-ended model responses, we score them against a conversation-specific physician-written rubric composed of self-contained, objective criteria. Criteria capture attributes that a response should be rewarded or penalized for in the context of that conversation and their relative importance.” - from [16]

In recent work, prompt-specific rubrics have become heavily used for evaluation in expert domains. For example, HealthBench [16] evaluates the quality of medical conversations according to physician-written rubrics that are specific to each conversation; see below. These rubrics focus on detailed and objective criteria—each associated with a weight—that can be verified with an LLM to yield a binary (pass or fail) score. MultiChallenge [17]—a multi-turn chat benchmark focused on tough edge cases like iterative editing, self-coherence, and instruction retention—develops prompt-specific rubrics to improve benchmark reliability, finding that rubrics improve agreement between expert humans and LLM judges.

(from [16])

In this overview, we will go beyond the use of rubrics for evaluation and instead focus on the application of rubrics for deriving a reward signal in RL training. One of the biggest risks when using LLM-as-a-Judge-derived rewards for RL training is reward hacking—LLM judges have known biases that can be exploited. However, we see above that detailed rubrics help to make the evaluation process more reliable, thus reducing risks associated with reward hacking.

RL with Verifiable (and Non-Verifiable) Rewards

Though RL training has long been used for LLMs, the role of RL in LLM training pipelines has become more central with the recent advent of reasoning models. In general, there are two common RL paradigms used for LLMs:

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a reward model trained on human preferences.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.

The main difference between RLHF and RLVR is how we assign rewards—RLHF uses a reward model, while RLVR uses verifiable rewards. Aside from this difference, both are online RL algorithms with a similar structure; see below. For details on the inner workings of RL optimizers, please see prior posts on PPO and GRPO.

Impact of RLVR. Recent progress in reasoning models has been driven largely by reinforcement learning with verifiable rewards (RLVR), which derives a reward signal during RL training from deterministic (or programmatic) rules that can be reliably checked (e.g., passing unit tests for code or matching a known numerical answer in math). Using rules-based rewards lowers our risk of reward hacking because we are using a hard rule to derive our reward rather than an LLM-based reward model. As a result, we can run larger-scale RL runs (i.e., over more data and for a larger number of iterations) with less risk of training instability.

Verifying a math problem with exact string matching

On the other hand, the same property that makes RLVR so powerful—the dependence on reliable, rules-based rewards—limits its applicability. Practically, we can only use RLVR on tasks with clean ground-truth labels1 that can be checked automatically. Luckily, several important tasks fall into this category (e.g., math and coding). However, there are many other tasks we would like to solve but are subjective and difficult to verify. Due to this need for verification, we see that LLMs have advanced quickly in certain verifiable capabilities, while gains on non-verifiable tasks have been less uniform. To solve this issue, we need to develop an approach for extending recent advances in RL training to non-verifiable tasks.

“In RLVR, rewards are derived from deterministic, programmatically verifiable signals—such as passing unit tests in code generation or matching the correct numerical answer in mathematical reasoning. While effective, this requirement for unambiguous correctness largely confines RLVR to domains with clear, automatically checkable outcomes.” - from [2]

Open-ended domains. We typically turn to RLHF for training LLMs in open-ended settings. RLHF replaces deterministic verifiers with a learned reward model trained on preference data; see below. Preference data can be collected for any domain by simply sampling multiple completions for each prompt and having a human (or model) select the better of the two. We can drastically increase domain coverage by using RLHF. However, relying upon preference data and reward models introduces notable difficulties and failure modes:

A large volume of preference data must be collected.
We lose granular control over the alignment criteria—preferences are expressed in aggregate over a large volume of data rather than via explicit criteria.
The reward model can overfit to artifacts (e.g., response length, formatting, etc.) and generally introduces more risk of reward hacking.

Basic structure of preference data

RLHF is a general technique, but it is usually used in practice for improving broad, subjective properties; e.g., helpfulness, harmlessness, or style. For complex, open-ended tasks, the reward signal tends to be multi-dimensional. Traditional reward modeling captures these quality dimensions via a single preference label, which eliminates our ability to specify quality dimensions at a more granular level. One could collect criterion-level preferences to solve this issue, but doing so requires training (and maintaining) separate reward models per criterion and increases the volume of data that must be collected. A natural alternative is to make evaluation dimensions explicit by using a rubric to ground the reward in structured, interpretable criteria rather than a single judgment.

Rubrics-as-Rewards. The idea of deriving a reward from a rubric-based LLM judge is one of the current frontiers of RL research—it presents an opportunity to extend RLVR to arbitrary open-ended tasks. Although this area of research is still nascent and evolving quickly, the idea of using rubrics for RL is not new! Similar ideas have already been proposed for better handling the safety alignment of LLMs. During LLM alignment, we have a detailed list of safety specifications that describe the desired behavior of the model. These specifications are changed frequently as new needs or failure cases arise in practice. The dynamic nature of safety criteria makes applying a standard RLHF approach difficult—the preference data must be adjusted or re-collected each time that our criteria change.

(from [14])

To avoid the need for constant data collection, methods like Constitutional AI [13] and Deliberative Alignment [14] show that a reliable reward signal can be derived directly from the safety specifications themselves. More specifically, we can provide safety criteria as input to a strong reasoning model that is used to generate data or evaluate model outputs according to these criteria. Due to the strong instruction following capabilities of frontier-level reasoning models, this approach is capable of providing a reliable reward signal for safety training.

“Collecting and maintaining human data for model safety is often costly and time-consuming, and the data can become outdated as safety guidelines evolve with model capability improvements or changes in user behaviors. Even when requirements are relatively stable, they can still be hard to convey to annotators. This is especially the case for safety, where desired model responses are complex, requiring nuance on whether and how to respond to requests.” - from [9]

This approach avoids the need to re-collect data as criteria change. Rather, we just maintain a clear, itemized list of safety criteria—basically a safety rubric—that can be provided as input to the alignment system. Instead of collecting data, we focus on creating a “constitution” that dictates the behavior of our model. Once this constitution is available, we rely upon an LLM judge to apply the necessary supervision for achieving this desired behavior. This approach is both dynamic and interpretable, but it can only be applied in domains where the LLM judge is known to perform well. Extending similar techniques to arbitrary domains, which we will explore for the remainder of this post, is a non-trivial research problem.

Using Rubrics for RL

We now have a detailed understanding of LLM-as-a-Judge, rubrics, and their application to RL training. Next, we will extend these ideas by overviewing a broad collection of recent papers that study the application of rubrics to RL training. Many papers have been written on this topic in quick succession. As we will see, however, much of this work shares a similar flavor. Slowly, rubric-based RL has become more effective across a wider variety of tasks, enabling powerful reasoning models to achieve impressive gains even in non-verifiable domains.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [1]

“Rather than using rubrics only for evaluation, we treat them as checklist-style supervision that produces reward signals for on-policy RL. Each rubric is composed of modular, interpretable subgoals that provide automated feedback aligned with expert intent. By decomposing what makes a good response into tangible, human-interpretable criteria, rubrics offer a middle ground between binary correctness signals and coarse preference rankings.” - from [1]

RLVR is effective in verifiable domains with a clear correctness signal like math or coding, but there are many domains in the real world that are not strictly verifiable (e.g., science or health). For these domains, we need a more versatile reward mechanism—such as an LLM judge or reward model—that can handle open-ended problems that lack a clear or verifiable answer. Going beyond a vanilla LLM-as-a-Judge setup, we see in [1] that prompting the LLM judge with a rubric composed of structured, instance-specific—meaning unique to each prompt—criteria benefits the model’s performance in on-policy RL training.

Creating rubrics. Rubrics in [1] are checklist-style and cover multiple criteria that are specific to each prompt being scored. The checklist for a rubric contains K total criteria c_i, each with a corresponding weight w_i. A criterion is defined as a binary correctness check that can be validated using an LLM judge. We can also recover an RLVR setup by assuming K = 1 and letting c_1 be a deterministically verifiable reward signal with weight w_1 = 1.0.

Explicit versus implicit rubric aggregation

We refer to this approach of using rubrics to generate a reward signal for RL as Rubrics-as-Rewards (RaR). There are two approaches we can use to evaluate a rubric and derive a reward for RL training (shown above):

Explicit aggregation: each criterion is independently evaluated using an LLM judge, and the final reward is derived by summing and normalizing the weighted score of each criterion.
Implicit aggregation: all criteria along with their weights are passed to an LLM judge, which is asked to derive a final reward that considers all information.

Explicit aggregation provides more granular control over the weight of each criterion, which can aid in interpretability but requires tuning and can be fragile. In contrast, the implicit aggregation approach delegates the reward aggregation process—including handling the weights of each criterion—to the LLM judge.

(from [1])

Generating rubrics. All instance-specific rubrics used in [1] are generated by an LLM; see above. When generating rubrics, guiding principles are provided to the model with respect to how rubrics should be created. Namely, rubrics must i) be grounded in guidance from human experts, ii) be comprehensive (i.e., span many dimensions of quality), iii) specify per-criterion importance (e.g., factuality is more important than style), and iv) use self-contained criteria (i.e., criteria should not depend on one another). Given these desiderata and a golden (expert-curated) reference answer for a prompt, the LLM then generates a rubric that includes:

7-20 self-contained criteria.
A numeric or categorical (i.e., essential, pitfall, important, or optional2) weight for each of these criteria.

Numeric weights provide fine-grained control over criterion importance, but categorical weights, each of which are mapped to a numerical score, are more interpretable—both for humans and the LLM—which leads them to be used for experiments in [1]. Once generated, a rubric can be used as a reward function by passing it to an LLM judge and performing explicit or implicit aggregation.

“We generate rubrics using OpenAI’s o3-mini and GPT-4o, conditioning generation on reference answers from the underlying datasets to approximate expert grounding. The resulting collections—RaR-Medicine and RaR-Science—are released for public use.” - from [1]

Experimental settings. In [1], authors see rubrics as an opportunity to provide flexible, scalable, and interpretable reward signals for RL in real-world domains that go beyond verifiable problems like code and math. Moving in this direction, two non-verifiable domains are considered in [1]: medicine and science. Prompts and rubrics used for RL in [1] are sampled from a mixture of public datasets, such as NaturalReasoning, SCP-116K, and GeneralThought-430K. This data is further curated to create two datasets for RaR training in [1]:

RaR-Medicine: ~20K prompts focused on medical reasoning with instance-specific rubrics generated with GPT-4o.
RaR-Science: ~20K prompts curated to align with the problem categories from GPQA-Diamond with instance-specific rubrics generated by o3-mini.

All experiments use Qwen-2.5-7B as a base model and train with GRPO. Rewards are assigned using GPT-4o-mini with the instance-level rubrics described above. The proposed technique in [1], referred to as RaR-Implicit, uses LLM-generated, instance-specific rubrics with implicit aggregation as a reward signal. Several rubric-free and fixed-rubric baselines are also considered:

Base models: Qwen-2.5-7B and Qwen-2.5-7B-Instruct models are evaluated with no additional training.
Direct Assessment Judge: an LLM judge provides a direct assessment score for each response on a 10-point Likert scale—this is a standard LLM-as-a-Judge setup that does not use a granular, instance-specific rubric.
Reference-Based Judge: same as above, but the LLM judge is given a golden reference answer as context when generating a score.
RaR-Predefined: a fixed set of generic rubrics are used for all prompts with explicit aggregation and uniform criteria weights.
RaR-Explicit: instance-specific rubrics are used, but all criteria receive fixed weights based on their categorical importance label.

All models are evaluated on the GPQA-Diamond (Science) and HealthBench (medicine) benchmarks. For some smaller ablation experiments, RL training is performed on the training set of HealthBench rather than RaR-Medicine.

(from [1])

Do rubrics provide useful rewards? Across all experiments in [1], we see that using structured, rubric-based rewards during RL training is beneficial. Rubric-based rewards are especially impactful when using smaller LLM judges for RL training and are found to reduce variance in reward signals across different sizes of LLM judges. As shown above, rubric-based approaches outperform all rubric-free methods aside from the reference-based LLM judge, relative to which we only see marginal gains from rubrics. However, rubrics are found to yield a more notable gain over reference-based LLM judge rewards in later experiments that train on HealthBench; see below. We also see that implicit aggregation tends to outperform explicit aggregation by a small (but consistent) margin.

“Rubric-guided training achieves strong performance across domains, significantly outperforming Likert-based baselines and matching or exceeding the performance of reference-based reward generation.” - from [1]

(from [1])

These experiments also highlight the necessity of expert-curated references for generating rubrics—performance noticeably deteriorates without references, indicating purely synthetic rubrics are suboptimal. Predefined or generic rubrics are also found to perform quite poorly, indicating that prompt-specific criteria are useful for deriving high-quality rubrics. These best practices for creating better rubrics are also evaluated beyond their impact on RL training. In [1], authors show that rubrics created via their proposed approach have noticeably higher levels of agreement with preference annotations from human experts; see below.

(from [1])

Reinforcement Learning with Rubric Anchors [2]

“The success / failure hinges tightly on the diversity, granularity, and quantity of the rubrics themselves, as well as on a proper training routine and meticulous data curation.” - from [1]

Authors in [2] continue studying the application of RL to open-ended tasks using rubric-based rewards. They scale the rubric creation process to produce a dataset of ~10K rubrics curated by humans, LLMs, or a combination of both. Building on this dataset, a practical exposition of rubric-based RL is provided, ultimately arriving at a functional RaR training framework called Rubicon. Interestingly, simply increasing the number of rubrics—whether generated synthetically or with human assistance—yields only marginal gains. Instead, we must carefully curate high-quality rubrics, suggesting that the success of RaR heavily depends upon both rubric quality and the quality of the underlying training dataset.

Rubric system. Instead of using strictly instance-level rubrics, multiple scopes are considered in [2], including instance, task, and dataset-level rubrics. When generating data, the system in [2] (shown below) starts by constructing the rubric first. Data is synthesized only after the rubric is created so that it explicitly matches the rubric. Then, the combination of rubric and data is used for both RL training and evaluation. Tasks in [2] are selected according to the asymmetry of verification—verifying a candidate output should be much easier than generating it.

(from [2])

To ensure rubric quality, authors run dedicated ablation experiments for every set of rubrics that is generated to measure their impact on the training process. Each rubric is comprised of K criteria C = {c_1, c_2, …, c_K}. An example of a rubric created for evaluating open-ended or creative tasks is provided below. After evaluating each of these criteria, we are left with a multi-dimensional reward vector that can be aggregated to yield a final reward.

(from [2])

As a baseline, criteria-level rewards can be aggregated via a weighted average, but non-linear dependencies may exist between criteria that make a weighted average suboptimal. For this reason, authors in [2] consider the following advanced strategies for criteria aggregation:

Veto Mechanisms: failing on a critical dimension overrides any reward from other dimensions.
Saturation-Aware Aggregation: over-performing on a single dimension yields diminishing returns relative to a balanced reward across dimensions.
Pairwise Interaction Modeling: criteria are modeled together to capture inter-criteria relationships (i.e., synergistic or antagonistic effects).
Targeted Reward Shaping: rewards in high-performance regions are amplified to better capture differentials and avoid scores becoming compressed.

Training strategy. The data used in [2] is derived from a proprietary post-training corpus with ~900K examples. Prior to any training, offline difficulty filtering is performed to remove any examples on which the base model performs too poorly or already performs well3. From here, RL training progresses in two phases, each with a different curriculum:

The first phase focuses on instruction-following and programmatically-verifiable tasks to teach the LLM how to properly handle constraints.
The second phase extends the training process to more open-ended and creative tasks with a higher level of subjectivity.

While the first phase primarily relies upon static rubrics and verifiers, we must use reference-based rubrics—often with instance-specific criteria—for the second phase. Granular rubrics help to provide a more reliable reward signal on tasks that are highly subjective. This multi-stage training framework aims to progressively cultivate the capabilities of the model. When training jointly on all tasks, authors observe a “seesaw effect”—joint training actually reduces model performance relative to forming a multi-stage curriculum; see below.

(from [2])

Reward hacking is one of the biggest risks in a RaR setup. Whereas verifiable rewards are deterministic, neural reward models can be exploited, and the likelihood of our policy finding such an exploit increases in large-scale RL runs4. The Rubicon approach proposed in [2] combats reward hacking by performing an offline analysis of rollout data. After the first phase of RL training, authors examine rollouts that yield abnormally high rewards and create a basic taxonomy of recurring reward hacking patterns that are discovered. From this taxonomy, a specific rubric is created for preventing reward hacking—this rubric can also be iteratively refined over time. The addition of a reward hacking rubric improves training stability (i.e., avoids collapse into a reward-hacked state) and allows RL training to be conducted for a much larger number of training steps.

“Applying RL with rubrics from different task types could create conflicting objectives, leading to performance trade-offs — a phenomenon we refer to as the seesaw effect… training exclusively with instruction-following rubrics improves compliance but reduces creativity, while training exclusively with creativity and empathy rubrics enhances open-ended responses but harms strict adherence… These results suggest that simply combining all rubric types in a single RL run is likely to intensify such conflicts. To overcome this, we adopt a multi-stage RL strategy.” - from [2]

Rubicon-preview is a Qwen-3-30B-A3B base model that is finetuned in [2] using the Rubicon framework. This model excels in open-ended and humanities-related benchmarks. For example, we see below that Rubicon-preview achieves an absolute improvement of 5.2% compared to the base model on various instruction following, emotional intelligence, and writing benchmarks. Notably, Rubicon-preview also outperforms DeepSeek-V3-671B on most of these tasks, where an especially significant performance boost is observed on writing tasks.

(from [2])

The performance benefits of Rubicon-preview are also achieved with shocking sample efficiency—the model is only trained on ~5K data samples. By using an RaR approach, authors are also able to granularly control the style or voice of the resulting model. More specifically, a few case studies are presented in [2] that demonstrate the use of rubrics to guide the LLM away from the didactic tone that is common of chatbots and towards a human-like tone with more emotion.

(from [2])

Going further, creatively-oriented RaR training does not seem to damage the LLM’s general capabilities. As shown above, Rubicon-preview performs on par with or better than the original base model across a wide scope of benchmarks. Such a result should not come as a surprise given the natural ability of RL to avoid forgetting and retain the prior knowledge or skills of an LLM; see here.

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [3]

We’ve seen several papers that study the use of rubrics for RL training, where rubrics are generated—possibly with human intervention—and evaluated by an off-the-shelf LLM. Instead of focusing upon the downstream application of rubrics in RL, authors in [3] specifically analyze the rubric generation and evaluation process. To facilitate this study, an open dataset of prompt-rubric pairs, called OpenRubrics, is created for training both rubric generation models and rubric-based reward models. As we learned in [2], RaR training is highly dependent upon rubric quality. Creating better rubrics—and reducing the amount of human supervision in this process—makes RaR training more scalable and effective.

The rubric structure used in [3] is consistent with prior work. Namely, each rubric is comprised of K criteria, where each criterion is a rubric description that specifies one aspect of response quality. Two types of criteria are considered:

Hard rules: explicit or objective constraints (e.g., length or correctness).
Principles: higher-level qualitative aspects (e.g., reasoning soundness, factuality, or stylistic coherence).

Unlike prior work, rubrics in [3] do not use per-criterion weights and are used for pairwise comparison of two completions—as opposed to direct assessment. For a rubric R = {c_1, …, c_K} and two responses y_1 and y_2 to the same prompt x, we want our rubric-based reward model to provide a binary preference label (i.e., y_1 > y_2 or y_1 < y_2) by reasoning over the rubric criteria.

“We prompt the LLM to generate two complementary types of rubrics: hard rules, which capture explicit and objective constraints specified in the prompt, and principles, which summarize implicit and generalizable qualities of strong responses. This design allows the rubrics to capture both surface-level requirements and deeper dimensions of quality. Although hard rules are typically straightforward to extract, the principles are more subtle and require fine-grained reasoning.” - from [3]

Building OpenRubrics. The prompts and preference labels used for creating OpenRubrics are sourced from several public datasets (e.g., UltraFeedback, MegaScience, Medical-o1, instruction following data from Tulu-3, and more). For each of these datasets, preference data is obtained via domain-specific post-processing of the existing data. For example, the highest and lowest scoring responses form a preference pair for UltraFeedback, while for MegaScience and Medical-o1 completions are generated with a pool of LLMs and scored via a jury of different reward models to obtain preference pairs; see below.

(from [3])

Once this preference data is available, rubrics are generated using two key strategies proposed in [3] (shown above):

Contrastive Rubric Generation (CRG): an instruction-tuned LLM is provided both a prompt and a preference pair and asked to produce discriminative evaluation criteria by contrasting the chosen and rejected responses.
Rubric Filtering: rubrics are filtered by prompting an LLM to choose the preferred response given a preference pair and rubric as input and only retaining rubrics that yield agreement with human-provided preference labels (i.e., preference label consistency)5.

CRG and rubric filtering aim to create rubrics that are both prompt-specific and aligned with human preference examples, allowing them to serve as useful anchors for reward modeling. The result of this rubric generation and filtering approach is OpenRubrics, the key statistics of which are summarized in the plots below.

(from [3])

“After collecting the rubrics-based dataset, we proceed to develop a rubric generation model that outputs evaluation rubrics and a reward model Rubric-RM that generates final preference labels.” - from [3]

Rubric-RM. OpenRubrics provides a high-quality dataset of preference pairs and rubrics. In [3], this data is used to train two kinds of models (both of which are based upon Qwen-3-4B or Qwen-3-8B):

A rubric generation model—trained via SFT—that, given a prompt, can produce a discriminative rubric for predicting preference labels.
A reward model—also trained via SFT6—called Rubric-RM that can predict rubric-guided, pairwise preferences.

At inference time, these two models are used in tandem. Given a prompt, we first use the rubric generation model to produce our rubric. Then, Rubric-RM ingests this rubric, the prompt, and a pair of completions to generate a final preference prediction. We can also use majority voting (i.e., running this pipeline several times and taking the most frequently outputted score) to improve accuracy. Although using a two-stage pipeline increases inference costs, authors mention that costs can be decreased significantly by caching generated rubrics.

(from [3])

Comparison to other reward models. Rubric-RM is compared to a wide variety of other reward models and LLM-as-a-Judge approaches on several key evaluation benchmarks; see above. Rubric-RM tends to outperform similarly-sized baselines; e.g., the 8B variant gets 70.1% average accuracy, whereas the strongest 7B-scale reward model (RM-R1-7B) has an average accuracy of only 61.7%. These results are made even stronger with the use of majority voting. When comparing to the Qwen-3 base models, we see a noticeable uplift in preference scoring accuracy for Rubric-RM, highlighting the effectiveness of the finetuning strategy in [3].

“Rubric-RM excels on benchmarks requiring fine-grained instruction adherence… This demonstrates that rubrics capture nuanced constraints better than scalar reward models.” - from [3]

The gains from Rubric-RM are most pronounced on instruction-following tasks, which means that the rubrics in [3] work well for explicit evaluation criteria. On the other hand, this finding indicates less impact for subjective criteria, revealing that improving rubric supervision for open-ended tasks is still an open problem.

Application to post-training. Beyond evaluating Rubric-RM on reward modeling benchmarks, we can also measure the model’s downstream impact by using it as a reward signal in LLM post-training. Downstream evaluations in [3] only consider instruction following tasks (i.e., IFEval, InfoBench, and IFBench)—likely because this is the domain on which Rubric-RM excels—and use DPO for preference tuning. Rubric-RM is found to yield a boost over other reward models; see below.

(from [3])

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [4]

“Deep research (DR) models aim to produce in-depth, well-attributed answers to complex research tasks by planning, searching, and synthesizing information from diverse sources” - from [4]

Rubrics are studied in the context of deep research (DR) agents in [4]. A DR agent is an LLM that is taught to perform multi-step research and produce long-form answers—or surveys—that answer a query with detailed information and citations. This idea was popularized by Gemini DR and followed shortly after by DR agents from OpenAI, Anthropic, and more. Though many closed models support DR mode, open models are behind in this area: most open DR models are either prompt-based or trained on short-form, search-intensive QA tasks (i.e., not reflective of frontier DR agents) with RLVR. To solve this, authors in [4] train Dr. Tulu-8B—a fully-open7 LLM agent for long-form, open-ended DR tasks—using a novel online RL technique that evolves instance-level rubrics alongside the policy throughout training.

(from [4])

Definition of DR. Before describing Dr. Tulu, we need to understand the basic mechanics of DR agents. Details of closed DR agents are not publicly disclosed, but we can discern from using these agents that they:

Heavily rely on search tools to ground their answers in external knowledge.
Output long answers (i.e., basically survey papers) with many citations.

Authors in [4] use these observations to formalize an action space for DR agents; see below. In this formulation, a DR agent has the ability to i) think, ii) call a set of search tools, iii) provide a final answer, and iv) insert citations into the final answer. For all actions, any context that is output (e.g., thinking traces or tool outputs) is just concatenated to the sequence being processed by the DR agent. The DR agent itself is just an LLM that performs tool use in this action space.

(from [4])

Rubrics for DR. Evaluating a DR agent is a tough task. These agents generate lengthy outputs with detailed information, so there are many ways that an output could be good or bad—a static or predefined set of rubrics will not capture the detailed quality dimensions required for this task. Additionally, evaluation varies depending on the query (e.g., asking for a vacation plan versus an AI research survey).

Given that most DR queries are knowledge-intensive, we must also verify key information against known world knowledge. For this reason, synthetically generating instance-specific rubrics with an LLM—as in [1, 3]—is insufficient. This approach relies upon the parametric knowledge of the LLM rather than grounding on external knowledge that can be used to verify correctness. Ideally, we should ground the evaluation process in knowledge retrieved via search tools rather than relying on the (incomplete) parametric knowledge of an LLM.

(from [4])

Evolving rubrics. To address the unique considerations of DR tasks, Dr. Tulu is trained using a modified rubric-based RL technique, called Reinforcement Learning with Evolving Rubrics (RLER), that derives a reward from instance-specific rubrics that i) evolve alongside the policy during training and ii) are grounded in knowledge from the internet; see above. Similarly to prior work, rubrics are defined as a set of weighted criteria. Each of these criteria can be scored with a separate LLM judge to derive a final score as shown below. This formulation matches the explicit aggregation strategy proposed in [1].

During training, we have a buffer of rubrics for each prompt that stores a set of evolving rubrics specific to that prompt. Within this buffer, we designate certain rubrics as active, and these active rubrics are used for deriving the reward in the current training iterations. To initialize the buffer, we first create a set of search-based rubrics using an LLM with access to search tools. These initial rubrics are used persistently—meaning they are always included in the active set of rubrics—throughout training. At each training step, we prompt an LLM to generate a set of new (or evolving) rubrics given a prompt, a group of corresponding rollouts, and the set of active rubrics for that prompt as context; see below. Specifically, there are two types of rubrics that can be created by the LLM:

Positive Rubrics: capture strengths of new relevant knowledge explored by the current policy but not yet present in any rubric.
Negative Rubrics: address common undesirable behaviors of the current policy (e.g., reward hacking).

Prompt for generating evolving rubrics (from [4])

During RLER, the number of evolving rubrics can become large. To avoid this, we maintain a subset of active rubrics—always containing the initial persistent rubrics—via an explicit management strategy that filters and ranks rubrics based on their discriminative power. To measure a rubric’s discriminative power, we rely upon the group of completions created for advantage computation in GRPO. During each policy update, the group of completions for a given prompt is scored using all active rubrics for that prompt, and rubrics with zero reward variance (i.e., no discriminative value) are removed. Remaining rubrics are ranked in descending order based on the standard deviation of rewards across the group. Only rubrics with the top K standard deviations—and persistent rubrics—remain active.

“Instead of trying to exhaustively enumerate all possible desiderata, our method generates rubrics tailored to the current policy model’s behaviors, offering on-policy feedback the model can effectively learn from. Furthermore, the rubrics are generated with retrieval, ensuring it can cover the needed knowledge to assess the generation.” - from [4]

The evolving rubrics in [4] are grounded in external knowledge and allow the reward for RL to adapt to the current state of our policy. As the model discovers new behaviors (e.g., a reward hack), these changes can be identified and captured in a new or modified rubric to maintain training fidelity. For this reason, we do not need to create a rubric a priori that exhaustively captures all desiderata for evaluation, which is difficult for DR tasks. Rather, this system can observe policy behavior and automatically incorporate key trends into new rubrics.

(from [4])

The rubric evolution process is found in [4] to have interesting characteristics, such as producing rubrics with measurably higher levels of specificity or even negative rubrics that penalize specific behaviors within the LLM; see above.

Dr. Tulu-8B is trained using a two-stage approach that includes a cold start SFT phase and online RL with GRPO. The Qwen-3-8B base model used in [4] does not yet possess the necessary atomic skillset (e.g., proper planning or citations) for solving DR tasks. If we were to begin RL training directly from this model, most rollouts would be of low quality, and the training process would likely struggle to efficiently discover high-reward solutions via exploration. To solve this issue, a cold start SFT phase is performed in [4] prior to RL training by sampling DR trajectories from a strong teacher model—in this case GPT-5 with a detailed system prompt describing the DR task—for supervised training. By finetuning the Qwen-3 base model on these trajectories, we allow the model to quickly learn a better initial policy for searching, planning, and citing sources prior to online RL. Given that most open DR agents are trained on short-form QA tasks, these supervised trajectories, which are openly available, are by themselves a useful artifact.

After cold start SFT, we perform online RLER using GRPO (with token-level loss aggregation) as the RL optimizer. Efficiently generating rollouts for online RL with a DR agent is a non-trivial systems problem due to output length and the frequency of tool calls. Rollouts are already the largest bottleneck in RL. Adding tool calls into the mix (i.e., “agentic” rollouts) makes this problem even worse. To improve efficiency, authors in [4] use one-step asynchronous RL training. Rollout generation and policy updates are performed at the same time, but policy updates are performed on rollouts from the prior training step. Additionally, tool calls are executed immediately to overlap generation and tool calling as much as possible.

“Tool requests are sent the second a given rollout triggers them, as opposed to waiting for the full batch to finish… Once a tool call is sent, we place that given generation request to sleep, allowing the inference engine to potentially continue to work on generating other responses while waiting for the tool response. This results in the generation and tool calling being overlapped wherever possible.” - from [4]

One other difficult aspect of RL training with a DR agent is the output lengths—generating long outputs (obviously) increases the time taken to produce a rollout. Plus, there can be high variance in output lengths. To mitigate this issue, sample packing is adopted during RL training, which improves efficiency by combining multiple outputs into a single, fixed length sequence. Finally, a few additional sources of heuristic rewards are used on top of RLER to encourage correct formatting and sufficient usage of search and citation tools by the agent.

(from [4])

Performance and efficiency. Dr. Tulu-8B is evaluated on several DR benchmarks (ScholarQA, HealthBench, ResearchQA, and DeepResearchBench), where we see that it substantially outperforms other open DR agents—even those that are larger (e.g., Tongyi-DR-30B-A3B)—and frequently matches the performance of the top proprietary systems. Additionally, Dr. Tulu-8B is smaller and cheaper compared to other systems. Notably, Dr. Tulu-8B is up to three orders of magnitude cheaper than OpenAI DR in some cases; e.g., costs are reduced from $1.80 per query to $0.0019 per query on ScholarQA8. Much of these savings come from the ability to call the correct tools and avoid excessive tool usage that drastically increases API costs. Not only does Dr. Tulu-8B generally make fewer tool calls, but authors observe in [4] that the model heavily calls free paper search tools for academic benchmarks while only using paid web search tools for more general queries.

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [5]

Rubrics are helpful for performing granular evaluation, assuming that the rubric we are using is of high quality. To curate a high-quality rubric, we rely upon human annotators or synthetic generation. Relying on human oversight would make it difficult to scale rubric curation. On the other hand, synthetic rubrics are scalable, but static models are often used to generate and evaluate these rubrics, which limits adaptation to new domains. To make this process more dynamic, a joint training procedure for rubric generation and evaluation is proposed in [5].

“Rubric-ARM [is] a framework that jointly optimizes a rubric generator and a judge using RL from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates.” - from [5]

Rubric-ARM. There are two models being trained in this framework: a rubric generator and an LLM judge. These models are trained with an alternating RL framework that switches between training each model. This approach, called Rubric-ARM, jointly optimizes the generator’s ability to create a rubric and the judge’s ability to predict human-aligned preference scores given a rubric as input. By learning these components together (i.e., instead of using separate training pipelines), we allow them to co-evolve and reinforce each other throughout training.

A rubric is defined in [5] as a set of evaluation criteria that are conditionally generated given a prompt as input—no explicit per-criterion weights are defined. Given a rubric sampled from the rubric generator, the objective—for both the rubric generator and the judge—is to maximize the preference accuracy of scores output by the judge. Notably, Rubric-ARM only considers preference data. The LLM judge is trained to predict a preference label (i.e., instead of performing direct assessment) given a prompt and two possible completions as input.

(from [5])

Training pipeline. Prior to RL, Rubric-ARM performs a cold-start SFT phase that trains both the rubric generator and the judge over a synthetic dataset curated from a variety of open data sources (e.g., UltraFeedback, Magpie, and more). From here, we begin the alternating RL procedure that switches between training the rubric generator or judge while keeping the other fixed. Alternating the learning process gives each component a clear training signal by keeping the other fixed.

“To ensure stable joint optimization, Rubric-ARM employs an alternating training strategy that decouples the learning dynamics while preserving a shared objective. Training alternates between (i) optimizing the reward model with a fixed rubric generator to align with target preference labels, and (ii) optimizing the rubric generator with a fixed reward model to produce discriminative rubrics that maximize prediction accuracy.” - from [5]

At each training iteration t, we sample a batch of preference data. A rubric is then sampled—and cached for future use—with the rubric generator for each prompt in the batch. First, the rubric generator is kept fixed, and we perform RL training (with GRPO) to update the judge. The reward is defined as a sum of:

Preference accuracy: a binary score indicating whether the predicted label matches the ground-truth label.
Correct formatting: a heuristic that checks the judge’s trajectory for expected components (i.e., addressing each rubric criterion, providing per-criterion explanations, and finishing with an overall justification and decision).

Rubrics are generally sampled once and used for multiple judge optimization steps. After training the judge, we then freeze the judge’s weights and update the rubric generator. The rubrics used during this phase are cached, as the rubric generator was not trained during the prior phase. To train the rubric generator, we only use a preference accuracy reward based on whether the fixed judge is able to predict a correct preference label given the generated rubric. We learn from experiments in [5] that the optimization order is important. Training the rubric generator before the judge leads to noticeably degraded performance.

“Early-stage exploration by the rubric generator can dominate the learning dynamics. To mitigate this, we first stabilize the reward model under fixed rubrics before optimizing the rubric generator. This alternating schedule reduces variance and ensures robust optimization.” - from [5]

Application to post-training. The rubric generator and judge obtained from Rubric-ARM can also be applied to LLM post-training. Beginning with a set of prompts, we do the following:

Sample a rubric for each prompt with the rubric generator.
Sample two completions for each prompt using our current policy.
Score the completions using the judge with the above rubric9.
Perform DPO using preference data created with the above steps.

We are not restricted to offline training either! The above steps can easily be generalized to a semi-online DPO setup by regularly sampling new, on-policy completions and performing DPO training in phases to increase the freshness of preference data. We can even perform fully-online RL by modifying the above steps with a pairwise RL approach [6]. More specifically, we do the following for each prompt:

Sample a deterministic (baseline) completion with greedy decoding.
Sample a group of rollouts using a normal sampling procedure.

Once we have these completions, we use them to derive a direct assessment reward from the pairwise comparisons predicted by the LLM judge. To do this, Rubric-ARM creates preference pairs between each rollout in the group and the baseline completion. Then, our reward is defined as whether Rubric-ARM correctly predicts the greedy baseline as the rejected completion; see below.

Computing a reward for online RL from pairwise preferences (from [5])

“Rubric-ARM outperforms strong reasoning-based judges and prior rubric-based reward models, achieving a +4.7% average gain on reward-modeling benchmarks, and consistently improves downstream policy post-training when used as the reward signal.” - from [5]

How does this perform? Rubric-ARM is trained on the general-domain portion of OpenRubrics [3]. Both the rubric generator and LLM judge use Qwen-3-8B as a base model, and a two-stage rubric judging process—including generating and evaluating the rubric—is used at inference time. Rubric-ARM is compared to several open and closed LLM judges, as well as an SFT baseline trained on the same data (i.e., the Rubric-RM model [3]). Metrics on a wide variety of alignment-related reward modeling benchmarks are provided below. As we can see, Rubric-ARM outperforms all other open models and matches or exceeds the performance of most closed judges. Additionally, Rubric-ARM improves the performance of the SFT baseline by 4.8% absolute, indicating that alternating RL is helpful for discovering more discriminative rubrics and improving judge performance.

(from [5])

Rubric-ARM is also tested on WritingPreferenceBench, an out-of-distribution benchmark, where we see that the system generalizes well to other domains and continues to outperform baselines even on a very open-ended task (i.e., creative writing). Authors also run several ablation experiments, where we learn that:

The optimization order for alternating RL is important; i.e., training the rubric generator first (instead of the judge) degrades preference accuracy by 2.4% with the largest regressions seen on instruction-following tasks.
Removing the format reward used for the judge is harmful; i.e., LLM judges trained with only correctness rewards perform 2.2% worse than those trained on a combination of correctness and format rewards.

Similar results hold true when Rubric-ARM is used for LLM post-training. Rubric-ARM yields a boost in policy performance in both online and offline alignment scenarios, and policies trained with Rubric-ARM outperform those trained with other open models. Of the methods that are considered, iterative DPO with Rubric-ARM yields the best results, indicating that Rubric-ARM excels in creating high-quality preference data for LLM post-training; see below.

(from [5])

Conclusion

Rubrics decompose desired LLM behavior into self-contained criteria that an LLM judge can score and then aggregate into an overall evaluation or reward. Put simply, rubrics are a practical middle ground between deterministic verifiers and preference labels that allow us to extend RLVR beyond verifiable domains while retaining granular control over output quality. The work we have studied suggests rubric rewards are most reliable when criteria are specific (often instance-level), grounded (via references or retrieval), and carefully curated (usually with human oversight). In more advanced setups, rubrics can also be updated based on on-policy behavior, allowing the rubric to adapt instead of becoming stale or exploitable. Despite promising results, key challenges remain; e.g., reducing reliance on human supervision and improving robustness in highly subjective domains. As reasoning models and LLM judges become more capable, however, rubric-based RL is becoming a viable and general tool across a wider variety of domains.

New to the newsletter?

Subscribe now

Bibliography

[1] Gunjal, Anisha, et al. “Rubrics as rewards: Reinforcement learning beyond verifiable domains.” arXiv preprint arXiv:2507.17746 (2025).

[2] Huang, Zenan, et al. “Reinforcement learning with rubric anchors.” arXiv preprint arXiv:2508.12790 (2025).

[3] Liu, Tianci, et al. “Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.” arXiv preprint arXiv:2510.07743 (2025).

[4] Shao, Rulin, et al. “Dr tulu: Reinforcement learning with evolving rubrics for deep research.” arXiv preprint arXiv:2511.19399 (2025).

[5] Xu, Ran, et al. “Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training.” arXiv preprint arXiv:2602.01511 (2026).

[6] Xu, Wenyuan, et al. “A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization.” arXiv preprint arXiv:2504.04950 (2025).

[7] Zheng, Lianmin, et al. “Judging llm-as-a-judge with mt-bench and chatbot arena.” Advances in neural information processing systems 36 (2023): 46595-46623.

[8] Viswanathan, Vijay, et al. “Checklists are better than reward models for aligning language models.” arXiv preprint arXiv:2507.18624 (2025).

[9] Mu, Tong, et al. “Rule based rewards for language model safety.” Advances in Neural Information Processing Systems 37 (2024): 108877-108901.

[10] Gupta, Taneesh, et al. “CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling.” Findings of the Association for Computational Linguistics: ACL 2025. 2025.

[11] Wu, Mian, et al. “Rlac: Reinforcement learning with adversarial critic for free-form generation tasks.” arXiv preprint arXiv:2511.01758 (2025).

[12] Xie, Lipeng, et al. “Auto-rubric: Learning to extract generalizable criteria for reward modeling.” arXiv preprint arXiv:2510.17314 (2025).

[13] Bai, Yuntao, et al. “Constitutional ai: Harmlessness from ai feedback.” arXiv preprint arXiv:2212.08073 (2022).

[14] Guan, Melody Y., et al. “Deliberative alignment: Reasoning enables safer language models.” arXiv preprint arXiv:2412.16339 (2024).

[15] Liu, Yang, et al. “G-eval: NLG evaluation using gpt-4 with better human alignment.” arXiv preprint arXiv:2303.16634 (2023).

[16] Arora, Rahul K., et al. “Healthbench: Evaluating large language models towards improved human health.” arXiv preprint arXiv:2505.08775 (2025).

[17] Deshpande, Kaustubh, et al. “Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms.” Findings of the Association for Computational Linguistics: ACL 2025. 2025.

Notably, this need to create ground truth labels for verification means that RLVR is still dependent upon access to validated data!

The numerical weights used for categories of importance in [1] are as follows: {Essential: 1.0, Important: 0.7, Optional: 0.3, Pitfall: 0.9}

Offline difficulty filtering is a popular approach used by papers like DAPO (in the form of dynamic sampling) or Olmo 3, which uses a nearly identical technique.

In particular, running RL for a very long time allows the model to continue exploring and (eventually) find an exploit to hack the neural reward model.

This is basically a form of rejection sampling that is anchored on human data!

In this case, the preference label is binary, so we can treat this as a next token prediction problem. For example, the reward model can predict a token of 0 or 1 to indicate its preference ranking. This is in contrast to the standard definition of a reward model, which uses a ranking loss for training.

All code, data, models, and technical details are openly released for Dr. Tulu-8B, which is consistent with other fully-open releases from Ai2.

These costs consider both hosting costs of the model on OpenRouter and the costs of any API calls made by the DR agent when generating its final answer.

More specifically, authors in [5] score each example twice, where the order of completions are flipped when generating the two scores. Then, only data that yields the same score for both orderings is retained for training.

Continual Learning with RL for LLMs

Cameron R. Wolfe, Ph.D. — Mon, 26 Jan 2026 10:33:14 GMT

(from [1, 2, 3, 6, 11])

Continual learning, which refers to the ability of an AI model to learn from new tasks and data over time, has become a popular topic in the discussion of Artificial General Intelligence (AGI). Put simply, general intelligence should be adaptable, which has led some to believe that continual learning abilities are a prerequisite for AGI. The reasoning behind this argument is clear—dynamically adapting to arbitrary tasks (i.e., “on-the-job” learning) is a common trait of humans—but rigorously studying this concept is hard. In the real world, continual learning is unstructured, noisy, and open-ended. In order to make meaningful progress, we must transform this complex process into a more structured empirical setting.

“LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human’s. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box.” - Dwarkesh Patel

To do this, we can pull from decades of prior research on the topic of continual learning for neural networks [10]. Although much of this work predates LLMs, such research provides a foundational understanding of continual learning and addresses key questions that are still relevant in the modern era:

Why is continual learning difficult?
How should we structure continual learning experiments?
Which techniques are effective in practice?

In this overview, we will bridge decades of continual learning research with more recent work on LLMs to develop a comprehensive perspective on the topic. While core concepts (e.g., catastrophic forgetting, experimental frameworks, method categories, etc.) carry over directly, continual learning for LLMs is unique because of scale. Even simple techniques become complex systems problems when considering the vast data and prior knowledge of modern LLMs. As we will learn, however, continual learning is not disjoint from current LLM research. Rather, existing post-training techniques—especially on-policy reinforcement learning (RL)—can naturally mitigate catastrophic forgetting, providing hope that continual learning is within reach given the current trajectory of LLM research.

Basics of Continual Learning

LLM training pipeline

The continual learning paradigm is starkly different from how neural networks are typically trained: for several epochs over a large, fixed dataset. Modern LLM training pipelines already include a mix of offline and more iterative components. Some stages (e.g., pretraining) closely resemble classical offline training, while others (e.g., iterative RLHF or online RL) begin to capture aspects of continual learning. In this section, we will develop a foundational understanding of continual learning—how it is studied, common experimental frameworks, and the major categories of methods proposed for both LLMs and neural networks more broadly.

Catastrophic Forgetting

Historically, the difficulty of continual learning does not stem from a model’s inability to learn new tasks, but rather its tendency to degrade in performance on old tasks when training on new data. For example, running supervised training of an LLM over a new dataset will quickly enhance its in-domain performance. But, the same model may significantly deteriorate in its performance across general benchmarks or tasks that were observed previously in the training process.

“Disruption of old knowledge by new learning is a recognized feature of connectionist models with distributed representations. However, the interference is sometimes described [as] mild or readily avoided. Perhaps for this reason, the interference phenomenon has received surprisingly little attention, and its implications for connectionist modeling of human cognition have not been systematically explored.” - from [10]

In continual learning research, this phenomenon is referred to as “catastrophic forgetting”1 [11]. Training our model over new data tends to come at the cost of a significant—or catastrophic—degradation in performance on other tasks. The goal of research in this area is, therefore, to mitigate catastrophic forgetting. The figure below helps us to better understand this phenomenon. Here, a model is initially trained on task A (grey) before being exposed to a new task (yellow).

(from [11])

The three arrows in the figure depict three possible solutions that can emerge when trying to solve this continual learning problem. The red arrow depicts a solution that performs well on both tasks, while the blue and green arrows perform well on only the new task or neither task, respectively. Put simply, the goal of continual learning is to develop techniques that reliably follow the red arrow. More specifically, an effective continual learning system should both:

Perform well on new tasks to which it is exposed.
Maintain comparable (or better) levels of performance on prior tasks.

As we will see throughout this overview, these two objectives are usually at odds—we are constantly balancing general capacities with new tasks. Simply specializing our model to each new incoming task is not a valid approach because new tasks will always continue to emerge in a real-world setting. We must maintain the model’s generality while maximizing adaptability to arbitrary future tasks.

Experimental Frameworks for Continual Learning

There are many continual learning variants that have been studied in the literature; e.g., continual learning, lifelong learning, incremental learning, streaming learning, and more. Despite the many variants of continual learning that exist, all of these variants share the same sequential nature of the training process—the model is exposed to new data over time and cannot return to data from the past (unless explicitly stored in a buffer) when learning from new data; see below.

(from [12])

Non-IID data. First, we must consider the kind of data being exposed to our model. If the incremental data over which the model is trained is sampled from the model’s training distribution, then training on this data is unlikely to cause forgetting. This setup resembles a continued training approach, which is used frequently for LLM pre and post-training. However, if the incremental data is non-IID—or sampled from a distribution that is new or different from the training data distribution—then catastrophic forgetting becomes very likely; see below.

For this reason, most experimental frameworks for continual learning assume the use of non-IID data. For example, when training an image classification model, we can derive incoming data from previously unseen classes. Similarly, we can continually train an LLM on an unseen task. In both cases, we expose the model to an unseen or different distribution of data that can induce catastrophic forgetting.

Data increments. We now need to understand the different approaches for exposing data to the model during continual learning. The most common sequential learning setup is a batch-incremental learning approach, where entire batches of data are passed to the model sequentially. These batches can be arbitrarily large (e.g., an entire new dataset or task) and the model usually trains on each batch of data before moving on to the next batch; see below.

(from [1])

Formally, we have a sequence of T tasks, each with an associated dataset or batch {D_1, D_2, …, D_T}. The model is sequentially trained on each task (i.e., one-by-one and in order), leading to a sequence of T models throughout the continual learning process. When training on a new task, we do not have access to prior tasks’ data. The simplest variant of batch-incremental learning is a domain adaptation setup where T = 1. For this setup, a pretrained model is trained on data from only a single new domain. The goal of continual learning in this scenario is the same, but the model only undergoes one stage of adaptation.

The batch-incremental framework may not always be realistic, as our model may receive data in much smaller increments. For these cases, a streaming learning setup may be more appropriate. Streaming learning uses brief, online updates (i.e., one or a few forward and backward passes) for each piece of incoming data, forcing learning of new data to happen in real-time; see below.

Basic streaming learning setup

In contrast, batch-incremental learning setups usually perform a full, offline training procedure (i.e., several epochs of training) over each batch of incoming data. Although streaming and incremental learning setups are quite different, we can interpolate between these two approaches by:

Changing the amount of data passed to the model at each phase of sequential learning (e.g., single example, batch of examples, entire dataset, etc.).
Restricting the number of model updates at each sequential learning phase (e.g., single update, multi-update, full epoch, multi-epoch, etc.).

Multi-task learning. In order to determine if a continual learning technique is performing well, we need a baseline to which our models can be compared. A common baseline is joint (multi-task) training, where the model has access to all T tasks and can perform offline training over all of the data. Joint training over all data is the best possible training setup and allows us to understand the ceiling in performance that we are aiming to match via continual learning.

Which setup is best? In this overview, we will study a variety of continual learning papers in the LLM domain. Most of these papers adopt some variation of batch-incremental learning, where each batch is a new task that the LLM must learn. The domain-adaptation setup, in which a base LLM is trained over a single new task, is also common. These setups are useful for testing the tendency of LLMs to catastrophically forget, but one could argue that such a task-incremental setup does not reflect how LLMs would continually learn in the real world. For this reason, no one continual learning setup is the best. Rather, we should modify our experimental configuration within the frameworks outlined above such that the practical setting we are trying to test is most accurately reflected.

Common Techniques for Continual Learning

Now that we have a basic understanding of continual learning, we can overview some of the key categories of techniques for mitigating catastrophic forgetting. We will cover continual learning approaches in general, as well as highlight the methods that have been used in recent continual learning work with LLMs.

(from [13])

Replay mechanisms (depicted above) are a simple and effective technique for continual learning that maintain a buffer of prior data over which to train the model. Before being included in the replay buffer, samples usually undergo a selection process (e.g., based on importance or diversity) [14] to ensure that the buffer contains high-quality, representative samples and is not too large. The entire replay buffer can also be quantized or compressed to reduce memory [15]. In cases where data cannot be explicitly stored inside of a replay buffer, we can also train or maintain a generative model to replay synthetic examples [16, 17].

(from [31])

Although replay buffers are one of the most simple and effective techniques for continual learning, applying them in the LLM domain is less straightforward. Namely, LLMs have a vast amount of prior training data and, in many cases, this data is not openly available. Therefore, constructing a replay buffer that captures the general capabilities of an LLM is non-trivial. However, several works have recently explored the use of replay buffers for continual post-training. For example, instruction tuning data has a more manageable volume, allowing a replay buffer to be constructed by retaining the most important or informative data throughout the continual post-training process [30, 31]; see above.

(from [19])

Knowledge distillation [18] can be used to mitigate catastrophic forgetting by ensuring that a model’s representations do not drift during the continual learning process. In their simplest form, distillation-based continual learning techniques just combine the training loss on new data with a distillation loss with respect to prior model outputs [19]; see above. Many variants of this approach have been proposed [12, 20, 22]. We should also note that these techniques are not mutually exclusive; e.g., replay buffers can be combined with a distillation loss [13].

Regularization in various forms can be helpful for continual learning. In fact, knowledge distillation can even be considered a form of regularization. Researchers have explored constraining weight updates for subgroups of parameters—usually the most important parameters for a task [11, 21]—or increasing plasticity for select parameters [23]. We can also regularize the output distribution of the model by applying a KL divergence—similar to the use of KL to regularize the RL training objective—and even simple changes like lowering the learning rate have been found to reduce forgetting [2]. Model merging has also been applied in tandem with explicit regularization to reduce catastrophic forgetting in LLMs [29].

(from [24])

Architectural approaches have also been explored for continual learning that dynamically adapt the model’s architecture to handle incoming data. For example, new modules can be added to a neural network to handle new groups of data [24]; see above. Given the popularity of LoRA for LLMs, recent work has explored using LoRA modules as an architectural extension for learning new information during continual learning [26, 27]; see below. Mixture-of-Experts architectures for LLMs have also been shown to be better at avoiding catastrophic forgetting [28].

(from [27])

Further reading. We have now seen a comprehensive, high-level overview of the continual learning techniques that exist, but the literature is vast and dates all the way back to the 1980s (if not earlier)! The resources linked below will be helpful for developing a deeper understanding of continual learning research:

A broad overview of the categories of continual learning techniques.
A deep dive on streaming learning techniques.
A survey on continual learning for modern generative models.

Continual Learning for LLMs

“Surprisingly, without any data replay, continual post-training with RFT can achieve comparable performance with that of multi-task training, which is not achievable even when equipping SFT with continual learning strategies.” - from [1]

We will now take a look at several papers that study the topic of continual learning in the context of LLMs. Instead of focusing on continual learning techniques, however, these papers adopt standard LLM training methodologies—supervised finetuning (SFT) and reinforcement learning (RL) in particular—and analyze their natural ability to avoid catastrophic forgetting. Although SFT tends to not perform well for continual learning, RL is found to be shockingly robust to forgetting, even without employing explicit continual learning techniques (e.g., replay buffers or regularization). Given the current popularity and impact of RL in training frontier models, this inherent ability to handle continual learning makes RL an important tool for the creation of generally intelligent systems.

More on SFT and RL

To understand the different behaviors of SFT and RL in the continual learning setting, we need to gain a deeper understanding of the learning mechanisms that underlie these algorithms. For a full overview of each technique, please see the following resources:

As we will see, all of the papers in this overview adopt a reinforcement learning with verifiable rewards (RLVR) setup with GRPO as the RL optimizer.

Training objectives. In SFT, we have a fixed dataset of supervised examples over which we are training our LLM. The training objective aims to minimize the model’s negative log-likelihood over this dataset, as shown below.

SFT training objective

In contrast, RL uses the objective shown below, which focuses on maximizing the reward—such as a binary correctness signal in RLVR—of on-policy completions sampled for prompts taken from a fixed dataset. Optionally, we can include a KL divergence regularization term that penalizes the model for producing an output distribution that differs significantly from some reference model2.

Forward and reverse KL. One possible way to view the SFT and RL training objectives is through their relation to the KL divergence. Formally, the KL divergence is a measure for the divergence between two probability distributions; see here for full details. For two probability distributions P and Q, we can define the forward and reverse KL divergences as shown in the figure below.

In the LLM domain, these probability distributions are usually the next token distributions outputted by our LLM. A key difference between the forward and reverse KL divergence lies in the sampling—the distribution from which we sample in the above expectations changes. Specifically, we are either sampling from our dataset (offline) in SFT or from the LLM itself (online or on-policy) in RL.

SFT ≈ forward KL. Using these concepts, we can show that the training objective used by SFT is equal to the forward KL divergence up to a constant. Let’s call the optimal (or target) distribution for our dataset π_*. We can show the following for the relationship between this objective and the forward KL divergence, where H(π_*) denotes the entropy3 of the optimal distribution over the SFT dataset.

In the above expression, the entropy of the optimal distribution is a constant, so the forward KL and SFT training objective are equal up to a constant—minimizing forward KL is equivalent to minimizing the negative log-likelihood objective.

RL ≈ reverse KL. As mentioned previously, RL tries to maximize the reward of on-policy completions while minimizing KL divergence with respect to a reference policy. We can actually derive a closed-form expression for the optimal solution to the RL objective. The expression for the optimal policy is shown below, where Z(x) denotes the partition function. Notably, this optimal policy expression is also the first part of deriving the training loss for DPO!

If we assume that this optimal policy π_* is our target distribution, then we can show that maximizing the RL objective is equivalent to minimizing the reverse KL divergence between this target distribution and our policy π_θ.

As we can see, the first line of this equation computes the reverse KL divergence relative to the KL divergence used in our derivation for the SFT objective. In the final line, we have the negative of our RL objective (plus a scaling factor of 1/β and an additional constant). Therefore, minimizing this reverse KL divergence objective would be the same as maximizing the RL training objective.

What does this tell us? Now we understand the relation of SFT and RL to the forward and reverse KL divergence, respectively. But, what do these relationships actually tell us about the objectives? SFT minimizes negative log-likelihood over a dataset, which is equivalent to minimizing the forward KL divergence. This is a mode-covering objective. Our model is heavily penalized for assigning low probability to any completion that is found in the data—the model must “spread” its probability mass across all possible completions or modes in the data.

On the other hand, RL maximizes rewards of on-policy completions, which is equivalent to a reverse KL objective and is mode-seeking. Put differently, the model prioritizes high-reward outputs, even at the cost of ignoring output modes.

In SFT, the model’s loss increases exponentially if we assign near-zero probability to any completion in the dataset—this is due to the shape of the negative log-likelihood curve (shown above)! Such a property is not true of RL, as we are simply maximizing the reward of on-policy completions. Assigning near-zero probability to a completion will prevent this particular completion from being sampled during RL, but reward can still be maximized over the completions that are sampled. This is a fundamental property of RL that creates favorable behavior with respect to minimizing catastrophic forgetting during continual learning.

Reinforcement Finetuning Naturally Mitigates Forgetting in Continual Post-Training [1]

Continual learning can be viewed as a continued post-training process for an LLM. In this setup, the same base LLM undergoes extensive post-training over an evolving and expanding data stream, forcing the model to adapt to new requirements and learn new skills or knowledge without losing existing capabilities. However, avoiding catastrophic forgetting in this scenario is difficult. In [1], authors consider this continual post-training setup and analyze the best learning paradigm—either supervised finetuning (SFT) or reinforcement learning (RL)—for maximizing performance and minimizing forgetting.

Continual post-training. In the real world, continual learning is messy—the LLM will be constantly exposed to new data from various sources—but a more organized proxy setup is needed for research. A common way to simulate continual learning is via a sequential learning (or batch-incremental) setup, where the LLM is sequentially exposed to an ordered group of datasets. In [1], authors choose seven datasets that cover a wide scope of multi-modal (vision) use cases: ScienceQA, TextVQA, VizWiz, Geometry3K, GQA, PathVQA, and Super-CLEVR.

“A higher AvgAcc indicates better overall performance, while an FM closer to zero signifies less forgetting and better knowledge preservation.” - from [1]

Evaluation metrics. Our goal in continual post-training is to i) maximize the LLM’s performance on each new task and ii) avoid performance degradation—or catastrophic forgetting—on prior tasks. Assume that the LLM is evaluated on all tasks after each training round, yielding performance P_{t, j} on task j after learning for task t is complete. We can then capture key performance properties of continual post-training via the following two metrics:

Average accuracy (AvgAcc): the average accuracy of the model across all tasks after training on the final task T has completed.
Forgetting measure (FM): the average difference between the model’s final accuracy for a task and the best accuracy observed for that task throughout all T rounds of the training sequence.

Continual post-training metrics (from [1])

After the end of the continual post-training process, the above metrics are computed over the test sets of all previously-encountered tasks. Going further, authors in [1] also measure performance on several general LLM benchmarks (i.e., MMMU, MMLU-Pro, and POPE) at the end of the continual post-training process to check for any impact on the model’s general capabilities.

SFT versus RL. Continual post-training experiments are performed in [1] using the Qwen-2.5-VL-7B-Instruct model, which is sequentially trained on data from each of the seven benchmarks. Notably, no replay buffer or data from prior tasks is used when training on new tasks, so the model’s ability to avoid forgetting is entirely dependent upon mechanics of the learning algorithm. As mentioned before, two types of learning algorithms are used:

Supervised Finetuning
Reinforcement Learning (GRPO, RLOO and ReMax)

For RL, we derive rewards using a standard reasoning model setup that combines the verifiable reward with a format reward that encourages the model to i) wrap its reasoning trace in tokens and ii) mark its output with a \boxed{} label. Models output a reasoning trace prior to their final output, though tests are performed both with and without reasoning for all training setups.

RL forgets less. The results of the continual post-training experiments in [1] are depicted below. SFT clearly leads to catastrophic forgetting of previously-learned tasks, which gets worse as tasks move further into the past—forgetting is worst on initial tasks in the sequence. More specifically, we see an average accuracy of 54% with SFT, while multi-task training on all tasks reaches an average accuracy of 62.9%. Similarly, a FM of -10.4% is also observed for SFT, indicating that most tasks degrade noticeably in performance throughout continual post-training.

(from [1])

While SFT struggles to mitigate forgetting, RL naturally adapts well to new tasks. For GRPO, we observe an average accuracy of 60% (i.e., slightly below multi-task learning) and an FM of -2.3%. Additionally, the final accuracy on ScienceQA—the first task in the sequence—is 93%, compared to a peak accuracy of 95.6%. These results show that RL strikes a strong balance between learning and remembering.

“Without any data replay, continual post-training with RFT can achieve comparable performance with that of multi-task training, which is not achievable even when equipping SFT with continual learning strategies.” - from [1]

Influence on general capabilities. In the same vein, SFT-based continual post-training also degrades general model capabilities; see below. In contrast, we see in [1] that RL maintains—or even slightly enhances—performance on general benchmarks. For example, models sequentially trained with GRPO improve from an initial accuracy of 52.1% to a final accuracy of 54.2% on MMMU!

(from [1])

Such an ability to maintain performance on general benchmarks is a desirable aspect of continual learning. Ideally, we want the LLM to adapt to new tasks while maintaining its existing, foundational capabilities as much as possible.

Why does RL forget less? Given the above results, we might begin to wonder: Why does RL have the ability to naturally avoid catastrophic forgetting? Of course, it is possible that such continual learning abilities are directly attributable to RL itself. However, authors in [1] also consider two alternative explanations for the lack of catastrophic forgetting:

The use of a KL divergence term in RL regularizes the training process and acts as a form of knowledge distillation that preserves prior knowledge.
The use of long CoT reasoning in models trained with RL leads to a more robust knowledge base that is better protected from forgetting.

To test whether these factors help with avoiding catastrophic forgetting, three setups are tested that ablate the use of KL divergence and long CoT reasoning. Interestingly, we learn from testing these setups that removing KL divergence, despite degrading the stability of RL training4, does not lead to any degradation in performance metrics for continual post-training. Additionally, models that do not output a reasoning trace resist catastrophic forgetting similarly to those that do. Using CoT reasoning improves baseline model performance, but continually trained models in either setup see the same amount of catastrophic forgetting.

(from [1])

The results of these ablation experiments are outlined in the table above. The impressive performance of RL in continual post-training experiments does not seem to stem from the use of KL divergence or long CoT reasoning. Rather, the ability to perform continual learning seems to be an inherent property of RL training. Insight as to how RL avoids forgetting is provided by theory in [1] showing that RL naturally scales policy updates according to the variance of the reward signal, leading to more conservative updates for important or sensitive parameters.

“We offer a theoretical perspective suggesting that RFT’s updates are inherently more conservative in parameter subspaces sensitive to prior tasks. This conservatism is naturally scaled by the variance of the reward signal, creating a data-dependent regularization that dampens updates on uncertain samples, thus protecting established knowledge.” - from [1]

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting [2]

Work in [2] shares a very similar focus to the paper above—trying to compare SFT and RL in the context of continual learning. However, a different experimental setup is used that considers three domains: instruction following (IFEval), general skills (MMLU), and arithmetic reasoning (Countdown). Beyond these target tasks that are used for training and evaluation, a few non-target tasks (i.e., MATH and two safety benchmarks) are included to provide a wider evaluation suite. We do not train the LLM over a sequence of tasks in [2]. Rather, the LLM is trained over one target task—a domain adaptation setup—and we measure performance via:

The accuracy gain on that target task.
The average accuracy drop across all non-target tasks.

Notably, the lack of multi-step sequential learning makes this setup less realistic. In [1], we see that the impact of catastrophic forgetting is greater after several training rounds. However, the domain adaptation setup in [2] does allow us to efficiently analyze the forgetting mechanics of different learning algorithms. The following learning algorithms are considered in [2]:

SFT training on responses from a teacher model (Llama-3.3-70B-Instruct).
Self-SFT training, which performs SFT-style training over responses from the initial policy (before training) or reference model.
RL training using GRPO with verifiable rewards—a standard RLVR setup.

Both SFT variants filter completions based on correctness, as determined by deterministic verifiers for each domain. Self-SFT is a rejection sampling setup (i.e., incorrect responses are rejected) that is used as a simple baseline, whereas the SFT setup performs offline knowledge distillation from a larger model. Self-SFT is an offline approach as well because completions are sampled from the initial model, rather than on-policy. The same verifiable correctness signal used for filtering completions in SFT variants is also used as the reward signal in RL.

(from [2])

RL forgets less (again). Experiments are performed in [2] using Qwen-2.5 and Llama-3 models with up to 8B parameters. As shown above, higher levels of forgetting—as measured via the average accuracy drop across non-target tasks—are observed with SFT compared to RL. In fact, Qwen-2.5 models see <1% average accuracy drop across all tasks and model scales for RL training, whereas the average accuracy drop with SFT reaches nearly 30% in some cases.

“RL leads to less forgetting than SFT while achieving comparable or higher target task performance… SFT suffers from severe forgetting, whereas RL can achieve high target task performance without substantial forgetting.” - from [2]

Despite the ability of RL to avoid catastrophic forgetting, the results with SFT are not actually bad—there is just a clear domain tradeoff. We can achieve performance improvements in the target domain via RL training, but models trained via SFT actually perform better. Unfortunately, the superior performance of SFT in the target domain comes at the cost of degraded performance on non-target tasks. For this reason, the comparison is not as simple as RL > SFT. Rather, RL and SFT lie at different points on the Pareto frontier of target and non-target task accuracy—better performance in one domain comes at the expense of the other.

(from [2])

Benefits of on-policy data. Similar to work in [1], authors in [2] show the lack of catastrophic forgetting in RL is not due to the inclusion of a KL divergence term in the objective; see above. Interestingly, the exact advantage formulation used by GRPO is also found to have little impact on continual learning capabilities—a naive REINFORCE-based RL setup is shown to mitigate forgetting to a similar extent. It is possible, however, that the continual learning abilities of RL stem from its use of on-policy samples—unlike the offline dataset used by SFT—during training. To test this theory, we consider the following training setups:

On-policy SFT: running SFT using fully on-policy samples that are directly obtained from the RL training process.
Iterative SFT: re-generating data for SFT after every epoch using the current policy (i.e., a partially on-policy approach).

Put simply, these approaches adapt SFT to use on-policy data, which allows us to decouple the impact of RL training and on-policy data. The use of iterative SFT also allows us to test a semi-on-policy scenario, which samples fresh on-policy data at the end of each epoch (i.e., instead of generating new samples during each training iteration). This coarse-grained approach to on-policy data has efficiency benefits—we can adjust the regularity with which we sample fresh on-policy data.

“We find that for SFT, while generating data only from the initial policy is not enough, approximately on-policy data generated at the start of each epoch can suffice for substantially reducing forgetting. This suggests a practical guideline for LM post-training: leveraging on-policy data, potentially sampled asynchronously or at the start of each epoch for improved efficiency, can reduce unintended disruption of the model’s existing capabilities.” - from [2]

Experiments with these training algorithms provide empirical evidence that on-policy data is a key contributor to the success of RL in the continual learning domain. Specifically, models trained via on-policy SFT mitigate forgetting to a similar extent as those trained via RL. Additionally, the data used does not need to be fully on-policy—similar trends are observed with iterative SFT; see below.

(from [2])

Mode-seeking versus mode-covering. Intuitively, we might assume that the mode-covering nature of SFT would allow the model to maintain probability mass across all tasks and, therefore, avoid catastrophic forgetting. As we have seen, however, the opposite is true in practice. Such a finding is due to the fact that we are only training our model over a small subset of the model’s total data distribution in most of these experiments. Potentially our observations would be different if we were able to retain the LLM’s entire training dataset within a replay buffer, but implementing such an approach efficiently would be incredibly difficult.

(from [2])

In the standard LLM post-training setup, the mode-seeking behavior of RL is more robust to catastrophic forgetting. To explain this phenomenon, authors in [2] construct a simplified setting shown above, which illustrates the dependence of forgetting on the modality of the underlying target distribution. If our target distribution is multi-modal, which is likely to be true for an LLM, then the mode-seeking nature of RL actually leads to less forgetting relative to a mode-covering objective like SFT. The simplified distribution that is constructed in [2] has two modalities corresponding to old and new knowledge. For such a distribution, the forward KL objective yields noticeable forgetting while minimizing the reverse KL allows both modes of the target distribution to be properly captured.

RL’s Razor: Why Online RL Forgets Less [3]

As we know, SFT and RL achieve comparable performance when training on a new task but have drastically different forgetting dynamics. In most cases, gains on new tasks with SFT come at the cost of erasing prior knowledge, while RL is much better at protecting old capabilities; see below. By studying this gap in performance, authors in [3] identify a metric that reliably predicts the amount of forgetting that occurs for both SFT and RL: the distributional shift—measured via KL divergence—between the base and finetuned models on the target task.

(from [3])

RL’s Razor. In addition to discovering this relationship between the underlying distribution shift and forgetting, we see in [3] that the finetuned models from SFT and RL have unique properties:

RL is biased towards solutions that minimize distributional shift.
SFT can converge to solutions arbitrarily far away from the base model.

Such a property naturally implies the improved continual learning abilities of RL. By discovering a solution that minimizes distributional shift, we also minimize the amount of forgetting that occurs; see above. The bias of RL towards nearby solutions that minimize catastrophic forgetting is referred to in [3] as “RL’s Razor”5.

“RL’s Razor: among the many high-reward solutions for a new task, on-policy methods such as RL are inherently biased toward solutions that remain closer to the original policy in KL divergence… the KL divergence between the fine-tuned model and the base model, measured on the new task, reliably predicts… forgetting.” - from [3]

Distribution shift. In the LLM domain, we often measure the KL divergence between the next token distributions of two models. For example, the RL training objective has a KL divergence term that regularizes drift between the current and reference policy, where the KL divergence is computed using on-policy samples taken from the current policy during RL training. In [3], authors compute the KL divergence over data from the task on which our policy is being finetuned (i.e., the target task). We are restricted to using the target data because we rarely have access to the pretraining data (or any prior tasks) on which an LLM was trained.

(from [3])

This KL divergence between base and finetuned models on the target dataset can be viewed as capturing the distributional shift from training. We are computing the divergence between models before and after training over the training data itself. When measured in this way, the distributional shift is found to be consistently predictive of the amount of forgetting that occurs. Given that no prior data is used to compute this KL divergence, such a finding is highly non-trivial!

Experiments in [3] are performed using both vanilla SFT and RL with GRPO. The RL setup uses standard verifiable rewards and no KL divergence regularization. Similarly to [2], the base model (Qwen-2.5-3B-Instruct) is trained on one target task (i.e., Open-Reasoner-Zero, ToolAlpaca, or the Chemistry L-3 subset of SciKnowEval) and evaluated on both the target task and set of prior tasks (i.e., HellaSwag, TruthfulQA, MMLU, IFEval, WinoGrande, and HumanEval). Given that hyperparameter settings can massively impact results in a continual learning setup6, a wide variety of hyperparameters are tested for each task and results are visualized as a Pareto frontier constructed by all possible settings.

Lower KL leads to less forgetting. RL training improves target task performance while keeping performance on prior tasks stable. However, improvements in performance obtained via SFT come at the cost of noticeable forgetting. The deterioration in performance is most visible in the math domain; see below.

Identifying the cause of such forgetting is difficult due to the high computational cost of RL training—testing each hypothesis is quite expensive! To make this search more tractable, a toy setting is created based on the MNIST and FashionMNIST datasets for which RL training is much faster. Using this setting, a variety of candidate metrics are tested for a relationship to catastrophic forgetting:

The magnitude of changes to model parameters.
The sparsity of weight updates.
The rank of policy gradients throughout training.

The only quantity that demonstrates a consistent relationship with the amount of catastrophic forgetting is the KL divergence between base and finetuned models over the target dataset; see below. The fact that the rank or sparsity of policy gradient updates is unrelated to forgetting is notable, as prior research [4] has shown that RL works surprisingly well even when using LoRA with a low rank. Such a finding indicates that the updates being produced by RL are potentially sparse or low rank, which could help to reduce forgetting. However, we see in [3] that the story is not this simple. Rather, the benefits of RL stem from an implicit KL regularization—or RL’s Razor—that minimizes distribution shift in training.

(from [3])

To further validate the relationship between the KL divergence and forgetting, authors create an “oracle” SFT distribution in their toy setting. Put simply, this experiment performs SFT on a dataset that has been analytically constructed to minimize the KL divergence between the base and finetuned models. As shown above, running SFT on this data yields an even better tradeoff than RL—the model performs better on the target task without sacrificing prior task performance.

“RL performs well because its on-policy updates bias the solution toward low-KL regions, but when SFT is explicitly guided to the KL-minimal distribution, it can surpass RL.” - from [3]

On-policy data. Beyond the toy example explained above, authors in [3] also run SFT training over on-policy data obtained during RL. The accuracy-forgetting tradeoff achieved by the resulting model matches that of those trained via RL, which aligns with prior work [2] and provides further evidence that on-policy data plays a key role in mitigating forgetting for RL. To better understand the impact of on-policy data, three different learning algorithms are tested (shown below):

Standard GRPO.
Standard SFT.
1-0 REINFORCE: an on-policy RL algorithm with a very simple advantage function (i.e., 1 if the answer is correct and zero otherwise).
SimPO [5]: an offline preference tuning algorithm that simplifies DPO by directly using the log probability of a sequence as the implicit reward and, therefore, removing the need for a reference model.

As we can see in the left half of the above figure, these experiments ablate the use of negative examples and on-policy data within the training setup. Interestingly, the 0-1 REINFORCE algorithm performs similarly to GRPO, while results with SimPO resemble those of SFT. Such results indicate that the use of on-policy data is the key contributor to RL’s lack of forgetting. We also see above that the use of on-policy data leads to minimal KL divergence between the base and finetuned models over the target distribution. Such results indicate that the implicit bias of RL towards low KL solutions stems from the online nature of training. This empirical observation is also justified by further theoretical analysis in [3].

(from [3])

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting [6]

“While RL aligns with the model’s internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as Confident Conflicts—tokens characterized by low probability but low entropy.” - from [6]

We have learned that RL avoids catastrophic forgetting much better than SFT due to its use of on-policy data, which allows for the discovery of a solution with minimal KL divergence between base and finetuned models on the target data. Although we know that these factors lead to less forgetting, we do not yet understand why this is the case. In [6], authors offer a new perspective on the forgetting properties of SFT and RL by analyzing token probabilities and entropy of models trained with these two approaches. When these two quantities are measured throughout the training process, we see that a clear gap exists:

On-policy RL tends to cluster in regions of highly-confident and correct predictions—characterized by high probability and low entropy—or exploratory completions—characterized by high entropy.
SFT has a significant cluster of tokens with both low entropy and low probability—these are referred to as “Confident Conflicts”.

To discover this distribution mismatch, token probability and predictive entropy is measured over both the SFT dataset and model-generated rollouts. This trend is visualized below, where we see that SFT data has a noticeable cluster of confident conflict tokens that does not exist when using on-policy data.

(from [6])

Why does this occur? We are using external supervision in SFT (i.e., an offline supervised dataset), whereas RL learns from on-policy or self-generated data. In some cases, training the model on external data forces it to mimic outputs that align poorly with its current next distribution—confident conflicts occur when external data has a strong conflict with the model’s prior. As a result, gradient updates can become large and destructive, leading to catastrophic forgetting.

“Because the model strongly favors another token, fitting the target requires substantial parameter updates, which can overwrite general representations in the base model. By contrast, when the model is uncertain (high entropy), the gradients are smaller and updates are gentler, helping preserve the model’s original capabilities.” - from [6]

Masking conflicts. To determine whether confident conflict tokens truly lead to forgetting, authors in [6] test simply masking the loss from such tokens during SFT. Interestingly, catastrophic forgetting is significantly reduced when these tokens are masked from the training loss, indicating that confident conflict tokens play a significant role in the tendency of SFT to damage prior knowledge; see below.

(from [6])

Extending this idea, a novel training algorithm, called Entropy Adaptive Finetuning (EAFT), is proposed in [6] that scales the token-level cross-entropy loss by a dynamic entropy factor. The new loss formulation is outlined below, which multiplies the supervised loss by the token’s normalized entropy7. By using this token-level entropy scaling factor, we can effectively mask the loss of low entropy tokens that lead to destructive gradient updates while maintaining the full update for high entropy tokens that are beneficial for exploration.

EAFT loss formulation (from [6])

“EAFT employs a soft gating mechanism that dynamically modulates the training loss based on token-level entropy.” - from [6]

To improve the efficiency of EAFT, authors in [6] only compute entropy over the Top-K (where K = 20) tokens in the distribution. As shown in the figure below, this setting balances the tradeoff between compute and memory overhead and ensures that added computational overhead relative to vanilla SFT is minimal.

(from [6])

Results on Math. EAFT is validated in the Math domain using models across multiple families ranging from 4B to 32B parameters. Training prompts are sourced from NuminaMath, BigMathVerified, and Nemotron-CrossThink, while completions are sampled from Qwen-3-235B-A22B-Instruct. Both in-domain and general benchmarks are used for evaluation. Models trained with EAFT perform well in the target domain while maintaining performance on general benchmarks; see below. Additionally, EAFT is found to effectively filter confident conflict samples during the training process, as demonstrated by the visible reduction in gradient magnitude within the confident conflict zone of the below figure. These results are further validated in experiments in the medical and tool use domains.

(from [6])

Does RL Generalize Well?

So far, we have focused on retaining old skills while learning new ones. A closely related question is whether the same mechanisms that reduce forgetting also improve transfer and out-of-distribution generalization. The fact that RL performs well in a continual learning setting has important implications for its generalization properties. Put simply, RL training tends to benefit more than just the target domain. As we will see in the next few papers, there are many examples of RL training yielding cross-domain performance benefits or improving the generalization of an LLM to some other task. Much of this analysis is similar in nature to what we have seen for continual learning, but the emphasis shifts from remembering prior tasks to generalizing beyond the training distribution.

(from [7])

SFT Memorizes, RL Generalizes [7] performs a comparative post-training analysis between SFT and RL on both language-only and vision-language tasks. The main results of this analysis are depicted above, where we see that:

Both SFT and RL improve in-domain performance.
Only RL generalizes well to new tasks or data.

Experiments in [7] use Llama-3.2-Vision-11B as the base model and train over two synthetic tasks (shown below) that test distinct forms of generalization:

GeneralPoints: A card game that requires the model to create equations to reach a target number using four given cards. We can test rule-based generalization by changing the mapping of face cards to numbers.
V-IRL: A navigation task that has the model reach a destination using visual landmarks and spatial reasoning. We can test generalization by varying the available action space or visual context.

Each task can be setup as both a language-only and vision-language problem. In all experiments, RL tends to promote out-of-distribution generalization while SFT actually damages it. For example, the out-of-distribution performance of models trained with RL improves by 3.5% and 11.0% on language-only GP and V-IRL. For vision-language variants, this performance improvement is slightly less pronounced (i.e., 3.0% and 9.3% on GP and V-IRL) but still present. In stark contrast, SFT degrades out-of-distribution performance by as much as 79.5%.

(from [7])

As an interesting side note, authors in [7] also find that RL benefits the model’s underlying perception capabilities. Namely, the model actually improves in its ability to identify key vision features during training, indicating that RL is not just learning reasoning patterns but also improving fundamental abilities (i.e., perception).

“Analysis of the GP-VL task showed that RL improved the model's ability to correctly identify card values from images, suggesting that outcome-based rewards can refine perceptual processing beyond what supervised training achieves.” - from [7]

From Atomic to Composite [8] tests the generalization impact of RL training on problems that require complementary reasoning—the ability to integrate external context with the model’s parametric knowledge. To test this style of reasoning, a controlled synthetic dataset is created; see below. The dataset is based on a knowledge graph of human biographies with fixed relationships. Using this graph, we can construct multi-hop questions that test complementary reasoning by design. More specifically, questions are specifically constructed to test three levels of reasoning with increasing complexity (depicted below):

IID reasoning applies known patterns to new entities.
Compositional reasoning applies known relationships to new relational paths.
Zero-shot reasoning requires generalizing to unseen relations.

(from [8])

The training process in [8] starts with Qwen-2.5-1.5B, performs an initial SFT stage, then tests several combinations of SFT and RL (using GRPO with binary verifiable rewards) training. The main results of these experiments are below.

(from [8])

As we can see, this analysis shows us that RL is capable of synthesizing multiple atomic reasoning capabilities into higher-level (composite) reasoning patterns. However, this is only possible when the model is trained with SFT prior to RL. In contrast, pure SFT training yields high in-domain performance and poor out-of-domain generalization, which reflects findings in prior work. In other words, SFT tends to memorize reasoning patterns rather than learn them. When a model is first trained via SFT to acquire primitive reasoning capabilities, then RL serves as a “synthesizer” through which the model learns how to properly combine these capabilities for solving complex, compositional reasoning problems.

“[We demonstrate] that RL synthesizes novel reasoning strategies and enables robust zero-shot generalization when LLMs are first pre-trained on foundational atomic reasoning skills via Supervised Fine-Tuning.” - from [8]

Does math reasoning improve general capabilities? A large-scale empirical analysis is performed in [9] to determine whether math-oriented reasoning training is also helpful in other domains. This analysis includes both a wide audit of existing models across math reasoning, general reasoning, and non-reasoning benchmarks, as well as a comparison of SFT and RL-based finetuning on Math-only data (i.e., ~47K prompts sourced from DeepScaler and SimpleRL).

(from [9])

As shown in the above plot, SFT-trained models tend to have poor transferability to non-reasoning tasks, while models trained with RL generalize across both reasoning and non-reasoning tasks—RL models generalize well beyond math and naturally avoid catastrophic forgetting. Similar trends are observed when analyzing the transferability of other open SFT or reasoning models across reasoning and non-reasoning benchmarks; see below. Further analysis in [9] reveals that on-policy data—as we might expect from [2, 3]—and the presence of a negative gradient in the RL objective are key contributors to favorable generalization properties.

(from [9])

Conclusion

In continual learning, we want the model to learn new tasks quickly while preserving old capabilities. When studying recent work on continual learning for LLMs, a consistent pattern emerges: on-policy RL is naturally more robust to catastrophic forgetting relative to SFT, even without explicit mechanisms to aid the continual learning process. This advantage appears to stem from the online nature of RL, which biases learning toward low distribution shift (or low KL) solutions and avoids destructive updates induced by offline data. The natural continual learning abilities of RL have broader implications for the emergence of AGI, as adaptability is a key prerequisite for generally intelligent systems. The studies seen in this overview use only simple, structured proxies for continual learning in the real world, which will be much messier. However, these results show that RL—an already impactful training paradigm—is a promising starting point for building general systems that can adapt to any task. In this way, continuing the existing trajectory of LLM research may yield natural progress for continual learning.

New to the newsletter?

Subscribe now

Bibliography

[1] Lai, Song, et al. “Reinforcement fine-tuning naturally mitigates forgetting in continual post-training.” arXiv preprint arXiv:2507.05386 (2025).

[2] Chen, Howard, et al. “Retaining by doing: The role of on-policy data in mitigating forgetting.” arXiv preprint arXiv:2510.18874 (2025).

[3] Shenfeld, Idan, Jyothish Pari, and Pulkit Agrawal. “Rl’s razor: Why online reinforcement learning forgets less.” arXiv preprint arXiv:2509.04259 (2025).

[4] Lu, Kevin et al. “On-Policy Distillation.” https://thinkingmachines.ai/blog/on-policy-distillation/ (2025).

[5] Meng, Yu, Mengzhou Xia, and Danqi Chen. “Simpo: Simple preference optimization with a reference-free reward.” Advances in Neural Information Processing Systems 37 (2024): 124198-124235.

[6] Diao, Muxi, et al. “Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting.” arXiv preprint arXiv:2601.02151 (2026).

[7] Chu, Tianzhe, et al. “Sft memorizes, rl generalizes: A comparative study of foundation model post-training.” arXiv preprint arXiv:2501.17161 (2025).

[8] Cheng, Sitao, et al. “From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning.” arXiv preprint arXiv:2512.01970 (2025).

[9] Huan, Maggie, et al. “Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning.” arXiv preprint arXiv:2507.00432 (2025).

[10] McCloskey, Michael, and Neal J. Cohen. “Catastrophic interference in connectionist networks: The sequential learning problem.” Psychology of learning and motivation. Vol. 24. Academic Press, 1989. 109-165.

[11] Kirkpatrick, James, et al. “Overcoming catastrophic forgetting in neural networks.” Proceedings of the national academy of sciences 114.13 (2017): 3521-3526.

[12] Rebuffi, Sylvestre-Alvise, et al. “icarl: Incremental classifier and representation learning.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.

[13] Castro, Francisco M., et al. “End-to-end incremental learning.” Proceedings of the European conference on computer vision (ECCV). 2018.

[14] Chaudhry, Arslan, et al. “On tiny episodic memories in continual learning.” arXiv preprint arXiv:1902.10486 (2019).

[15] Hayes, Tyler L., et al. “Remind your neural network to prevent catastrophic forgetting.” European conference on computer vision. Cham: Springer International Publishing, 2020.

[16] Rannen, Amal, et al. “Encoder based lifelong learning.” Proceedings of the IEEE international conference on computer vision. 2017.

[17] Shin, Hanul, et al. “Continual learning with deep generative replay.” Advances in neural information processing systems 30 (2017).

[18] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

[19] Li, Zhizhong, and Derek Hoiem. “Learning without forgetting.” IEEE transactions on pattern analysis and machine intelligence 40.12 (2017): 2935-2947.

[20] Wu, Yue, et al. “Large scale incremental learning.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[21] Aljundi, Rahaf, et al. “Memory aware synapses: Learning what (not) to forget.” Proceedings of the European conference on computer vision (ECCV). 2018.

[22] Dhar, Prithviraj, et al. “Learning without memorizing.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[24] Rusu, Andrei A., et al. “Progressive neural networks.” arXiv preprint arXiv:1606.04671 (2016).

[25] Draelos, Timothy J., et al. “Neurogenesis deep learning: Extending deep networks to accommodate new classes.” 2017 international joint conference on neural networks (IJCNN). IEEE, 2017.

[26] Guo, Haiyang, et al. “Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model.” arXiv preprint arXiv:2503.12941 (2025).

[27] Zhao, Hongbo, et al. “Mllm-cl: Continual learning for multimodal large language models.” arXiv preprint arXiv:2506.05453 (2025).

[28] Li, Hongbo, et al. “Theory on mixture-of-experts in continual learning.” arXiv preprint arXiv:2406.16437 (2024).

[29] Liu, Wenzhuo, et al. “LLaVA-c: Continual Improved Visual Instruction Tuning.” arXiv preprint arXiv:2506.08666 (2025).

[30] Maharana, Adyasha, et al. “Adapt-$\infty $: Scalable continual multimodal instruction tuning via dynamic data selection.” arXiv preprint arXiv:2410.10636 (2024).

[31] Lee, Minjae, et al. “OASIS: Online Sample Selection for Continual Visual Instruction Tuning.” arXiv preprint arXiv:2506.02011 (2025).

Earlier research papers on this topic also commonly use the term “catastrophic interference” to refer to the same concept as catastrophic forgetting.

The reference model is usually the initial policy prior to RL training, such as the SFT model or a base model.

See Section 2.1 of this paper for an exact explanation of how entropy is computed using token probabilities outputted by an LLM.

More specifically, authors in [1] mention that, without any KL divergence term, the RL training process has to be resumed after a divergence numerous times for the final model to converge properly.

This is a play on words related to the concept of Occam’s Razor, which suggests that the simplest solution (or the solution requiring the fewest assumptions or elements) is usually correct.

For example, if we want to reduce the amount of forgetting when training with SFT, we can simply lower our learning rate [2].

For a system with K outcomes, the maximum entropy is ln(K), which is the entropy of the uniform distribution; see here for details.

GRPO++: Tricks for Making RL Actually Work

Cameron R. Wolfe, Ph.D. — Mon, 05 Jan 2026 10:33:50 GMT

(from [1, 3, 4])

Recent research on large language models (LLMs) has been heavily focused on reasoning and reinforcement learning (RL). At the center of this research lies Group Relative Policy Optimization (GRPO) [13], the RL optimizer used to train most open-source reasoning models. The popularity of GRPO is enhanced by its conceptual simplicity and practical efficiency. However, the simplicity of GRPO can be deceptive—the vanilla GRPO algorithm has subtle issues that can hinder the RL training process, especially at scale. Solving the shortcomings of GRPO has become a popular research topic, leading to the proposal of many tricks, best practices, and techniques for getting the most out of RL training. In this overview, we will outline all of this work, arriving at a deeper practical understanding of how to modify and use GRPO for training high-quality reasoning models.

Background on Reasoning and RL

Prior to covering recent work on improving GRPO, we will spend this section building a basic understanding of the GRPO algorithm in its original form. We will also learn about Proximal Policy Optimization (PPO) [11], the predecessor to GRPO, and discuss how RL is used in the context of LLMs and reasoning models more generally. Notably, this discussion will assume basic knowledge of the problem setup and terminology used for RL training with LLMs. Those who are less familiar with RL basics can learn more at the following links:

RL Problem Setup & Terminology [link]
Different RL Formulations for LLMs [link]
Policy Gradient Basics [link]

RL for Reasoning

“Inference scaling empowers LLMs with unprecedented reasoning ability, with RL as the core technique to elicit complex reasoning.” - from [1]

GRPO is the most common RL optimizer to use for training reasoning models. Before diving deeper into the details of GRPO, we need to build an understanding of how RL is actually used to train LLMs. In particular, there are two key types of RL training that are commonly used:

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a reward model trained on human preferences.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.

The main difference between RLHF and RLVR is how we assign rewards—RLHF uses a reward model, while RLVR uses verifiable rewards. Aside from this difference, both are online RL algorithms with a similar structure; see below. GRPO is one possible RL optimizer that can be used to derive the policy update in this pipeline, though any RL optimizer (e.g., PPO or REINFORCE) can be used.

General framework for online RL

Given that RLHF focuses on aligning an LLM to human preferences, it is used more heavily for chat models and is less applicable to reasoning1. Most reasoning models are trained using RL in verifiable domains (e.g., math and coding), so we will primarily focus on the RLVR setup for the remainder of this post.

More on RLVR. To train an LLM with RLVR, we must select a domain that is verifiable in nature; e.g., math or coding. In other words, we need to create a dataset that has either i) a known ground truth answer or ii) some rule-based technique that can be used to verify the correctness of responses to the prompts in our dataset. For coding, we can create a sandbox for running LLM-generated code and use test cases to assess correctness. Similarly, we can evaluate math problems by performing basic string matching between the answer predicted by the LLM and a ground-truth answer for a problem; see below.

Verifying a problem with exact string matching

Usually, we must instruct the LLM to format its output so the final answer can be easily parsed. As an example, Math Verify is a popular package that was built for performing robust verification in the math domain. Even then, however, string matching is not always sufficient for evaluating correctness. In many cases, we can benefit from crafting validation logic that is more robust (e.g., asking an LLM to identify equivalent answers) and that captures variations in output.

“Math verification is determined by an LLM judge given the ground truth solution and DeepSeek-R1 solution attempt. We found that using an LLM judge instead of a stricter parsing engine (Math-Verify) for verification during data generation results in a higher yield and leads to higher performing downstream models.” - source

Reasoning models are structurally identical to a standard LLM. The key distinction between reasoning models and LLMs is the ability to “think” about a prompt prior to providing a final output. By increasing the length of this thinking process, reasoning models can use inference-time scaling—or simply spend more compute on generating a particular completion—to improve their performance.

Concrete example of a reasoning model’s full output

As shown above, this thinking process occurs in the form of a free-text, long chain-of-thought (CoT)—also called a rationale or reasoning trajectory—generated by the LLM. Many closed reasoning models (though not all of them!) hide the raw reasoning trace from the user, providing instead only a truncated version or summary of the reasoning process along with the model’s final answer.

Learning to reason via RL. If we look at some examples of reasoning trajectories from open or closed reasoning models, we will notice that these models exhibit some sophisticated reasoning behaviors in their long CoT:

Thinking through each part of a complex problem.
Decomposing complex problems into smaller, solvable parts.
Critiquing solutions and finding errors.
Exploring many alternative solutions.

Such behavior goes beyond any previously-observed behavior with standard LLMs and chain of thought prompting. However, this behavior is not explicitly injected into the model—it is naturally developed via large-scale RL training!

“One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.” - from [2]

During RLVR, the model undergoes a self-exploration process in which it learns how to properly use its long CoT to solve reasoning problems. As evidence of this self-evolution process, we commonly observe during RL training that the average length of the model’s completions increases over time; see below. The model naturally learns how to use more inference-time compute (by generating a longer reasoning trace) in order to solve difficult reasoning problems.

(from [2])

Training stages and Aha moments. As shown below, LLMs undergo training in several stages. However, reasoning models depart from the standard alignment procedure—including supervised finetuning (SFT) and RLHF—by adding an extra RLVR training stage. Additionally, it is even common in RL research to use an RL-Zero setup in which we directly train the pretrained base model with RLVR.

The RL-Zero setup was popularized by DeepSeek-R1 [2], which showed that reasoning capabilities can be instilled in an LLM via pure RL (using GRPO) even with no SFT. Most notably, DeepSeek-R1-Zero—the version of DeepSeek-R1 that is trained with an RL-Zero setup—is found in [2] to have an “Aha moment” in which it learns to invest additional reasoning effort into re-thinking or evaluating its own responses inside of the reasoning trace; see below. This behavior emerges at an intermediate point in RL training and is a classic example of how self-exploration via RL can naturally lead an LLM to develop sophisticated reasoning behavior.

(from [2])

Proximal Policy Optimization (PPO)

GRPO is based on the Proximal Policy Optimization (PPO) algorithm [11]. PPO was used in seminal work on RLHF and, as a result, was the default RL optimizer in the LLM domain for some time. Only after the advent of reasoning models did alternative algorithms like GRPO begin to gain traction for training LLMs. A full overview of PPO is linked below, but we will cover the key details in this section.

The structure of training with PPO is outlined below. As we can see, each training iteration of PPO goes through the following sequence of steps:

Sample a diverse batch of prompts.
Generate a completion from the policy for each prompt.
Compute advantage estimates for each completion.
Perform several policy updates over this sampled data.

(from [11])

Surrogate objective. In PPO, we formulate a loss function (also called the surrogate objective) that is optimized with respect to the parameters of our policy. The PPO loss function is based on the policy ratio (also called the importance ratio) between the current and “old” (i.e., before the first update in a training step) policies. The importance ratio stabilizes the training process by comparing the new policy’s token probabilities to the old policy and applying a weight (or importance) to training that helps to avoid drastic changes; see below.

Policy or importance ratio

To derive the surrogate objective for PPO, we begin with an unclipped objective that resembles the surrogate objective used in Trust Region Policy Optimization (TRPO); see below. Additionally, we introduce a clipped version of this objective by applying a clipping mechanism to the policy ratio r_t(θ). Clipping forces the policy ratio to fall in the range [1 - ε, 1 + ε]. In other words, we avoid the policy ratio becoming too large or too small, ensuring that the token probabilities produced by the current and old policies remain relatively similar.

The PPO surrogate objective

The PPO surrogate objective is simply the minimum of clipped and unclipped objectives, which makes it a pessimistic (lower bound) estimate for the unclipped objective. The behavior of the clipping mechanism in the surrogate loss changes depending on the sign of the advantage. The possible cases are shown below.

(from [11])

As we can see, taking the minimum of clipped and unclipped terms in the surrogate objective causes clipping to be applied in only one direction. The surrogate objective can be arbitrarily decreased by moving the importance ratio away from one, but clipping prevents the objective from being increased beyond a certain point by limiting the importance ratio. In this way, the clipping in PPO disincentivizes large policy ratios and, in turn, maintains a trust region by preventing large policy updates that could potentially damage our policy.

“We only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.” - from [11]

KL divergence. We often incorporate a KL divergence between the current policy and a reference policy—usually the model from the beginning of training—into RL. The KL divergence serves as a penalty that encourages similarity between the current and reference policies. We compute the KL divergence by comparing token distributions from the two LLMs for each token in a sequence. The easiest—and most common—way to approximate KL divergence [12] is via the difference in log probabilities between the policy and reference; see here.

After the KL divergence has been computed, there are two primary ways that it can be incorporated into the RL training process:

By directly subtracting the KL divergence from the reward.
By adding the KL divergence to the loss function as a penalty term.

PPO adopts the former option by subtracting the KL divergence directly from the reward signal used in RL training, as shown in the equation below.

Adding KL divergence to the reward in PPO

Advantage estimation. The advantage function, a key part of PPO’s surrogate objective, is the difference between the action-value and value function: A(s, a) = Q(s, a) - V(s). The value function in PPO is estimated with a learned model called the value model (or critic). This critic is a separate copy of our policy, or—for better parameter efficiency—an added value head that shares weights with the policy. The critic takes a completion as input and predicts expected cumulative reward on a per-token basis using an architecture that is similar to that of a reward model (i.e., transformer with a regression head); see below.

The value function is on-policy—it depends on the current parameters of our policy. Unlike reward models, which are fixed at the beginning of RL training, the critic is trained alongside the LLM to keep its predictions on-policy—this is known as an actor-critic setup. To train the critic, we add an extra mean-squared error (MSE) loss term—between the rewards predicted by the critic and the actual rewards—to the PPO loss. Using the critic, we can estimate advantage using Generalized Advantage Estimation (GAE). The details of GAE are beyond the scope of this post, but a full explanation and implementation can be found here.

Group Relative Policy Optimization (GRPO)

(from [13])

Group Relative Policy Optimization (GRPO) [13] builds upon PPO by proposing a simpler technique for estimating the advantage. In particular, GRPO estimates the advantage by sampling multiple completions—or a “group” of completions—for each prompt and using the rewards of these completions to form a baseline. This group-derived baseline replaces the value function, which allows GRPO to forgo training a critic. Avoiding the critic drastically reduces GRPO’s memory and compute overhead compared to PPO. Additionally, since GRPO is commonly used for reasoning-oriented training, we typically pair it with verifiable rewards, which eliminates the need for a separate reward model.

“We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.” - from [13]

Advantage estimation in GRPO is performed by sampling multiple completions for each prompt and using the formulation shown below. This approach is very simple compared to PPO, which uses a learned value model and GAE.

Advantage computation in GRPO

In GRPO, completions to the same prompt form a group, and we calculate the advantage relative to other rewards in the group—hence, the name “group relative” policy optimization! More specifically, the advantage for completion i is calculated by first subtracting the mean reward over the group from r_i, then dividing this difference by the standard deviation of rewards over the group. The GRPO loss is assigned on a per-token basis, but we should note that the above formulation assigns the same advantage to every token t in completion i. The per-token loss is therefore dictated by the policy ratio, which varies for each token.

“GRPO is often run with a far higher number of samples per prompt because the advantage is entirely about the relative value of a completion to its peers from that prompt.” - RLHF book

Because we compute the advantage in a relative manner (i.e., based on rewards in the group), the number of completions we sample per prompt must be high to obtain a stable policy gradient estimate. Unlike GRPO, PPO and REINFORCE typically sample a single completion per prompt. However, sampling multiple completions per prompt has been explored by prior RL optimizers like RLOO.

Surrogate loss. Despite estimating the advantage differently, GRPO uses a loss function that is nearly identical to that of PPO. As shown below, GRPO uses the same clipping mechanism that is used by PPO for the importance ratio. This expression assumes an MDP formulation and has been modified to explicitly aggregate the loss over multiple completions within a group.

GRPO surrogate loss

KL divergence. One key difference between PPO and GRPO is the KL divergence term being added as a penalty term to the surrogate loss, rather than subtracted from the reward. However, we should note that the KL divergence is frequently omitted when training reasoning models. In the context of RLHF, KL divergence enables model alignment without diverging significantly from the initial model, but this approach makes less sense when training long CoT reasoning models. The model’s behavior may diverge significantly from the initial model as it develops the ability to perform long CoT reasoning. All of the work that we will study in this overview omits the KL divergence term during RL training.

Limitations of vanilla GRPO. For a full overview of GRPO, please see the above link. As we have seen, GRPO is a relatively simple algorithm. The popularity of GRPO was catalyzed by its use for training DeepSeek-R1 [2]. The openness of this work led GRPO to be adopted in open replications of reasoning models, as well as countless other research efforts. Despite its popularity, vanilla GRPO has several issues that become especially pronounced in large-scale RL training runs:

Noise and instability during the training process.
Excessive response lengths, especially in incorrect answers.
Collapse of the LLM’s entropy (i.e., reduced exploration).
Poor sample efficiency and slow learning.

Due to these issues, many open research efforts initially struggled to replicate the results reported by DeepSeek-R1 [1, 3], indicating that some details necessary to achieve peak performance with GRPO may have been omitted from [2]. This overview will study various works that have diagnosed such issues with GRPO, uncovering a set of practical tricks that can be used to train better reasoning models at scale.

Assessing the Health of RL Training

Despite the recent success of reasoning models, we must remember that training LLMs via RL is a complex process with many moving parts. We are working with multiple disjoint systems to train the model, each of which has unique settings that must be tuned. As described below, even simple changes to the RL training process can yield unexpected results or completely derail the model. When issues occur, it can be hard to know exactly what went wrong, and the high cost of RL training can make debugging these issues slower and more difficult. To quickly identify issues and iterate on our RL training setup, we need intermediate metrics that allow us to efficiently monitor the health of the training process.

“Reinforcement learning on large language models is… an intrinsically complex systems engineering challenge, characterized by the interdependence of its various subsystems. Modifications to any single subsystem can propagate through the system, leading to unforeseen consequences due to the intricate interplay among these components. Even seemingly minor changes… can amplify through iterative reinforcement learning processes, yielding substantial deviations in outcomes.” - from [1]

Health checks. The key training and policy metrics that can be monitored to catch issues with our RL setup are as follows:

Response length should increase during reasoning RL as the policy learns how to effectively leverage its long CoT. Average response length is closely related to training stability, but response length does not always monotonically increase—it may stagnate or even decrease. Excessively long response lengths are also a symptom of a faulty RL setup.
Training reward should increase in a stable manner throughout training. A noisy or chaotic reward curve is a clear sign of an issue in our RL setup. However, training rewards do not always accurately reflect the model’s performance on held-out data—RL tends to overfit to the training set.
Entropy2 of the policy’s next token prediction distribution serves as a proxy for exploration during RL training. We want entropy to lie in a reasonable range—not too low and not too high. Low entropy means that the next token distribution is too sharp (i.e., all probability is assigned to a single token), which limits exploration. On the other hand, entropy that is too high may indicate the policy is just outputting gibberish. Similarly to entropy, we can also monitor the model’s generation probabilities during RL training.
Held-out evaluation should be performed to track our policy’s performance (e.g., average reward or accuracy) as training progresses. Performance should be monitored specifically on held-out validation data to ensure that no reward hacking is taking place. This validation set can be kept (relatively) small to avoid reducing the efficiency of the training process.

An example plot of these key intermediate metrics throughout the RL training process from DAPO [1] is shown below. To iterate upon our RL training setup, we should i) begin with a reasonable setup known to work well, ii) apply interventions to this setup, and iii) monitor these metrics for positive or negative impact. We will see many examples of such a workflow throughout this overview as we study various tweaks and improvements to the vanilla GRPO algorithm.

“We typically use length in conjunction with validation accuracy as indicators to assess whether an experiment is deteriorating… the trend of reward increase [should be] relatively stable and does not fluctuate or decline significantly due to adjustments in experimental settings… we find that maintaining a slow upward trend in entropy is conducive to the improvement of model performance.” - from [1]

(from [1])

A note on batching and data. Prior to making algorithmic changes to GRPO, we should focus on the correctness of our data. GRPO needs (relatively) large batch sizes to work well. Using a small batch size in GRPO is one of the most common mistakes in RL training. To avoid this mistake, we should begin with a reasonable batch and group size (e.g., Olmo 3 [5] uses a batch size of 512 with 64 prompts and 8 rollouts per prompt) and test how varying the batch and group sizes impacts the metrics discussed above. For example, if a larger batch size makes our reward curve much more stable, then our initial batch size was probably too small.

As shown in recent RL research [9, 10], curating the correct set of prompts is also essential. More specifically, we want our data to be diverse in terms of topic and difficulty. For example, Olmo 3 [5] incorporates several domains—math, coding, instruction following, and general chat—into RL training and uses offline difficulty filtering to filter out prompts that are too easy or too difficult. Using another LLM to gauge prompt difficulty by measuring Pass@K performance is also a common filtering approach [9]. We see each data point multiple times during RL training, so data curricula3 are less relevant. To make the most of our data, we simply want to ensure sufficient quality and diversity!

“Where algorithmic changes can make models more robust to less balanced data, a crucial part of the current RL training is to have a diversity of difficulties in your data. With large batch sizes, the model should have questions that are trivial, somewhat challenging, and nearly impossible in each batch.” - Nathan Lambert

As a final note, certain categories of questions—specifically those that are easily guessable without any true reasoning—can damage the fidelity of RL training. For example, multiple choice questions can easily be reward hacked if the policy randomly guesses an answer to each question. Therefore, removing this style of easily-guessable questions from RL training is a common practice.

Improving upon Vanilla GRPO

Now that we understand GRPO, we will learn about recent research that has identified (and solved) problems with the vanilla GRPO algorithm. Given the popularity of GRPO, many papers have been published on this topic. We will aim to review this work in a way that is both comprehensive and of sufficient depth. The section will begin with longer overviews of a few popular papers. After the longer overviews, we will provide a wider outline of the topic via shorter paper summaries and an exhaustive list of recent and notable publications.

DAPO: An Open-Source LLM Reinforcement Learning System at Scale [1]

Despite the impressive recent results achieved with reasoning models, many details needed to reproduce these results are concealed. In fact, even open models like DeepSeek-R1 [2] do not provide sufficient technical details to fully reproduce their results. A naive application of GRPO with Qwen-2.5-32B achieves a score of 30% on AIME 2024, underperforming the score of 47% achieved in the DeepSeek-R1 technical report. This difficulty in reproducing the results of DeepSeek-R1 hints at missing details that are necessary for stable, performant, and scalable RL.

“The broader community has encountered similar challenges in reproducing DeepSeek’s results suggesting that critical training details may have been omitted in the R1 paper that are required to develop an industry-level, large-scale, and reproducible RL system.” - from [1]

In [1], authors aim to discover these missing details, arriving at four key changes to the vanilla GRPO algorithm that—when applied in tandem—match and surpass results observed in [2]. The modified GRPO algorithm derived in [1] is called the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm. All code (based on verl) and data are openly released to support future research.

Vanilla GRPO. When running the vanilla GRPO algorithm, authors notice several issues in the training process, including:

Entropy collapse: the entropy of the model’s next token distribution collapses during the training process. Probability mass is primarily assigned to a single token and outputs are more deterministic.
Reward noise: the training reward is very noisy and does not steadily increase during the RL training process.
Training instability: the training process is unstable and may diverge. We do not observe a steady increase in response length during training.

To mitigate these issues, authors propose the following four solutions in [1].

(1) Clip higher. As mentioned previously, authors in [1] observe entropy collapse when training models with vanilla GRPO; see below. When entropy declines, the next token distribution becomes concentrated on a single token, leading sampled responses in a group to be very similar. As a result, exploration becomes limited and the advantage computation in GRPO becomes less reliable—each sample in the group will tend to receive the same reward, making group normalization difficult.

(from [1])

Interestingly, we see in [1] that this entropy collapse is caused by the clipping operation in PPO and GRPO. To see why this occurs, let us consider two kinds of tokens to which clipping could be applied:

Exploitation token: a token that is already highly likely in the current policy.
Exploration token: a low probability token in the current policy.

Sampling lower probability tokens gives the model a chance to explore alternative tokens when searching for better completions. Clipping is applied to the policy ratio, or the ratio of a token’s probability after and before the policy update:

The policy ratio is constrained to a range of [1 - ε, 1 + ε]. This upper bound allows high probability (exploitation) tokens to become more probable, but it restricts increases in low probability (exploration) tokens. A concrete example of how the upper clipping bound can discourage exploration is explained below.

“When ε = 0.2 and [advantage is positive], consider two actions with probabilities π_old(a_t|s_t) = 0.01 and 0.9. The upper bounds of the increased probabilities π_θ(a_t|s_t) are 0.012 and 1.08, respectively (i.e., π_old·(1 + ε)). This implies that exploitation tokens with a higher probability (e.g., 0.9) are not constrained to get even extremely larger probabilities like 0.999. Conversely, for low-probability exploration tokens, achieving a non-trivial increase in probability is considerably more challenging.” - from [1]

The “clip higher” approach, which decouples the lower and upper bound for clipping, is proposed as a solution to this problem. Specifically, we clip in the range [1 - ε_low, 1 + ε_high], where ε_low = 0.2 (default setting) and ε_high = 0.28 in [1]. As shown in the figure above, increasing ε_high prevents entropy collapse and improves GRPO performance. On the other hand, authors note that ε_low should not be increased, as this would suppress some tokens to a probability of zero and collapse the token sampling space.

Ratio of samples with perfect accuracy throughout RL training (from [1])

(2) Dynamic Sampling. Throughout the course of RL training, the number of samples for which all completions in a group are correct naturally increases; see above. Although this trend indicates that the model is improving, prompts with perfect accuracy are problematic for GRPO. If all completions in a group are correct (i.e., reward of one), then the advantage for each completion in the group and the corresponding policy gradient are zero. As a result, our batch size effectively becomes smaller because there are many elements in the batch with zero gradient—leading to a noisier batch gradient and, in turn, degraded sample efficiency. To solve this issue, we can perform dynamic sampling, which simply:

Over-samples prompts for each batch.
Filters or removes all prompts with perfect accuracy.

The sampling cost per batch is dynamic—hence the name “dynamic sampling”—and we simply continue sampling and filtering until we have a full batch. However, this additional sampling cost is typically offset by the improved sample efficiency of the algorithm. Put differently, the model tends to converge much faster when we filter out prompts with perfect accuracy (i.e., dynamic sampling); see below.

(from [1])

(3) Token-level loss. The GRPO surrogate objective is computed at a token level, but we must aggregate this objective over the batch before computing the policy update. This aggregation is performed at the sample level, as described below.

“The original GRPO algorithm employs a sample-level loss calculation, which involves first averaging the losses by token within each sample and then aggregating the losses across samples. In this approach, each sample is assigned an equal weight in the final loss computation.” - from [1]

When aggregating at the sample level, each sample in the batch is assigned an equal weight in the GRPO loss. Although this approach may seem reasonable, it creates a subtle bias in our GRPO implementation—tokens within long responses have a disproportionately lower contribution to the loss. More specifically, a sample receives equal weight in the GRPO loss no matter its length. The contribution of each individual token is determined by its impact on the average loss for the sequence. Given that longer sequences contain a larger number of tokens, the impact of an individual token is muted when it exists in a longer sequence.

This length bias makes learning from high-quality, longer samples—or punishing patterns in low-quality samples—difficult in vanilla GRPO. As evidence of this bias, we often see that excessively long samples tend to contain noticeable artifacts like repeated words or gibberish. Luckily, this problem has an easy solution: we can just aggregate the loss via an average over all tokens, thus weighting the contribution of each token equally. As shown below, this modification has a clear impact on the health and stability of RL training, where we can observe a stable increase in the model’s entropy and response length throughout training.

(from [1])

(4) Overlong reward shaping. The final improvement proposed for GRPO in [1] is related to the handling of truncated samples. During RL training, we usually impose a maximum generation length for rollouts to improve efficiency, but the policy does not always adhere to this maximum length. In some cases, the policy will attempt to generate a sample that is too long, and we will have to truncate this sample to the maximum length. The default response to this behavior in RL is punishment—we simply provide a negative reward for any truncated samples.

Interestingly, authors in [1] show that how we shape this punitive reward for truncated samples is important and can lead to training instability if handled incorrectly. For example, what if the policy’s reasoning process was totally valid but just too long? Assigning a negative reward to such a case could confuse the model. To test this theory, authors perform an experiment in which truncated samples are masked—meaning they have no contribution to the policy update—in the GRPO loss instead of being negatively reinforced. As shown in the figure below, this overlong filtering strategy improves both performance and training stability.

(from [1])

Additionally, a length-aware penalty is proposed that assigns a soft punishment to truncated samples. In particular, we define both a maximum generation length (L_max) and a cache length (L_cache), which together form the punishment interval [L_max - L_cache, L_max]. Any generation that exceeds L_max tokens in length will receive a maximum penalty of -1, while any generation less than L_max - L_cache tokens in length will have no penalty. Within the punishment interval, however, the negative reward is dynamically adjusted based on the length of the sample; see below. This soft overlong punishment is directly added to the verifiable reward in GRPO. A maximum length of 16K tokens and cache length of 4K tokens are used for DAPO experiments in [1].

Soft overlong punishment formulation (from [1])

The full DAPO algorithm, which combines the four modifications described above, is formulated by the algorithm and objective function provided below.

(from [1])

Experiments in [1] are conducted with the DAPO-Math-17K dataset4, which contains 17K prompts. The dataset is purposely curated so that answers are formatted as integers, making parsing and verification simple. Experiments are only performed in the math domain, but this is a common approach for evaluating algorithmic changes in RL. Due to the high cost of experimentation, researchers frequently use math RL as a testbed and assume that most findings will translate reasonably well to other domains. The Qwen-2.5-32B base model is selected to match the RL-Zero training setup of DeepSeek-R1 [2]. As shown below, accuracy on AIME increases from 0% to 50% after training with DAPO, exceeding the 47% accuracy achieved in [2].

(from [1])

This performance is achieved using only half of the training steps required to train DeepSeek-R1-Zero-Qwen-32B, showcasing the improved sample efficiency of DAPO. In contrast, vanilla GRPO achieves an accuracy of only 30% on this benchmark. All four DAPO modifications are shown to clearly benefit final performance; see below. Although we see the smallest accuracy boost from the token-level loss, this modification makes the training process more stable. The improved health of the RL training process with DAPO is evidenced by stable increases in average response length, entropy, and training reward.

(from [1])

Understanding r1-zero-like training: A critical perspective [3]

When performing RL-Zero-style training (i.e., RL training applied directly to a base model), there are two key aspects of our training setup to consider:

The base model.
The RL training setup.

In [3], authors perform a deep investigation into these two aspects to better understand i) the impact of pretraining on performance after RL and ii) the dynamics of the RL training process in general. This investigation uncovers several interesting properties of base models that are commonly used in open RL recipes. Additionally, several biases are discovered in the GRPO loss formulation that are shown to degrade training stability and artificially inflate the length of incorrect responses. As a solution, authors propose GRPO done right (or Dr. GRPO), which uses a different advantage formulation and modified loss aggregation strategy to improve stability and address biases in GRPO.

(from [3])

Base models. Several pretrained base models are tested in [3]—with a focus upon Qwen-2.5 (i.e., commonly used in open RL-Zero recipes) and DeepSeek-V3-Base (i.e., the original base model used for DeepSeek-R1-Zero [2])—by analyzing their responses to a set of 500 questions from MATH. The results of this analysis are summarized in the figure above and focus on two major questions:

Can we elicit better reasoning skills by changing the template used for prompting the base model?
Do base models already exhibit reasoning and self-reflection behaviors (i.e., the “Aha moment” of DeepSeek-R1) prior to RL training?

(1) Templates. Base models are trained using next token prediction and have not yet undergone any alignment. As a result, these models struggle with instruction following, making the exact template used for prompting the model important. To better understand how the selected prompt template influences base model performance, three different styles of templates are tested in [3]; see below.

(from [3])

To determine the template that is most suitable for each model, GPT-4o-mini is used to assess whether questions are answered with the correct output format5. As shown in the figure below, the choice of template significantly influences model performance, but the most suitable template varies by model. For example, Qwen-2.5 models perform best with no template, while DeepSeek-V3 base [14] performs very poorly unless the correct chat template is used.

(from [3])

“Since Qwen2.5 uses chat model’s data (question-answer pairs) during the pretraining stage, we hypothesize that they might pretrain on the concatenated text… If our hypothesis turns out true, we shall be more careful about using Qwen2.5 models to reproduce DeepSeek-R1-Zero, since the base models are already SFT-like without templates.” - from [3]

Using a concatenated question-answer format with no template for Qwen-2.5 models leads to a 60% performance improvement, demonstrating the importance of understanding the unique properties of each base model used for RL training. In the case of Qwen-2.5, these results indicate that the base model was pretrained on concatenated question-answer data. If true, this hypothesis has significant implications for RL-Zero training—the base model has already undergone SFT-like training over question-answer pairs and thus cannot truly be considered an unaligned base model for RL-Zero-style training. However, validating this hypothesis is not possible because Qwen models do not openly disclose their training data.

Using the correct template is beneficial to the base model, but the impact is less pronounced after RL training. Similar performance is reached by most templates after RL training, despite large initial performance differences in the base model; see below. This finding hints that the performance benefits of RL may be more modest than is typically reported—model performance can be artificially deflated prior to RL training based on the exact template being used.

(from [3])

Interestingly, the ability of RL to restore full performance may depend on data coverage. More specifically, if the model and prompt template are aligned well (i.e., meaning the base model initially performs well with that prompt template), we can achieve performance benefits from RL training even on very narrow datasets (e.g., GSM-8K). However, if there is a mismatch between the base model and prompt template being used, the observed performance after RL training may suffer unless a diverse dataset with wider coverage is used; see above.

(from [3])

(2) Reasoning performance. After determining the most suitable template per model, authors also measure the Pass@8 performance using various temperature settings to measure base model exploration capabilities. If a base model cannot generate at least one viable solution among several rollouts, then improving reasoning capabilities with RL will be difficult—the model cannot learn to answer problems correctly via exploration. The results of this test are outlined in the figure shown above, where we see that all models have a non-zero success rate for solving reasoning problems when sampling multiple rollouts. The Qwen-2.5 and DeepSeek-V3 models already demonstrate impressive Pass@8 performance.

“If a base policy cannot even sample a single trajectory that leads to the correct final answer, it is impossible for reinforcement learning to improve the policy because there is no reward signal.” - from [3]

(2.5) Aha moment. The presence of an Aha moment in the RL training process of DeepSeek-R1-Zero [2] was a huge discovery in AI research, as it indicates that sophisticated reasoning behaviors can emerge naturally from RL training. However, researchers have struggled to reproduce this behavior with open models, leading many to question whether self-reflection is truly an emergent property of RL. One popular explanation for these difficulties is that base models may already exhibit self-reflection behavior prior to RL training, leading this behavior to just be emphasized—rather than completely learned—during the RL training process.

“Although self-reflection behaviors occur more frequently in R1-Zero, we observe that these behaviors are not positively correlated with higher accuracy.” - from [3]

To test this theory, authors in [3] analyze DeepSeek-V3-Base for patterns of self-reflection on the MATH dataset. This analysis reveals that the base model already uses self-reflection in a large number of queries; see below. We can find from simple keyword searches that the model outputs many “Aha” or “wait” tokens, revealing that self-reflection behavior may not be purely developed via RL. Interestingly, RL training does increase the frequency of self-reflection in the model’s output, but this behavior is not found to measurably improve performance.

(from [3])

GRPO biases. In addition to analyzing properties of base models, authors in [3] point out a few problematic biases in GRPO, as well as recommend a modified algorithm—called GRPO Done Right (or Dr. GRPO)—to fix these biases. When an LLM is trained using vanilla GRPO, we usually observe a clear increase in the model’s average response length throughout training. Such increasing response length is usually attributed to the development of long CoT reasoning abilities.

(from [3])

Going against common intuition for reasoning models, however, we see in [3] that this increase in response length is partially attributable to fundamental biases in the GRPO objective function. In fact, we even see in [3] that GRPO continues to increase response length after rewards begin to plateau; see above. Additionally, output lengths become noticeably longer for incorrect responses throughout the course of training, revealing a bias towards artificially inflating response lengths in GRPO. Specifically, there are two key biases that exist in the GRPO objective:

Response-level length bias: GRPO normalizes the summed loss of tokens in each sequence by the total number of tokens in that sequence, leading to biased gradient updates based on the length of each response.
Question-level difficulty biases: the standard deviation term in the denominator of the advantage formulation in GRPO causes the advantage to become very large for questions that are either too easy (i.e., most responses have a reward of one) or too hard (i.e., most responses have a reward of zero)6.

(from [3])

Response lengths vary during RL training, so the loss is normalized dynamically based on the length of each sequence. The response-level length bias observed in [3] matches findings in [1] that motivated the use of a token-level loss to avoid sequence lengths influencing each token’s contribution to the loss. Normalizing the GRPO loss on a sequence level leads to larger gradient updates for shorter responses—or smaller gradient updates for longer responses—when the advantage is positive. When advantage is negative, however, long responses are penalized less, leading longer responses to be preferred among incorrect outputs.

(from [3])

Put differently, GRPO biases the model towards overthinking by using more tokens for incorrect answers! To avoid the length bias from sequence-level aggregation, we can divide the sum of losses in each sequence by a fixed constant rather than the total number of tokens in the sequence; see above for an example implementation.

Dr. GRPO is a modified version of GRPO proposed in [3] to fix the biases outlined above. Compared to vanilla GRPO, Dr. GRPO makes two key modifications:

Normalizing the summed loss of each sequence by a fixed constant, rather than by the number of tokens in the sequence.
Removing the standard deviation term from the denominator of the advantage formulation.

Dr. GRPO is formulated below, where we see that the loss is not normalized by sequence length. The loss is instead divided by the MAX_TOKENS constant, as shown in the above code snippet. Additionally, the advantage is computed by subtracting the group-level mean of rewards from the reward for each completion (i.e., no division by standard deviation). These changes are found to mitigate the aforementioned biases and yield models that perform better on a per-token basis—better performance is achieved while outputting fewer tokens on average; see below.

(from [3])

Experiments. Dr. GRPO is implemented using the Oat framework and is released openly. Models are trained using the MATH dataset and evaluated on a variety of benchmarks, including OlympiadBench, AIME 2024, AMC, Minverva Math, and Math500. Rewards are derived based on correctness (i.e., correct responses receive a reward of one, while incorrect responses receive a reward of zero) using Math Verify. When used to train the Qwen-2.5-Math-7B model (with the Qwen-Math prompt template), the simple Dr. GRPO RL-Zero recipe achieves 43.3% accuracy on AIME 2024, which is state-of-the-art for a model of this scale; see below.

(from [3])

The training process for this model completes in ~27 hours on only eight A100 GPUs. Such a lightweight training setup is useful for research, as one can quickly iterate upon changes to the RL training process. The key findings from [3] are summarized in the figure below. Beyond the observed properties of base models and reported benefits of Dr. GRPO, authors in [3] find that continued, domain-specific pretraining is helpful for RL. Specifically, continually pretraining the Llama-3.2-3B model on math-specific data prior to RL-Zero training noticeably raises the model’s performance ceiling during the RL training process.

(from [3])

Your Efficient RL Framework Secretly Brings You Off-Policy RL Training [4]

During RL training, we alternate between two key operations:

Rollouts: given a set of prompts, sample multiple completions for each prompt using the current LLM.
Policy Updates: compute a weight update for the LLM using the sampled rollouts and the given objective function (e.g., from GRPO).

The cost of the RL training process is notoriously high and typically dominated by rollout generation—most of the time in RL is spent waiting for inference to finish7. For example, profiling the RL training process for Olmo 3 [5] reveals that 5-14× more compute is spent on inference compared to policy updates. For this reason, most modern RL training frameworks use separate engines on the backend for generating rollouts and performing policy updates. Specifically, we usually use popular training frameworks like FSDP or DeepSpeed for policy updates, while optimized inference engines like vLLM or SGLang—often with lower precision inference (e.g., int8 or fp8) for added efficiency—are used to generate rollouts.

“In modern RL training frameworks, different implementations are used for rollout generation and model training… We show the implementation gap implicitly turns the on-policy RL to be off-policy.” - from [4]

For simplicity, we will refer to the engines used for sampling rollouts and computing policy updates as the sampler and learner engines, respectively.

Gap between engines. One may naively assume that engine implementations should be similar, but the use of separate sampler and learner engines creates a mismatch in the code used for rollouts and policy updates. Even when engines share the same exact model parameters, the token probabilities that they predict can differ significantly; see below. In the worst case, token probabilities are completely contradictory between the two engines, meaning that the learner would not have generated the same completion as the sampler. In this case, the RL training process actually becomes off-policy, thus degrading performance.

Difference in token probabilities created by the mismatch between sampler and learner engines (from [4])

We must address this implementation gap for RL training to be truly on-policy. To accomplish this, we could (obviously) take an engineering-centric approach—just find and eliminate implementation differences so that the two engines yield identical token probabilities. In [4], authors take this approach by identifying problem areas that contribute to differences in token probabilities, but the implementation gap still exists even after patching several issues in the engine code; see above.

To fully eliminate this implementation gap, we must chase down an even larger number of subtle issues that exist, such as precision differences throughout parts of the model or deviations in sampling code. Identifying and removing all of these bugs is a tedious engineering process that must be repeated any time a new (or even slightly modified) engine is used for RL. Going further, even if all of these issues are addressed, the LLM inference process is still fundamentally non-deterministic. As a result, the engine gap can be minimized but not fully removed. For these reasons, an engineering-centric solution, though conceptually simple, is resource-intensive and difficult to achieve in practice.

Importance Sampling. Authors in [4] propose an algorithmic approach based on importance sampling for addressing the engine mismatch in RL. Formally, importance sampling is a statistical method used to estimate properties (e.g., an expectation) of a target probability distribution f(x) by sampling from a proposal distribution g(x). Usually, sampling from g(x) is much cheaper than sampling from f(x), which is the motivation for importance sampling.

(source)

In other words, if sampling from f(x) is difficult, we can instead choose to draw samples from g(x) and just correct for the discrepancy between f(x) and g(x) by weighting each sample by the importance ratio f(x) / g(x); see above. This concept can be directly applied in the context of RL! Namely, we can denote the token probabilities from our learner and sampler as f(x) and g(x), respectively. From our prior discussion, we know that:

Sampling from g(x) is much more efficient relative to f(x).
There is a discrepancy between these two distributions.

Therefore, importance sampling can be directly used to correct for this mismatch.

“When direct Monte Carlo estimation of the expected value under a target distribution is difficult, importance sampling allows us to sample from an alternative distribution instead. In our case, the target distribution is π_learner, but it is extremely slow to sample from. Using a separate backend (e.g., vLLM) for rollout generation means that we are sampling from π_sampler instead. The discrepancy is then corrected by weighting each sample with an importance ratio.” - from [4]

Truncated Importance Sampling (TIS) for RL. To understand how importance sampling can be practically implemented in the context of RL training, let’s begin with the most basic expression for a policy gradient; see below.

Basic policy gradient expression

In practice, the policy gradient that we can compute looks slightly different from this, as we are not using the same policy for sampling the rollout and computing the policy gradient. Rather, the actual expression we will use is shown below, where separate engines are used for the rollouts and policy gradient.

Basic policy gradient expression with different engines and TIS

As shown above, importance sampling operates by weighting the policy gradient by the importance ratio f(x) / g(x). For RL training, the importance ratio is computed as π_learner / π_sampler (i.e., the ratio of token probabilities from the learner and sampler engines). To make the policy update more stable, authors in [4] adopt truncated importance sampling (TIS), which simply caps the importance ratio at a maximum value of ρ. The policy gradient is not changed much—we just scale the gradient expression by the (truncated) importance ratio.

“While there has been extensive study on how to design a stable and effective importance sampling, in practice we find it usually sufficient to use a classical technique, truncated importance sampling.” - from [4]

We formulate TIS with a basic policy gradient expression above, but extending this idea to other RL optimizers is straightforward. In particular, we can just:

Take the policy gradient expression for our RL optimizer of choice.
Scale the new policy gradient expression by the same importance ratio.

For example, we can apply TIS to GRPO or PPO as shown below8. After computing the policy gradient, we can multiply this gradient by the (truncated) importance ratio. We still scale the policy gradient by the importance ratio, but we substitute our standard policy gradient expression with that of GRPO.

Policy gradient with TIS for GRPO or PPO (from [4])

Does TIS work? To determine whether TIS solves the mismatch problem, authors in [4] first conduct experiments using Qwen-2.5-32B with DAPO [1] on the DAPO-Math-17K dataset. Due to resource limitations, RL training is stopped after 250 iterations, but these initial iterations can be used to analyze the properties of the training process. An early stopping approach is commonly used to efficiently test interventions to the RL training process. As shown below, we see a clear boost in performance when TIS is used in DAPO—TIS benefits performance significantly. Additionally, we see that similar performance cannot be achieved by addressing implementation gaps between engines (i.e., an engineering-centric approach).

(from [4])

Quantized rollouts, which refers to sampling rollouts in a lower numerical precision (e.g., fp8 or int8 instead of bf16), can be used to study the impact of the distribution gap between sampler and learner engines. We can increase this gap by lowering the precision used for generating rollouts. To test the impact of increasing the mismatch in this way, a basic GSM8K setup is used in [4], where rollouts are sampled using either bf16 or int8 precision.

Using lower precision is shown in [4] to increase the maximum difference in token probabilities from ~0.4 to ~1.0, thus confirming that quantized rollouts do measurably increase the gap between the sampler and learner. As shown below, performing regular PPO training with quantized rollouts results in noticeable performance deterioration. By using TIS, we can mitigate this issue and match the performance of the higher precision (bf16) training setup; see below.

(from [4])

Analyzing the impact of quantized rollouts further, we see in [4] that experiments using int8 rollouts i) show clear signs of entropy collapse and ii) produce models with abnormally long average response lengths. Both observations indicate poor health in the RL training process. Entropy collapse is not observed when using bf16 rollouts, revealing that the RL training process is negatively impacted by the mismatch introduced by quantized rollouts. However, using TIS is also found to effectively address the mismatch and reverse these observations; see below.

(from [4])

Although the bf16 training setup is already stable, using TIS even with bf16 rollouts is found to further improve entropy values, which can allow the model to explore more during RL; see above. Generally, TIS should not provide much of a benefit when the mismatch between sampler and learner engines is small—the importance ratio in these cases is ~1.0 and the objective becomes identical to standard PPO or GRPO. However, TIS does not deteriorate performance in these cases and can still yield some benefits, as shown in the case with bf16 rollouts.

What causes the gap? To conclude their analysis, authors in [4] study practical choices that can worsen the sampler-learner gap in RL. To quantify the size of the gap, token-level probability mismatch is measured per response—either using the mean or maximum difference across tokens in the response—over a set of 512 prompts from DAPO-Math-17K. From this analysis, we learn that:

Mean mismatch tends to stay the same between most implementations—the largest impact is observed in terms of maximum mismatch. In other words, large sampler-learner gaps are characterized by a noticeable increase in the maximum token probability discrepancy across sequences.
Differences in parallelism strategies significantly increase the mismatch (e.g., sequence parallelism in the learner and tensor parallelism in the sampler).
Using the same parallelism strategy with different settings (e.g., tensor parallelism with 2 versus 4 GPUs) is less problematic compared to using different distribution strategies altogether.
Using longer rollouts in RL tends to increase the sampler-learned gap.
Using different sampler backends (e.g., vLLM, SGLang, or SGLang with deterministic kernel) does not impact the sampler-learner gap.

“Responses capped at 20K tokens exhibit a higher maximum mismatch than those capped at 4K… the mean mismatch remains similar across both settings… longer sequences provide more opportunities for a single, large probability divergence, even when the average per-token difference remains stable.” - from [4]

Beyond the factors mentioned above, there are other choices that may impact the sampler-learner gap but are not deeply analyzed in [4]. For example, dense models exhibit different levels of mismatch compared to Mixture-of-Experts (MoE) models, while base models tend to have a smaller mismatch compared to models that have already been post-trained. Additionally, the mismatch can fluctuate depending upon characteristics of our data (e.g., difficulty or domain).

More Tweaks: GSPO, GMPO, CISPO and Beyond

We have now learned about the most popular GRPO modifications that have been recently proposed, but there are still many other useful papers in this space. This section will provide a wider overview of such work with links to further reading.

“Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.” - from [6]

Group Sequence Policy Optimization (GSPO) [6] is a modified version of GRPO that yields benefits in terms of stability and efficiency, especially for MoE models. The GSPO algorithm was used for training Qwen 3 models, which are (at the time of writing) the most performant and widely used open weight models. The key idea behind GSPO is changing the loss to operate at the sequence level instead of the token level. Most LLMs are trained using outcome rewards, meaning the reward is assigned at the sequence level. Assuming a single outcome reward, GRPO assigns the same advantage to every token in a sequence. Despite using outcome supervision, however, the surrogate loss in GRPO defines a per-token policy (or importance) ratio that scales the gradient of each token; see below.

Token-level importance and sequence-level advantage in the GRPO loss

In this standard formulation of the surrogate objective in GRPO, there is a misalignment between how the model is optimized—on the token level—and how rewards are assigned—on the sequence level. Using token-level importance ratios increases the variance of the policy gradient and can lead to training stability issues in large-scale RL runs. To avoid these issues, GSPO instead computes the importance ratio on the sequence-level, which aligns naturally with the reward structure used for LLMs and improves training stability; see below.

GSPO loss function (from [6])

The importance ratio is computed using the probability of the entire sequence, and we apply clipping to this sequence-level importance ratio. By doing this, we apply a stable sequence-level weight to all tokens, rather than introducing token-level importance weights with high variance. Notably, the importance ratio in GSPO is still normalized by the number of tokens in a completion T, ensuring that the ratio does not fluctuate drastically based on the length of a sequence. GSPO also uses the same advantage formulation as GRPO, allowing it to keep the same computational efficiency (i.e., from not using a value model).

(from [6])

When used in experiments, GSPO not only improves training stability, but also offers better sample efficiency and overall performance; see above. The stability of GSPO is found to be especially useful when training large MoE models, such as Qwen3-235B-A22B. In particular, we often experience expert-activation volatility when training MoEs with RL, meaning that a large portion of experts active for a given prompt change or fluctuate drastically after one or more policy updates. This volatility in expert selection can prevent convergence during RL training.

Initially, Qwen 3 models solved this issue via routing replay, which caches the initial experts selected for a prompt and uses these same experts for computing several subsequent policy updates. Routing replay enables convergence of MoE models when trained with GRPO. However, GSPO naturally provides stable RL training for MoEs without the need for any complex workarounds; see below.

(from [1])

Geometric Mean Policy Optimization (GMPO) [7] addresses the same problem observed by GSPO but uses a different approach. During RL training with GRPO, token-level importance ratios can become large in magnitude, creating outlier importance weights that cause training instability. GMPO solves this issue by using a new aggregation strategy for the loss. In GRPO, the loss is aggregated by taking the mean of token-level losses over the sequence. GSPO improves stability by calculating importance ratios at a sequence level (i.e., not the token level). In contrast, GMPO still uses token-level importance ratios, but we aggregate the token-level loss by taking a geometric mean over the sequence; see below.

(from [7])

Because geometric means involve taking roots, they are only defined for non-negative numbers. To get around this, the geometric mean in GMPO is computed over absolute values of token-level losses and multiplied by the sign of the advantage (i.e., either -1 or 1) to ensure correct directionality of the update.

“GMPO is plug-and-play—simply replacing GRPO’s arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers.” - from [7]

Given that arithmetic means are sensitive to outliers, outlier importance ratios during RL training can cause instability in the standard GRPO loss. On the other hand, geometric means are less sensitive to outliers and can, therefore, help to reduce the variance of the policy gradient; see below.

(from [7])

Although GMPO still uses token-level importance ratios and applies clipping at the token level, a wider clipping range is needed relative to GRPO; e.g., authors in [7] use a range of [~0.7, ~1.5] instead of the default [0.8, 1.2] range used by GRPO. To ensure numerical stability, we usually compute importance ratios (and the entire geometric mean) using log probabilities instead of raw probability values. See below for an example—this is a common practical trick used by most PPO-style algorithms9. The clipping range used for GMPO corresponds to clipping the log of the importance ratio within the range [-0.4, 0.4].

Example implementation of the GMPO loss (from [7])

We learn from ablations in [7] that token-level clipping outperforms computing and clipping importance ratios at the sequence level. Importance ratios during RL training lie in a more stable range relative to GRPO as well; see below.

(from [7])

Compared to GRPO, GMPO also has more stable entropy during training, which is a positive sign of exploration. In the math domain, GMPO improves Pass@1 performance by as much as 4% absolute, and the largest performance benefits are observed when training multimodal and MoE models.

Clipped Importance Sampling Weight Policy Optimization (CISPO) [8] is a modified variant of GRPO that is proposed in the MiniMax-M1 technical report and shown to benefit training stability in experiments with large-scale RL. In experiments with PPO and GRPO, authors in [8] observe that “fork” tokens in the model’s reasoning trace (e.g., “aha” or “wait”) are rare and tend to have low probabilities, leading them to be assigned large importance ratios. Unfortunately, these pivotal fork tokens, which play an important role in the LLM’s reasoning process and help to stabilize entropy during training, are usually clipped by the GRPO objective, which eliminates their contribution to the policy update.

In DAPO [1], this issue is addressed via the clip higher approach, which lessens restrictions on policy updates for exploration tokens by increasing the upper bound of clipping in GRPO. However, such an approach is less effective for MiniMax-M1 because 16 policy updates are performed over each batch of data—most standard RL setups perform fewer (~2-4) updates. Usually, the importance ratio will exceed the clipping range after a few policy updates, and tokens with larger ratios will eventually be ignored by all subsequent policy updates. Ideally, we should allow pivotal exploration tokens to contribute to all policy updates.

CISPO uses the same advantage estimation technique as GRPO, but the structure of the objective resembles that of REINFORCE; see below. Unlike REINFORCE, however, token-level losses in CISPO are scaled by a clipped version of the importance ratio. Due to the use of a stop gradient, the importance ratio is treated as a constant that scales each token’s contribution to the overall policy gradient, but it is not backpropagated when computing the gradient.

CISPO loss (from [8])

For PPO and GRPO, tokens that are clipped from the loss receive zero gradient—they have no contribution to the policy update. By treating the importance ratio as a capped constant, CISPO adopts a soft, token-level clipping strategy. Clipped tokens still contribute to the gradient, but their weight is capped at a maximum value, as determined by the clipping mechanism in CISPO. When compared to GRPO and DAPO [1] for training Qwen2.5-32B-Base on math reasoning tasks, CISPO is found to improve both stability and sample efficiency; see below.

(from [8])

More GRPO variants. Given the popularity of reasoning and RL in current LLM research, there are many modified algorithms and practical tweaks that have been proposed in the wake of GRPO. Only a small (though notable!) part of this work has been covered in this overview. To learn more, there are several great posts beyond this overview that have been written on the topic. Additionally, a list of other notable works in the area has been compiled below:

Router-Shift Policy Optimization (RSPO) is an MoE-focused RL algorithm that rescales router logits to improve training stability.
Soft Adaptive Policy Optimization (SAPO) replaces clipping for the policy ratio with a softer gating mechanism to encourage stable policy updates.
Low-Probability Token Isolation (Lopti) reduces the effect of low-probability tokens on the policy gradient and emphasizes parameter updates driven by high-probability tokens to improve the efficiency of RL.
Value-based Augmented Proximal Policy Optimization (VAPO) builds upon work in DAPO to improve RL efficiency via the introduction of value models.
Lite PPO performs an extensive empirical analysis of RL for reasoning, arriving at a critic-free RL algorithm—based upon the vanilla PPO loss—that consistently outperforms GRPO and DAPO. The main idea is to perform token-level loss aggregation and compute the standard deviation from the GRPO advantage over the entire batch instead of the group.
Dynamic Clipping Policy Optimization (DCPO) proposes a dynamic clipping scheme for token-level importance ratios and standardizes rewards across consecutive training steps to avoid cases with zero policy gradients.
Reinforce-Rej proposes a simple scheme—inspired by rejection sampling—that improves RL efficiency by removing entirely correct and incorrect samples during training (similarly to dynamic sampling).

If you are aware of any other works that propose improvements to GRPO, please share them in the comments so that this list can be improved and expanded!

Putting It All Together

“Our TIS fix addresses the distribution mismatch problem rooted in the system level… Such a problem widely exists in RL training frameworks… our fix can be applied irrespective of the specific RL algorithms used.” - from [4]

Throughout the course of this overview, we have seen a wide variety of tips and tricks that can be applied to improve the effectiveness of RL training with GRPO. Despite the breadth of this work, we must remember that these proposals are not mutually exclusive—the most performant RL setups will combine many best practices together. For example, Olmo 3 [5] provides a perfect example of an RL training pipeline that incorporates several techniques from recent research. Specifically, the following set of improvements are adopted for training the Olmo 3 Think reasoning models with GRPO:

Zero Gradient Filtering: prompts for which the entire group of completions or rollouts in GRPO receive the same reward are removed [1].
Active Sampling: to maintain a constant batch size despite filtering zero-gradient examples, additional samples are always available to replace those that are filtered out [1].
Token-Level Loss: the GRPO loss is normalized by the total number of tokens across the batch instead of per-sequence, which avoids instilling a length bias in the loss [1].
No KL Loss: the KL divergence term is removed from the GRPO loss to allow for more flexibility in the policy updates, which is a common choice in recent reasoning research.
Clipping Upper Bound: the upper bound in the PPO-style clipping used by GRPO is set higher than the lower bound to enable larger policy updates [1].
Truncated Importance Sampling (TIS): an extra importance sampling term is added to the GRPO loss to adjust for differences in log probabilities between engines used for training and inference [4].
No Standard Deviation: the standard deviation of rewards in a group is excluded from the denominator of the GRPO advantage calculation [3].

The modified GRPO objective for Olmo 3 is shown below. Compared to vanilla GRPO, we maintain the high-level structure of the loss but i) normalize the objective differently, ii) slightly change the advantage, iii) tweak the upper bound for clipping, and iv) weight the objective using TIS. Plus, there is no need to stop here! RL is a rapidly evolving research domain. We must actively monitor work in this area over time, test new modifications to the GRPO objective, and continually incorporate the tricks that are found to be helpful empirically.

Enhanced GRPO formulation for Olmo 3 (from [5])

New to the newsletter?

Subscribe now

Bibliography

[1] Yu, Qiying, et al. “Dapo: An open-source llm reinforcement learning system at scale.” arXiv preprint arXiv:2503.14476 (2025).

[2] Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).

[3] Liu, Zichen, et al. “Understanding r1-zero-like training: A critical perspective.” arXiv preprint arXiv:2503.20783 (2025).

[4] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL https://fengyao.notion.site/off-policy-rl.

[5] Olmo, Team, et al. “Olmo 3.” arXiv preprint arXiv:2512.13961 (2025).

[6] Zheng, Chujie, et al. “Group sequence policy optimization.” arXiv preprint arXiv:2507.18071 (2025).

[7] Zhao, Yuzhong, et al. “Geometric-mean policy optimization.” arXiv preprint arXiv:2507.20673 (2025).

[8] Chen, Aili, et al. “MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention.” arXiv preprint arXiv:2506.13585 (2025).

[9] Team, Kimi, et al. “Kimi k1. 5: Scaling reinforcement learning with llms.” arXiv preprint arXiv:2501.12599 (2025).

[10] Hu, Jingcheng, et al. “Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.” arXiv preprint arXiv:2503.24290 (2025).

[11] Schulman, John, et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).

[12] Schulman, John. “Approximating KL Divergence.” Online (2020). http://joschu.net/blog/kl-approx.html.

[13] Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).

[14] Liu, Aixin, et al. “Deepseek-v3 technical report.” arXiv preprint arXiv:2412.19437 (2024).

However, preference tuning in general can still play a useful role in modern LLM research; e.g., Olmo 3 Think includes DPO-based preference tuning as part of the post-training pipeline for reasoning capabilities.

Entropy can be computed in a language model as described here. Put simply, entropy looks at the next-token distribution of our LLM and quantifies the amount of entropy that exists in this distribution. In plain English, low entropy means that almost all of the probability is assigned to a single token, while high entropy means that the probability mass is spread across a larger number of tokens.

At the time of writing, curriculum learning for RL (at least with LLMs) is not widely used. Most focus is placed on data composition rather than curriculum. However, this could become an interesting future topic of study.

The DAPO-Math-17K dataset in HuggingFace actually contains ~1.8M rows, but many of these rows are duplicates. These rows are deduplicated in the DAPO code to arrive at a final set of ~17K prompts. Directions for how exactly to deduplicate this dataset properly can be found in these notes.

Only answer format is considered, not the actual correctness of the answer.

Normalizing or whitening advantages is a very common practice improve in RL that is often used to improve training stability. However, rewards are usually normalized over a batch of data, whereas the bias demonstrated in [3] exists on a question-level. Batch-level normalization is consistent across all examples in the batch, but the question-level normalization in GRPO can lead to biased policy updates based on the difficulty of each individual question.

This time can also be dominated by the long tail of completions that have many tokens. Most completions tend to be of short or average length—these may complete quickly when sampling rollouts. However, a much longer amount of time may be spent waiting for a few very long completions to finish. This long tail problem can significantly degrade the efficiency of RL training, especially in a synchronous setup.

We usually express GRPO via the clipped surrogate objective, rather than as a direct policy gradient expression. However, the policy gradient in GRPO is just the gradient of this surrogate objective.

For example, we can see in this implementation of the PPO loss that we compute the importance ratio using log probabilities instead of raw probability values.

Olmo 3 and the Open LLM Renaissance

Cameron R. Wolfe, Ph.D. — Mon, 15 Dec 2025 10:33:23 GMT

(from [1, 5, 11])

As the capabilities of large language models (LLMs) have continued to progress, AI research has generally become less accessible to those outside of frontier labs. Although a variety of open-source LLMs are publicly available, there are two key issues that have consistently impeded progress in open research:

The performance gap between closed and open models.
The prevalence of open-weight models, and the scarcity of fully-open models.

Put simply, most “open” LLMs only publicly release the model’s weights (and sometimes an accompanying technical report). However, these weights are only a shallow snapshot of the model’s training process. To reproduce any component of this training process, more artifacts (e.g., data, code, training recipes, and deeper technical details) are needed. The limitations of open-weights LLMs have caused fully-open LLMs to become more popular, with AI2’s Open Language Model (Olmo) series being one of the most prominent proposals in the space. In this post, we will provide a comprehensive and understandable overview of Olmo 3 [1]—the most recent release in the Olmo series and top-performing fully-open LLM.

“We introduce Olmo 3, a family of state-of-the-art, fully open language models at the 7B and 32B parameter scales. The release includes… every stage, checkpoint, datapoint, and dependency used to build [Olmo 3]. Our flagship model, Olmo 3 Think-32B, is the strongest fully open thinking model released to-date.” - from [1]

As we will see, Olmo 3 lags behind the performance of top frontier models, but the value of these models lies in their transparency. In addition to providing a detailed technical report [1], Olmo 3 releases model checkpoints across the entire training process, all of the training data, and full training and evaluation code—the models can be completely retrained from scratch using these resources. For these reasons, the value of Olmo 3 goes beyond simply providing better, fully-open LLMs. For anyone interested in contributing to open LLM research, Olmo 3 and its artifacts are among the most comprehensive starting points to ever be released.

Olmo 3 model flow (from [1])

Model flow. The high-level training pipelines, referred to as “model flows” in [1], used for training both sizes (i.e., 7B and 32B) of Olmo 3 models are shown above. Base models for Olmo 3, which are also openly released, are created via a three-stage process of general pretraining, midtraining on targeted data, and a context extension phase. From here, base models undergo a sequential post-training process that includes supervised finetuning (SFT), direct preference optimization (DPO), and RL training to produce multiple Olmo 3 model variants:

Olmo 3 Instruct: non-reasoning models that quickly respond to user queries and are optimized for multi-turn chat, instruction following, and tool usage.
Olmo 3 Think: reasoning models that undergo specialized training to hone their complex reasoning capabilities by outputting long chains of thought (or reasoning trajectories) prior to providing a final answer.
Olmo 3 RL-Zero: reasoning models that are created by running reinforcement learning (RL) training directly on the pretrained base model—this setup was popularized by the DeepSeek-R1 model [9].

Notably, the training algorithms and pipeline used for the Instruct and Think models are quite similar, but the data are modified to target unique capabilities. After covering necessary details of the Olmo 3 model architecture, we will explain in detail each component of this training process—beginning with pretraining and ending with reasoning-oriented RL training—in an end-to-end fashion.

Preliminaries. This overview outlines the entire training pipeline for Olmo 3. In a single overview, we cannot cover all necessary background information needed to understand how a near-frontier-level LLM is trained. Instead, most important concepts will be explained inline as they are introduced throughout the overview. Additionally, an index of important topics that will appear throughout the overview (with links for further learning) is provided below:

Base Models

“The goal of Olmo 3 Base is to establish a strong foundation that supports a diversity of general capabilities while enabling downstream capabilities like thinking, tool-use, and instruction-following to be easily elicited during post-training.” - from [1]

A new base model is pretrained from scratch for Olmo 3 with a special focus on key capabilities like reasoning and agents (i.e., function calling or tool use). These capabilities are usually elicited during later post-training stages, but we lay the groundwork during pretraining by exposing the model to a diverse dataset and building a robust knowledge base. Specifically, Olmo 3 undergoes three separate phases of pretraining:

A general pretraining stage over a large textual corpus.
A midtraining phase focusing on targeted, high-quality data.
A context extension phase teaching the model to handle longer inputs.

To improve upon Olmo 2 [3], authors in [1] explore new data curation strategies and iterate on the pretraining process in a scientifically rigorous manner. An expanded suite of benchmarks and evaluations that meaningfully capture base model performance across diverse experimental settings is also created, allowing the highest-performing pretraining recipe to be discovered empirically.

Training infrastructure. Pretraining code and recipes for Olmo 3 are available in the Olmo-Core repository, allowing all Olmo 3 model checkpoints to be exactly reproduced. The pretraining process relies upon Fully-Sharded Data Parallel (FSDP)1 distributed training, which saves memory by sharding2 parameters, gradients, and optimizer states across GPUs; see below. During the forward and backward passes, each GPU gathers the full parameters for the current layer from the shards distributed across all GPUs, computes the necessary operations, and then re-shards the parameters—and gradients after the backward pass—before moving on to the next layer. As a result, we only store a single full layer in GPU memory at any given time, while all other layers are sharded across GPUs.

FSDP configuration (source)

FSDP is also performing data parallel training (i.e., the “DP” part of FSDP). In addition to sharding, each GPU processes a unique mini-batch of data, allowing the total batch size to reach 8× (assuming eight GPUs) the maximum batch size of a single GPU. For example, we can see the full training settings for the Olmo 3 32B Base model below, which uses a total batch size of 1,024 during pretraining. Given that the pretraining process uses 1,024 GPUs in total, this means that each GPU in the cluster is processing a single sequence during a training step.

(from [1])

When pretraining a modern LLM like Olmo 3, we use more than just a single eight-GPU node. For example, we just mentioned that Olmo 3 is pretrained with 1,024 H100 GPUs (or 128 eight-GPU nodes), while midtraining and long context training use 128 and 256 GPUs, respectively. However, sharding across thousands of GPUs is inefficient because inter-node communication is much slower than intra-node communication. To solve this, we usually apply FSDP inside of each eight-GPU node and create replicas of the model across nodes to avoid constantly communicating model parameters—which is very expensive—between nodes.

“We ran on 128 nodes with 8× NVIDIA H100 (80GB HBM3) per node, connected via TCPXO (200 Gbps/GPU). We used HSDP via PyTorch FSDP2 with 8-way sharding so each node hosted a single model replica. Communication-intensive collectives were therefore restricted to within-node, improving efficiency.” - from [1]

Within each node, FSDP is used to shard the model, while across nodes, standard data parallelism is used. Each node has a full copy of the model, and gradients are averaged across nodes at each step. This way, we sync parameters and gradients across nodes during each model update, rather than performing syncs at every layer as in FSDP. This approach, called Hybrid-Sharded Data Parallel (HSDP), is used during all phases of training for the Olmo 3 Base models.

Depiction of tensor and context parallelism, or TP and CP (source)

The primary limitation of the HSDP setup described above is the fact that it does not shard everything. For example, full activations are stored on each GPU! When performing long context training, storing the full activations on each GPU can lead to memory issues. As a solution, authors in [1] add Context Parallelism (CP) to their distributed training setup, which splits the model’s input across multiple GPUs in a node along the sequence dimension to reduce memory overhead; see above. To support a multi-node setup, we can apply CP in tandem with FSDP inside of a node, then create data parallel replicas across nodes as in HSDP.

Base model evaluation. The performance of Olmo 3 Base models across a wide variety of benchmarks is presented in the table below. Among fully-open models—meaning weights, data, and code are all available—like Marin 32B and Apertus 70B, Olmo 3 Base models achieve state-of-the-art performance and make notable gains in the math and coding domains. When including open-weights models like Qwen and Gemma, Olmo 3 performs comparably in some domains (e.g., question answering), while lagging behind in others (e.g., math and code).

(from [1])

When analyzing the performance of Olmo 3 models, we will notice that they are not usually state-of-the-art when compared to open-weights LLMs. However, Olmo 3 models do outperform fully open models and approach the performance of the best open-weight models in most domains. Because the Olmo 3 model series discloses its full training dataset, certain data sources must be removed from training to retain a commercial license. Open-weight models, due to not disclosing training data, do not operate under this restriction, which may (partially) explain the gap in performance. Despite lagging slightly behind the state-of-the-art, however, Olmo 3 models are an invaluable contribution due to their transparency and the ecosystem of tools they provide for further research.

Model Architecture

(from Ahead of AI by Sebastian Raschka)

The model architecture used by Olmo 3 [1] (shown above) is a dense3, decoder-only transformer architecture very similar to that of Olmo 2 [3]. There are two model sizes released—7B and 32B parameters—which have the same structure, differing only in the following aspects:

Number of self-attention heads.
Number of key and value heads (in self-attention).
Dimension of hidden layers and token vectors.
Number of total layers.

This architecture follows most design decisions found in other popular open LLMs, such as the Qwen-3 [21] series. Notably, Olmo 3 maintains the post-normalization structure (with RMSNorm) that was shown by Olmo 2 to improve training stability. Additionally, QK-norm is used, meaning an additional RMSNorm layer is applied to queries and keys before computing the attention operation. This additional normalization avoids attention logits from becoming too large, which can aid in training stability (especially for low precision training). This same approach is also used by models such as Gemma-3 and Olmo 2.

In the 7B model, attention layers in Olmo 3 are regular, multi-headed attention layers instead of Grouped Query Attention (GQA) layers. In contrast, the 32B model uses GQA with 40 attention heads and only eight key and value heads. As shown below, GQA shares keys and values—but not queries!—between multiple attention heads, which benefits both parameter and compute efficiency.

(source)

However, the biggest benefit of grouped-query attention comes at inference time. Memory bandwidth usage during inference is reduced because fewer keys and values need to be retrieved from the model’s KV cache. Given that memory bandwidth is the key bottleneck for the decode step during the transformer’s inference process, this change drastically speeds up the inference process.

To further improve attention efficiency, Olmo 3 uses Sliding Window Attention (SWA), which only considers tokens inside of a sliding window—Olmo 3 adopts a window size of 4K tokens is particular—during attention to save costs; see below. SWA is used in 3/4 layers—every fourth model layer uses full attention. SWA is a common architectural choice used by GPT-OSS, Mistral, Gemma and more.

Regular (masked) attention versus SWA

Finally, Olmo 3 uses Sigmoid Linear Unit (SiLU) activations and is pretrained with a context window of 8K tokens. In a later training stage, Olmo 3 undergoes context extension using YaRN [8], which will be discussed more later in the overview. For a from-scratch implementation and detailed explanation of the Olmo 3 architecture, see this recent notebook from Sebastian Raschka, or his extensive architecture comparison that includes most open LLMs.

Ahead of AI

The Big LLM Architecture Comparison

Last updated: Dec 14, 2025…

10 months ago · 1516 likes · 74 comments · Sebastian Raschka, PhD

Evaluating the Base Model

Developing a solid pretraining recipe is an empirical process—we need to test a bunch of settings and see what works well. Given that pretraining is expensive, the number of full-scale pretraining runs we can perform is limited. Instead, we test interventions to the pretraining process by:

Formulating smaller-scale tests to validate our ideas.
Applying promising interventions to full-scale runs.

However, such an approach can still be difficult—results at a small scale may not translate well to larger-scale experiments. Some benchmarks may only be sensitive at specific scales. For example, small-scale pretraining tends to yield models with random performance on math and code benchmarks, but other benchmarks may already be saturated even at smaller scales. Additionally, the LLM evaluation process is generally noisy, so small differences in results may not be meaningful.

“If something hurts performance at small scale, you can confidently rule it out for large scale. But if something works at small scale, you should still make sure you’ve trained on a reasonable number of tokens to conclude with high probability that these findings will extrapolate to larger scales. The longer you train and the closer the ablation models are to the final model, the better.” - from [2]

OlmoBaseEval is a set of 43 total benchmarks that is created to guide pretraining experiments for Olmo 3. This suite is 4× larger than the benchmarks used by Olmo 2. It covers a wide range of capabilities (including math and code), presents multiple newly proposed benchmarks, and maintains held-out test sets for several important capabilities targeted during pretraining. The benchmark is developed according to three major design principles:

Task Clusters: benchmarks are grouped into task clusters over which scores are aggregated, where each cluster targets a core capability.
Proxy Metrics: a detailed scaling analysis is performed to determine which tasks provide a useful signal at which scales.
Signal-to-Noise Ratio (SNR): benchmarks with high SNR4 are either removed from the evaluation suite or evaluated using a larger number of samples.

To form the task clusters, a pool of 23K benchmark scores is collected using 70 different open-weight models, then a clustering approach5 is used to group tasks with similar evaluation results together. In other words, a cluster includes tasks that tend to rank models similarly during evaluation. Some manual post-processing is performed to arrive at the final task clusters: multiple-choice (MC) STEM, MC non-stem, Math, Code, and Code Fill-in-the-Middle (FIM); see below.

(from [1])

A suite of 25 Olmo 2 [3] models trained with varying amounts of compute—and a few other open-weight base models—are used to conduct a scaling analysis, allowing us to observe the scale at which particular metrics become useful; see below.

(from [1])

Based on this analysis, evaluation tasks are separated into two groups:

Base Easy: tasks that show signal at smaller scale.
Base Main: tasks that were not yet saturated at larger scales.

The Base Easy task suite includes all tasks from Base Main that have ground truth answers available. Performance on this suite is measured in bits-per-byte6, which is computed by dividing the negative log-likelihood of the ground truth answer by the number of bytes in the answer string. Using bits-per-byte as a proxy metric for evaluating a pretrained LLM provides a less noisy measure of performance without requiring advanced instruction following capabilities. Other common strategies include perplexity-based evaluation or multiple choice questions.

“Continuous proxy metrics have been shown to be a better decision making tool for model performance before we exit the noise floor.” - from [1]

The OlmoBaseEval suite is used across pretraining and midtraining. The Base Easy suite is used as a proxy for evaluating smaller-scale pretraining runs, while full-scale pretraining and midtraining runs are evaluated with Base Main. The entire OlmoBaseEval suite is openly available in the Olmes repo from AI2 and can be run on any model, as shown below (taken from the Olmes README).

# Run the base easy evaluation (for evaluating small-scale experiments)
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:base_easy:code_bpb \
        olmo3:base_easy:math_bpb \
        olmo3:base_easy:qa_rc \
        olmo3:base_easy:qa_bpb \
    --output-dir 

# Run the base main evaluation
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:base:stem_qa_mc \
        olmo3:base:nonstem_qa_mc \
        olmo3:base:gen \
        olmo3:base:math \
        olmo3:base:code \
        olmo3:base:code_fim \
    --output-dir 

# Run the base held-out evaluation
olmes \
    --model allenai/Olmo-3-1025-7B \
    --task \
        olmo3:heldout \
    --output-dir

Evaluation during pretraining. When running pretraining, we usually want to monitor the intermediate performance of our model. However, the learning rate has a huge impact on evaluation results. To get meaningful metrics, we must anneal (or decrease according to a schedule) our learning rate to zero prior to this evaluation being performed—this simple approach is followed for the Olmo 3 7B model but is expensive. As an efficient alternative, authors in [1] adopt a model merging approach from [6] for their 32B model that merges four checkpoints that are 1,000 steps apart before performing evaluation. This approach has been found to accurately simulate learning rate annealing behavior during pretraining.

“We demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs.” - from [6]

Model merging combines multiple models with the same architecture by taking a linear combination of their weights. This approach might seem bizarre, but it works well because LLMs finetuned from the same pretrained model are mode connected—taking a linear combination of two such models’ weights produces another model that performs well. We can use model merging to combine multiple models into a hybrid model that shares the models’ capabilities. One of the simplest model merging approaches is a model soup [22], which simply averages the weights of multiple model checkpoints. We can find public implementations of various model merging techniques in MergeKit, which is also used in [1]. A full overview of model merging techniques can be found at the link below.

Pretraining

Creating Dolma 3 Mix

The pretraining process for Olmo 3—including both experiments and final training runs—consumes over 90% of total compute for the project and targets four key capabilities: science, medical, math and coding. Dolma 3 Mix, which contains 6T tokens derived from the full Dolma 3 pool of 9T tokens, is the primary data source used for pretraining and is created using the steps illustrated above. These steps mostly match other open pretraining recipes [2, 3, 4], aside from:

Using token-constrained mixing and quality-aware upsampling (details to follow) to improve the overall quality of tokens included in the mixture.
Including a new set of academic PDF data—238M unique PDFs in total with a knowledge cutoff of December 2024. This data is curated using a custom PDF crawler that prioritizes academic sites and paper repositories then converted into linear plain text with OlmOCR.

For Olmo 3 pretraining, authors only consider data sources that have a sufficient number of tokens to meaningfully impact model capabilities during pretraining—additional small but high-quality data sources are reserved for midtraining. Structured data (e.g., question-answer pairs or chat templated data) is also saved for midtraining. Including structured data in pretraining—even if the token quantity is small—significantly impacts evaluation results and can complicate data ablations.

(from [1])

Mixing approach. The full Dolma 3 pool contains 9T tokens, but we must mix and sample this pool—under the constraint of the total number of tokens we want to use for training (i.e., 6T for Olmo 3)—to create the best possible pretraining corpus. As shown in the table above, Dolma 3 is partitioned into groups by type, and we must determine the optimal mixing ratio for each of these groups. The strategy for determining the best data mixture in [1] has two components:

A base procedure that constructs a high-quality data mix over a fixed (i.e., not being actively changed or developed) set of data sources.
A conditional mixing step that efficiently updates our existing mix as data sources change during the model development process.

“We apply a mixing strategy that draws on swarm-based methods to train and evaluate many smaller proxy models, using these results to inform an optimal mix. Further, we apply a novel conditional mixing procedure to account for the fact that our data sources were being constantly refined and updated.” - from [1]

The base procedure in [1] uses a swarm optimization approach that is similar to the idea of RegMix [5]. The swarm optimization proceeds as follows:

Randomly sample a large number of mixtures. In [1], the number of mixtures sampled is set to 5× the number of data sources being mixed.
Perform small proxy experiments by training a 30M parameter Olmo 3 model over 3B tokens from each mixture.
Evaluate each proxy model on the Base Easy suite.
For each task in the Base Easy suite, train a generalized linear model that predicts task performance given the mixing parameters as input.
Use the generalized linear models to simulate performance of different data mixtures and search for the optimal data mixture under constraints.

In [1], authors have a maximum token budget and aim to not repeat any domain in the data more than four to seven times. These constraints are added to the final optimization step when searching for the optimal data mixture. From here, we can take the optimal data mixture and test it at a larger scale; see below.

(from [5])

During model development, we will usually search for the optimal data mixture more than once. Data sources are constantly changing and being improved, which influences the optimal mixture. Additionally, some sources of data may become available later in the development process. Re-running the base procedure from scratch is inefficient—not all sources of data are changing. Instead, a conditional mixing approach is proposed [1], which avoids re-computing the full swarm by:

Beginning with a base mixture that has already been optimized.
Treating this mixture as a “virtual” data source with frozen mixing ratios.
Considering all new or modified data sources.
Re-running the base procedure with both new and virtual domains.

Multiple rounds of data mixing are performed for Olmo 3, including an initial round to optimize the mixture of web data and several conditional mixing rounds that added code and PDF data to the mixture. Properties of the final data mixture are shown below7, where we can see that training on the optimal data mixture—as opposed to the natural data distribution—improves performance on most tasks.

(from [1])

The mixing strategy described above is also flexible and can be used to optimize more than just domain mixtures. For example, when optimizing the code mixture in [1], authors fix the overall ratio of code data at 25% and instead optimize the mixture of programming languages within this hard-coded token budget.

“We found that quality-aware upsampling improves performance in data-constrained settings… We achieved better results by upsampling the highest-quality data: including multiple copies of the top 5% and single copies of the remaining data to reach the target token count.” - from [1]

Quality-aware upsampling. We can further improve performance by upsampling—or including multiple copies of—the highest quality data in the training mixture. This effect can be achieved by first running all data through a quality classifier and forming an upsampling curve as shown below, where the x-axis represents data quality and the y-axis is the upsampling factor. If we were to filter data with a fixed quality threshold, this upsampling curve would be a step function, but authors in [1] model upsampling as a monotonically increasing curve. For example, we see below that the highest quality percentile of data receives an upsampling factor of ~7×, meaning the data is repeated seven times in training.

(from [1])

A separate upsampling curve is formed for every topic in the pretraining data. To find this curve, we start with our known constraints for pretraining:

The optimal mixture of topics (i.e., determined by data mixing).
The total number of desired tokens for training.
The maximum upsampling factor.

From here, we can perform a search over the space of parametric curves to find one that meets these constraints. Once the curve is found, the data for a topic is separated into a discrete set of quality buckets or percentile ranges. We can compute the upsampling factor for a given bucket by integrating the upsampling curve over this bucket and dividing this integral by the width of the bucket.

Midtraining & Long Context

Following the primary pretraining phase for Olmo 3, the model undergoes continued midtraining and long context training. The training objective during these phases is identical to that of pretraining, but we i) adopt more targeted datasets and ii) train for fewer tokens. For example, midtraining and long context training for Olmo 3 each train the model over an additional 100B tokens.

(from [1])

Midtraining for Olmo 3 uses the Dolma 3 Dolmino Mix, which contains 100B tokens curated to enhance key model capabilities. This data mix is derived via a two-part iterative process (illustrated above):

Parallel (or distributed) feedback: many data sources are considered in parallel via efficient microannealing experiments [2] that use lightweight training runs to ablate each data source8.
Integration tests: any data sources yielding promising microannealing results are combined into a centralized annealing run over a 100B token dataset that includes all promising sources of data at that time.

This approach creates a distributed feedback loop that allows many sources of data to be efficiently explored and brings promising data sources together for centralized integration tests. Put simply, we can repeatedly vet data sources in parallel and validate them at larger scales until we arrive at the final mixtraining mix. Five rounds of integration tests were performed when developing Olmo 3.

“This methodology allowed us to make rapid, targeted assessments of the quality of datasets being considered for the midtraining mix, and to iterate on many data domains in parallel.” - from [1]

To evaluate models during midtraining, authors rely primarily upon the Base Main dataset, which consists of benchmarks that are not yet saturated during pretraining. Additionally, lightweight SFT experiments are performed with midtrained models to test the “post-trainability” of various data mixtures. The performance of Olmo 3 models on these benchmarks after iterative rounds of microannealing experiments and integration testing is outlined below.

(from [1])

The final midtraining mix includes some pretraining data to avoid model drift. Additionally, instruction and thinking (or reasoning) data is included, which is found to benefit performance almost universally across benchmarks and helps to lay early groundwork for post-training. All instruction and reasoning data avoids using templates or special tokens during midtraining due to the complexity that this additional formatting introduces into the evaluation process. Instead, text formatting is adopted, which maintains the pretrained model’s output format.

“Although individual sources and domains present performance tradeoffs, the inclusion of these cross-domain post-training data types in aggregate is consistently beneficial, and this benefit begins even before post-training.” - from [1]

We observe very clear domain tradeoffs during midtraining. For example, math and code performance can be improved by increasing the ratio of this data in the midtraining mixture, but such improved performance comes at the cost of degraded performance in other domains. The Dolma 3 Dolmino mix strikes a balance between important domains. Interestingly, the final midtraining model is also a merge of two independently-trained models with different seeds, which authors find to improve performance compared to using an individual model.

Long context9 training is an important component of modern LLMs that plays a huge role in real-world tasks (e.g., tool usage or multi-turn chat) and helps with enabling test-time scaling for reasoning models. However, pretraining an LLM from scratch with natively long context would be incredibly expensive—long sequences consume a lot of memory and compute during training. To get around this, most LLMs are pretrained using much shorter sequences (e.g., 8K tokens in the case of Olmo 3) and undergo a context extension phase after pretraining.

“Because training with long sequence lengths is computationally costly, most language models are pretrained with shorter sequences and extended only in a later stage of model development. During the extension phase, models are trained on longer documents, and positional embedding hyperparameters are typically adjusted to ease positional generalization.” - from [1]

The details of this context extension phase vary drastically between models. For example, the number of tokens used for long context training can be anywhere between 100B (or less) to 1T tokens, and the order of training phases changes between models—the long context phase could be placed before midtraining or even included as part of post-training. Olmo 3 adopts a straightforward pipeline that performs long context extension after midtraining and before post-training. This long context phase uses a 100B token mix10 of the full 600B token Dolma 3 Longmino pool to extend the context of Olmo 3 from 8K to ~65K tokens.

Long context data. The dataset for long context training includes a combination synthetic data and long documents sourced from the academic PDF pretraining corpus. This data undergoes heuristic GZIP filtering that removes any document in the top and bottom 20% of GZIP compressibility. In other words, we remove long context documents that are the least or most redundant. Interestingly, this GZIP heuristic outperforms more sophisticated, model-based techniques that use perplexity metrics to identify documents with long-range token dependencies.

(from [7])

Beyond PDF data, authors in [1] collect synthetic long context data that is focused on information extraction tasks over long documents. Specifically, the technique used to generate long context data is inspired by CLIPPER [7]; see above. This approach avoids making the assumption that the LLM being used to generate synthetic data already has long context abilities. Instead, we do the following:

Partition a long document into several sections.
Identify the most common noun phrases in each section11.
Extract (k=8) text snippets from the section for each noun phrase.
Provide this information in a prompt to an LLM—Olmo 2 32B is used in [1]—to synthesize an aggregation task; e.g., writing a summary, providing a list of true or false claims, creating a conversational explainer, and many more.

We can then train our model to replicate these synthetic outputs using only the long document as input, which teaches the model to reliably extract information. During long context training, this data—including both the PDF and synthetic data—is mixed with short context data from midtraining at a 1:2 ratio (i.e., 34% long context and 66% short context data) to form the Dolma 3 Longmino Mix.

Data during long context training varies drastically in terms of sequence length. Naively batching sequences together would yield excessive padding. When we batch sequences together, we create a fixed-size tensor of size B (batch size) × S (sequence length) × d (embedding dimension). Here, S is either the maximum context length during training or the size of the longest sequence in our batch. Usually, each sequence is shorter than S, and we occupy the rest of this tensor with padding tokens to maintain the fixed shape needed by the GPU.

Standard batching compared to document packing

In the case of long context training, most examples will have length ≪S—most of this tensor will be occupied by empty padding tokens that waste computation; see above. To solve this issue, we can use document packing, which batches sequences together in the same row to avoid excessive padding; see above. Additionally, we add an inter-document mask to the attention process to avoid attention across examples that are packed together. This approach is used by Olmo 3 to improve the efficiency of the long context training process; see here for more details.

“We experiment with several methods for extending RoPE… including adjusted base frequency scaling, position interpolation, and YaRN. Each approach is applied either to all RoPE instances or is restricted to RoPE used in full attention layers. We find that applying YaRN only to full attention layers yields the best overall performance” - from [1]

Context Extension. Several different context extension techniques are tested in [1], and YarN [7] is found to yield the best performance on key evaluations like advanced Needle-in-a-Haystack (NIH) tests, RULER, and HELMET. Full details on YarN and other context extension techniques can be found in this overview.

YaRN is only applied to full attention layers, while positional embeddings are left unchanged in layers that use SWA. As shown in the figures above, this extension approach, when combined with an increasing amount of curated long context data, significantly benefits the long context performance of Olmo 3 models.

Model merging continues to play a role in long context training, but we cannot run multiple long context training runs with different seeds due to the high cost of long context training. Instead, authors in [1] take three (adjacent) checkpoints from the end of a single long context training run and merge them, which further benefits performance. The long context capabilities of Olmo 3 are comparable to or slightly worse than that of the Qwen-2.5 models, as shown in the table below.

(from [1])

Thinking Models

“Olmo 3 Think is trained for reasoning by generating extended thoughts before producing a final answer. To achieve this, we curate high-quality reasoning data (Dolci Think), apply a three-stage training recipe (SFT, DPO, and RLVR), and introduce OlmoRL infrastructure, which brings algorithmic and engineering advances in reinforcement learning with verifiable rewards.” - from [1]

Expanding upon the Olmo 3 Base models, authors in [1] explore post-training strategies to create a suite of reasoning models, referred to as Olmo 3 Think. These models are trained to reason by outputting long reasoning traces or trajectories prior to their final output via large-scale RLVR. For an in-depth overview of LLM-based reasoning models, please see the link below.

The reasoning training process for Olmo 3 Think models differs from other work in two keys respects:

Models are trained with both SFT and DPO prior to RLVR.
A multi-objective RLVR approach is used that mixes data from both verifiable and non-verifiable domains.

Despite differing slightly from related work, this post-training pipeline is shown in [1] to yield consistent gains across all stages (i.e., SFT, DPO, and RLVR).

Evaluation results. Relative to Olmo 2 [2], Olmo 3 Think models are evaluated over a much wider set of benchmarks that capture capabilities like math, general reasoning, knowledge, coding, instruction following, question answering, chat, and more. At the 32B scale, Olmo 3 Think models achieve state-of-the-art metrics among other fully-open thinking models, as well as match the performance of some popular open-weight models like Qwen-2.5 and Gemma-3; see below.

(from [1])

Compared to top open-weight reasoning models like Qwen-3, Olmo 3 Think narrows the gap in performance but is still lags behind. This gap is especially pronounced for 7B-scale models, where we see that Olmo 3 Think is significantly outperformed by Qwen 3 on knowledge-based tasks (e.g., MMLU). Such results align with general trends in performance for Olmo 3—these models are close to state-of-the-art and provide many benefits in terms of transparency and openness.

SFT & DPO

Prior to RL training, we finetune the base model using both SFT and DPO in order to create a more useful starting point for RL. The purpose of these training stages is to both improve capabilities and, more specifically, teach the model to produce thinking traces prior to its final answer. We are seeding the model with the correct output format before performing RL. Notably, recent work on LLM post-training typically does not use all of these stages. For example, DeepSeek-R1 [9] either performs a lightweight SFT stage before RLVR or applies RLVR directly to the base model (i.e., an RL-Zero setup). We see in [1] that consistent gains can be realized by performing SFT and DPO prior to RL given proper data curation.

(from [1])

The key training settings for the SFT and DPO training processes performed with Olmo 3 are provided in the tables shown below for reference. The training code is present in Olmo-Core (for SFT) and OpenInstruct (for DPO).

(from [1])

SFT. Dolci Think SFT is a set of ~2.3M supervised training examples that is used for the SFT stage of Olmo 3 and spans several important capabilities like math, science, coding, instruction following, chat and safety. This data is curated as follows (see above for a step-by-step illustration):

Prompt sourcing: prompts are sourced for each capability from a wide variety of public datasets12.
Re-generating examples: for prompts with incomplete completions, we generate new completion(s)—including both a reasoning trace and final answer for each completion—using either DeepSeek-R1 or QwQ-32B.
Correctness filtering: completions are verified using various domain-specific strategies (e.g., synthetically-generated test cases for code or verifiers for specific precise instruction following constraints).
Heuristic filtering: prompts are removed based on having unclear usage licenses, incomplete reasoning traces, excessive repetition, mention of other model providers, and other heuristics.
Topic filtering: prompts are classified by topic according to the OpenAI query taxonomy, and any topics that are irrelevant to Olmo 3 (e.g., requests for image generation) are either filtered or downsampled.

This post-training data curation process is generic and goes beyond SFT—a similar pipeline is used to curate data for DPO and RLVR. After prompts are sourced and filtered, the data mixture is derived using an approach very similar to that of midtraining: many data sources are gathered in parallel and tested via lightweight SFT experiments that train an LLM over 100B tokens from the domain of interest combined with an 100B token SFT base mixture. After evaluating data sources in parallel, we can perform centralized integration tests with data sources that are found to meaningfully benefit performance. Interestingly, all data sources in [1] were found to benefit performance on at least one evaluation benchmark; see below.

(from [1])

Beyond the post-training benchmarks used for Olmo 3, authors in [1] emphasize the role of “vibe checks”—or the manual inspection of a diverse (but usually small) set of model outputs by researchers—in evaluating models. Evaluation metrics and benchmark scores are useful, but they rarely tell the full story. By manually inspecting model outputs, we can discover trends in performance across experiments and training stages that might be difficult to uncover otherwise.

“Using [Olmo-Core], we can train a 7B model at 7700 tokens per second per GPU and a 32B at 1900 tokens per second per GPU… by relying on PyTorch’s built-in torch.compile(), custom kernels for operations such as attention and language modeling head, asynchronous and batched gathering of metrics, and asynchronous writing of checkpoints.” - from [1]

Similarly to pretraining and midtraining, the SFT training process uses the Olmo-Core codebase, which provides optimized code for supervised training. Compared to prior SFT training code for Olmo (i.e., found here in OpenInstruct), Olmo-Core is ~8× faster. Two epochs of training are conducted over Dolci Think SFT, and we again derive the final model via model merging. Specifically, we linearly merge the weights of two model checkpoints trained with different learning rates over the same data, forming the Olmo 3 7B and 32B Think SFT models.

DPO. Preference tuning is typically used for improving the alignment of an LLM to human preferences. In recent research on reasoning models, preference tuning is rarely used, but we see in [1] that DPO-based preference tuning yields an improvement in capabilities when used in tandem with SFT prior to the RL training phase. More specifically, Olmo 3 undergoes DPO-based preference tuning using a strategy that is inspired by Delta Learning [11]; see below.

(from [11])

To create a preference dataset for DPO, prior models like Olmo 2 [3] leverage a synthetic data pipeline similar to UltraFeedback [20] that generates completions from a diverse pool of models. For each prompt, we do the following:

Generate completions with each model.
Rate each completion with an LLM judge.
Form preference pairs based on these ratings (i.e., higher-scoring responses are preferred in a preference pair).

This approach hinges upon the diversity of the underlying model pool to yield high-quality preference pairs. Applying a similar model pooling approach in the reasoning domain would be difficult, as the number of LLMs with open reasoning traces is limited—most (proprietary) reasoning models surface only final outputs and hide their reasoning process. Delta Learning uses an alternative approach of forming high-quality preference pairs by minimizing the quality of rejected completions.

This approach focuses less on the absolute quality of completions in a preference pair and more on the relative quality difference between the chosen and rejected completions. For example, authors in [1] show that further training the Olmo 3 Think SFT model on synthetic completions from Qwen-3-32B actually degrades performance. However, we can improve Olmo 3 Think SFT performance via DPO with preference pairs that contain i) a chosen completion from Qwen-3-32B and ii) a rejected completion from the weaker Qwen-3-0.6B model.

“The intuition behind delta learning is that the quality of preference data depends primarily on the quality of the delta between chosen and rejected responses; the quality of either response individually is less important.” - from [1]

Olmo 3 Think DPO models are trained on Dolci Think DPO, a preference dataset comprised of completions with clear capability deltas that are generated using Delta Learning. As described above, model size is adopted as a simple heuristic for completion quality—chosen completions are sampled from the 32B Qwen model, while rejection completions are sampled from the 0.6B model. While all Olmo 3 Think SFT models are trained on a similarly-sized dataset, 7B and 32B Olmo 3 Think DPO models use preference datasets with 150K and 200K pairs, respectively.

Prompts for Dolci Think DPO are mostly reused from SFT, but additional sources of preference data from Olmo 2 (e.g., UltraFeedback and DaringAnteater) are also added. The same filtering operations from SFT are used by DPO, but filtering is only applied to chosen completions—rejected completions are left unfiltered. Due to the computational expense of experiments with reasoning traces, a hierarchical approach is used for finding the best data mixture. First, a wide variety of mixing experiments are performed using standard LLMs that directly provide output with no reasoning. The top three data mixtures from this phrase are then used in full reasoning experiments to find the best-performing preference mix.

RLVR with GRPO

As a final touch, Olmo 3 Think models undergo RL training using a combination of verifiable and non-verifiable rewards to improve the models’ reasoning skills while maintaining their general utility. The RL training process focuses upon the domains of math, code, instruction following, and general chat.

“We introduce OlmoRL, which includes our algorithm and closely intertwined engineering infrastructure to address challenges for RL with long reasoning traces, extending RLVR to include a wider variety of verifiable tasks.” - from [1]

Detailed training configurations for each of the RL training processes performed using Olmo 3 are provided in the table shown below.

(from [1])

Reward signals. Most recent work on RL for reasoning models considers a pure RLVR setup with only verifiable rewards. For example, many works apply RL in math or coding domains [15, 17], where we can easily check the correctness of the model’s output via rules or test cases. In [1], the standard RLVR setup is extended to include rewards from both deterministic verifiers and LLM judges; see below.

(from [1])

The math domain uses a standard verifier that performs basic normalization of answers and equivalence checks via sympy to yield a binary correctness score. For coding and instruction following, correctness is checked via either test cases or constraint-specific verification functions. The reward in these domains can be binary (i.e., all tests must pass to receive a reward) or the ratio of tests that pass.

The general chat domain is not verifiable—we must rely upon an LLM judge to derive a reward. Authors in [1] use Qwen-3-32B as their judge model with thinking mode turned off and the prompt shown below. Depending on if ground truth outputs are available, the judge can either be reference-based or reference-free.

(from [1])

Enhancements to GRPO. Olmo 3 Think uses Group Relative Policy Optimization (GRPO) as the underlying optimizer for RL training. Inspired by a swath of recent papers that propose useful modifications to GRPO, authors in [1] adopt a wide set of improvements to the vanilla GRPO algorithm. The following enhancements are used in particular:

Zero Gradient Filtering: prompts for which the entire group of completions or rollouts in GRPO receive the same reward are removed [16].
Active Sampling: despite filtering zero gradient examples, a constant batch size is maintained by ensuring additional samples are always available to replace those that get filtered [16].
Token-Level Loss: the GRPO loss is normalized by the total number of tokens across the batch instead of per-sequence, which avoids instilling a length bias in the loss [16].
No KL Loss: the KL divergence term is removed from the GRPO loss to allow for more flexibility in the policy updates, which is a common choice in recent reasoning research [16, 17, 18].
Clipping Upper Bound: the upper-bound term in the PPO-style clipping used by GRPO is set to a higher value than the lower bound to enable larger policy updates [16].
Truncated Importance Sampling (TIS): an extra importance sampling term is added to the GRPO loss to adjust for differences in log probabilities between engines used for training and inference [18].
No Standard Deviation: the standard deviation of rewards in a group is excluded from the denominator of the GRPO advantage calculation [19].

When considering all of these enhancements, the GRPO objective function is formulated as shown below. This objective maintains the structure of the GRPO objective, which is nearly identical to the objective from PPO but uses a modified advantage formulation. Compared to vanilla GRPO, however, we normalize the objective differently, slightly change the advantage, tweak the upper bound for clipping, and weight the objective using a capped importance sampling ratio.

Enhanced GRPO formulation for Olmo 3 (from [1])

More details on TIS. During RL training, we are constantly alternating between two key operations:

Rollouts: given a set of prompts, sample multiple completions to each prompt using the current LLM (or policy).
Policy Updates: computing an weight update for our LLM using the sampled rollouts and the objective function outlined above.

To improve efficiency, these operations are usually handled by separate engines. We sample rollouts using an optimized inference engine like vLLM or SGLang and compute policy updates with training frameworks like transformers—or usually a distributed version of this framework that uses an algorithm like FSDP or ZeRO. The use of different backends for rollouts and policy updates can lead to a mismatch between the two environments in which the log probabilities for a rollout differ significantly from those used in the policy update; see below.

(from [18])

This mismatch persists even when steps are taken to reduce differences between inference and training backends. As a solution, authors in [18] use a truncated importance sampling scheme that re-weights the GRPO objective by the ratio of log probabilities from the two engines. We cap (or truncate) this importance sampling ratio at a maximum value of ρ. Without this correction, the RL training process becomes slightly off-policy, which can degrade performance. Using TIS re-weights examples with significant mismatches to solve this issue; see below.

(from [18])

“This importance sampling term seems to be essential to getting modern RL infrastructure right, as without it, scaling to more complex systems is hard to get numerical stability with…. the advantage or reward is getting re-weighted by an importance sampling log-ratio corresponding to the difference in probabilities from the two sets of model implementations (e.g. VLLM vs Transformers).” - source

The importance sampling expression used by TIS is derived from the statistical definition of importance sampling. Formally, importance sampling is a statistical method used to estimate properties13 of a target probability distribution f(x) by sampling from a different proposal distribution g(x). Usually, taking samples from g(x) is much cheaper than f(x), which is the motivation for importance sampling. Because sampling from f(x) is difficult, we instead draw samples from g(x) and correct for the discrepancy between f(x) and g(x) by weighting each sample by the importance ratio f(x) / g(x); see below.

(source)

In the case of RL, we are interested in log probabilities sampled from our training engine—this is our target distribution f(x). However, we can take samples more efficiently from our optimized inference engine—this is our proposal distribution g(x). From here, we can use importance sampling to correct for any mismatch between these two distributions. Specifically, the importance sampling ratio (highlighted in the explanation above) is f(x) / g(x), or the log probability from the training engine divided by the log probability from the inference engine. As we might recall, this is exactly the importance ratio used within TIS!

Dolci Think RL. Similarly to other training phases, prompts for RL training are sampled from a wide variety of public sources. The full dataset, called Dolci Think RL, contains ~100K prompts spanning math, code, instruction following and chat domains. When curating code data, we need pairs of problems with associated test cases, which are not always available. As a solution, authors in [1] develop the following synthetic data pipeline:

Rewrite the problem and solution.
Generate test cases for the problem.
Execute the test cases to see if they pass.
Keep all problems that pass >80% of test cases.
Remove any remaining test cases that fail.

A similar rewriting and filtering approach is used for chat data. First, GPT-4.1 is used to rewrite samples for better clarity and reference answers are extracted. We then generate eight samples for each prompt using an LLM, compute the F1 score14 between the reference answer and each response, then remove samples with an F1 score outside of the range [0.1, 0.9]. Intuitively, this filtering operation aims to remove noisy or difficult examples from RL training.

Prior to RL, the dataset also undergoes offline difficulty filtering. Concretely, this means that we:

Generate eight rollouts for each prompt using the DPO model (i.e., the starting policy for RL training).
Remove and prompts that are already easily solved by the model before any RL (i.e., a majority pass rate of >62.5%).

The goal of difficulty filtering is to improve the sample efficiency from RL by not training on trivial data. This offline filtering is performed for the Olmo 3 Think 7B model, then the results of this filtering are re-used for the 32B model due to cost constraints. Intuitively, the 32B model should be able to solve any problem that is easily solved by the 7B model, and any remaining easy samples would still be filtered via active sampling in GRPO. In specific cases, authors also filter out data that is found to be too difficult for the model to solve during RL training.

“We found RL experiments were both long and compute-expensive… we established a pipeline in which: we performed dataset-specific runs on an intermediate SFT checkpoint and observed downstream evaluation trends over the first 500-1000 RL steps; focused on math domain training when testing new algorithmic changes; periodically ran overall mixture experiments to ensure mixing was stable.” - from [1]

The prohibitive cost of RL training makes discovering optimal data mixtures more difficult relative to prior training phases, forcing authors to design cheaper proxy experiments for tuning their RL setup. Candidate data mixtures are vetted with short RL training runs (~1K training steps) and combined into a larger mixture that is intermittently tested in centralized experiments. Similarly, algorithmic changes, such as modifications to GRPO, are tested in a simplified single-objective (i.e., math only) RL environment. Most tuning is also performed with the 7B model, while the 32B model just uses the same settings. Put simply, any ablations must use a simplified setup—running full RL training is too costly.

Key findings. We learn in [1] that DPO tends to be a better starting point for RL training—further preference tuning improves the performance of the SFT model (further SFT does not) and yields higher performance after downstream RL training. Starting from a DPO checkpoint, training rewards increase steadily throughout the RL training process; see below. Additionally, training on a mixture of different reward signals is found to be beneficial, as it prevents over-optimization to a particular domain. The training reward is actually lower when reward signals are mixed together, but the model is found to generalize better in downstream evaluations. This finding indicates that performing RL training over a diverse dataset with varying reward signals can aid performance and prevent reward hacking.

(from [1])

“In RL, the key technical challenge for finetuning models that generate long sequences is managing inference – also called the rollouts. For our final models, we performed RL rollouts that were up to 32k tokens in length, and on average over 10k tokens (for the reasoner models). Inference dominated our costs, using 8 H100 nodes for training and 20 nodes for inference for the 32B OlmoRL reasoner model. Given the cost of autoregressive inference, our learner spends 75% of the time waiting for data, so in terms of GPU utilization, we use approximately 5x as much for inference vs training.” - from [1]

Infrastructure for RL. One key focus of Olmo 3 is improving the efficiency of the RL training process. The cost of RL training is dominated by rollouts; e.g., Olmo 3 models use 5-14× more compute for inference compared to policy updates. During RL training, most of the time is spent waiting for inference to finish, and this inference process can have a long tail if certain completions are longer than others. All of these issues degrade throughput and lead to poor hardware utilization.

(from [1])

To make the RL training process more efficient, authors in [1] propose OlmoRL, an optimized setup for RL training that focuses proposes the following:

A fully-asynchronous, off-policy RL setup is used that reduces idle time by allowing inference and model updates to continue running without waiting for all components to finish.
Continuous batching (see here for details) is used to constantly enqueue new inference requests in real-time as generations finish.
To compensate for examples removed by active sampling, OlmoRL—due to its asynchronous setup—can just continue sampling and filtering examples until the desired batch size is reached.
Inflight updates to the model weights being used for inference are performed without pausing generation or clearing the KV cache, which is found in [1] to improve throughput by ~4× with no deterioration in accuracy.

Several low-level threading updates are also made to each of the inference and policy update actors; see here for the full code. When applied in tandem, the set of optimizations proposed for OlmoRL allows the wall-clock RL training time of Olmo 3 RL Think to be decreased from over 15 days to ~6 days!

(source)

Olmo 3.1 Think. After the initial release of Olmo 3, authors kept the RL training process running for an extra three weeks, producing the Olmo 3.1 Think model. This model perfectly demonstrates the value of scaling RL training and the necessity of creating stable RL training frameworks (like OlmoRL) that can run for long periods of time without instability. After the initial release, authors were unsure whether continuing the RL training process would yield further benefits, but the model continued to improve during this time. Interestingly, the model’s performance was also still not fully saturated after further training.

RL-Zero

(from [9])

DeepSeek-R1-Zero [9] demonstrated that LLMs can learn complex reasoning behavior by applying RL training directly to a base model (i.e., with no SFT); see above. This work was the first to demonstrate that reasoning capabilities could be developed without supervised data, making the RL-Zero setup—or just running RLVR on top of a base model—a popular benchmark for RL research. Although this setup is widely used in LLM research, most RL-Zero experiments are performed using models with no data transparency, preventing proper decontamination.

“[A lack of data transparency] can lead to a myriad of issues with benchmark evaluations being contaminated; e.g. midtraining data containing the evaluation which makes spurious rewards as effective as true reward or improvements from fixing prompt templates outweighing the improvements from RL.” - from [1]

Going further, a variety of unexpected findings have been recently published by work that leverages an RL-Zero setup. For example, researchers have shown:

RLVR with random rewards still improve model performance [12].
RLVR with a single training example can improve performance [13].
Base models can match the reasoning capabilities of models trained with RLVR if a sufficient number of samples are taken per prompt [14].

Understanding the cause of these findings is necessary to develop a deeper collective knowledge of RL training. Although many hypothesis exist, one possible explanation for this behavior is data contamination—these observations may simply be an artifact of evaluation data leaking into the base model’s dataset. Unfortunately, existing RL-Zero setups provide no way of validating the impact of data contamination, which makes drawing definitive conclusions from this work difficult (potentially even impossible). Olmo 3 solves this problem!

Authors in [1] release a fully open RL-Zero setup based upon Olmo 3 Base, which has fully transparent pretraining and midtraining datasets, and a new dataset for RLVR called Dolci RL-Zero. While most RL-Zero setups are single-objective (e.g., running RLVR from a base model on Math-500 is a common benchmark), Dolci RL-Zero is comprised of four domains: math, code, precise instruction following, and a mixture of all three objectives. Additionally, decontamination—or ensuring pretraining and midtraining data have no overlap with evaluation data—is prioritized, allowing more confident conclusions to be drawn from experiments with RLVR.

Notable findings. The RL-Zero setup proposed by Olmo 3 is mostly positioned as a cleaner and more reliable starting point for future research. However, authors in [1] also perform some interesting analysis using this setup. First, we see in [1] that using simpler prompt templates—mostly text-based with no special tokens as shown below—is more conducive to performant RLVR. This behavior stems from base models being primarily trained on raw text without special tokens or templates.

(from [1])

To achieve the best results, Olmo 3 RL-Zero performs lightweight prompt tuning with the base model to derive a simple, custom prompt with no special formatting for each RL domain. Performance of RL-Zero models is shown in [1] to improve steadily throughout RL training in terms of both training reward and held-out evaluation metrics; see below. As expected, we see an improvement in Pass@1 metrics, aligning with prior findings on RLVR15 [15]. Interestingly, we also see a slight improvement in Pass@3216 metrics, indicating that the base model learns to solve some problems that go beyond its initial reasoning capabilities.

(from [1])

The multi-objective nature of Olmo 3 RL-Zero also presents new challenges in RLVR research. We see above that models trained over the mix of rewards from each domain improve in their performance, but they still lag behind models that are explicitly trained on a single domain. Solving this under-optimization and developing effective techniques for balancing multi-objective RLVR is a tough research problem, but Olmo 3 provides a clean and efficient—RL-Zero is cheaper than the full post-training pipeline for Olmo 3 Think!—test bed for further analysis. For example, the Dolci RL-Zero setup is used in [1] to test several changes to the underlying RL algorithm, as well as to study the impact of different data mixtures during midtraining on the downstream RL training process.

(from [1])

Fixing RLVR with random rewards. RLVR with random rewards no longer benefits model performance when using the decontaminated Olmo 3 RL-Zero setup; see above. Although this finding clearly demonstrates the value of fully-open models for research, the results shown for RLVR with random rewards in [12] may still not be completely a product of data contamination. As shown below, these results were only found to hold true for the Qwen-2.5 model series on the Math 500 dataset—other models and tasks did not clearly benefit from random rewards.

(from [12])

Therefore, there may be some unique aspects of Qwen-2.5—including potential data contamination—that lead to these observations, which are very hard to debug without full openness. For example, beyond data contamination, an alternative rationale exists for the performance benefit of RLVR with random rewards:

Qwen models are very good at generating code to assist in solving math reasoning problems, and code reasoning—even when no code execution is allowed—is positively correlated with math performance.
In the DAPO [16] paper, authors observe that entropy decreases quickly during both PPO and GRPO training. Token distributions become concentrated, so outputs are similar when you sample multiple times and existing model behaviors are reinforced (i.e., made more likely).
This entropy collapse occurs because the clipping operation in PPO (and GRPO) restricts policy updates for low probability tokens more strictly than for high probability tokens, due to the structure of the policy ratio.
To solve this issue, DAPO recommends a “clip higher” approach, which increases the upper bound of the clipping range in PPO so that clipping is not too restrictive of policy updates.

In the case of RLVR with random rewards, clipping can reinforce the existing behavior of performing code reasoning for solving math problems in Qwen-2.5 and, in turn, improve its performance. Although this behavior is not observed in Olmo 3, the GRPO variant used in [1] also adopts the clip higher approach from DAPO. As a result, it is unclear in this case whether RLVR from random rewards is fixed due to algorithmic changes or the lack of data contamination. However, analyzing such a property would be impossible without fully open models like Olmo 3.

Instruct Models

Although reasoning models are very powerful, much of the real-world usage for LLMs is still based on general tasks that do not require extensive reasoning (e.g., information or advice-seeking queries). With this in mind, authors in [1] create Instruct versions of the Olmo 3 models that quickly respond to user queries without the need to output a reasoning trajectory. The training pipeline for Olmo 3 Instruct is similar to that of the Think models—it includes SFT, DPO and RLVR. Rather than focusing upon reasoning, however, the data used for Instruct post-training emphasizes multi-turn chat, conciseness of responses, and tool use.

“Everyday chat settings often do not require the inference-time scaling of Olmo 3 Think, allowing us to be more efficient at inference time on common tasks by not generating extended internal thoughts.” - from [1]

Instruct evaluation. The benchmarks used for evaluating Olmo 3 Instruct models include benchmarks from Olmo 3 Think along with a few additional benchmarks (i.e., Berkley function calling leaderboard, LitQA2, and SimpleQA) for evaluating function calling capabilities. As shown below, Olmo 3 Instruct models are found to benefit significantly from tool use, indicating that post-training has instilled correct tool usage behavior. Across other benchmarks, Olmo 3 Instruct models perform comparably to popular non-thinking models. Interestingly, Olmo 3 outperforms Qwen-3 with thinking mode turned off at the 7B scale on several benchmarks, though this gap in performance is not present at the 32B scale.

(from [1])

SFT. A new Dolci Instruct SFT dataset is created for Olmo 3 Instruct models that emphasizes multi-turn chat and agentic capabilities (i.e., function calling). This dataset builds upon that of Olmo 2 [3] but makes a few key changes:

Any reasoning traces that exist in the data are removed.
Synthetic completions are updated to use newer model generations (e.g., GPT-4.1 instead of GPT-3.5 or GPT-4).
An extensive set of supervised function calling examples is included.

When curating function calling data, authors focus heavily upon collecting data in realistic environments, primarily MCP servers. More specifically, there are two key strategies used:

Real trajectories: ScienceQA and WebSearchQA datasets are created by using GPT-4.1 or GPT-5—equipped with tools for querying the internet or a corpus of scientific papers via separate MCP servers—to generate problem solving trajectories for real-world queries.
Simulated interactions: starting with a pool of tools and API specifications taken from public datasets, a large synthetic function calling dataset is created by prompting a pool of LLMs (GPT-4o, GPT-4.1, and GPT-5) to generate user queries, tool responses, and assistant messages.

Executable function calling environments provide valuable training data by exposing the model to complex interactions with real tool outputs—including errors. Because collecting real tool-use data is hard to scale, however, simulated environments are used to create data for a wider set of function calling scenarios; see below for details. While real trajectories are more complex, simulated data has higher tool diversity and can be used to create examples with both multiple chat turns and multiple agent-environment interaction steps.

(from [1])

Interestingly, we see in [1] that using a unified format for function calling data is necessary for the model to perform well. Specifically, authors provide a tool spec in the system prompt, wrap tool calls in XML tags in assistant messages, and use a special environment role—represented with dedicated special tokens—for all tool outputs. An example of the unified tool format for Olmo 3 is shown below.

Tool calling example from Dolci Instruct SFT

To obtain the final data mixture for Olmo 3 Instruct SFT, authors adopt the same strategy used for tuning the Olmo 3 Think models. Namely, we start with a base data mixture of 100K supervised examples and ablate the performance impact of each data domain that is added on top of the original dataset from Olmo 2 [3].

“We find that training our instruct model on top of the thinking SFT model both increases model performance on benchmarks… and also does not increase average model response length.” - from [1]

As we might recall from the model flow at the beginning of this overview, the Olmo 3 Instruct models are trained starting from Olmo 3 Think SFT, which the authors find to benefit performance of Instruct models.

DPO. Olmo 3 Instruct models are trained using a similar (but expanded) Delta Learning approach that is adapted from Olmo 3 Think to better prioritize general chat capabilities. Specifically, three types of preference pairs are used:

Delta Learning is used to construct contrastive preference pairs in an identical fashion to Olmo 3 Think, but both chosen and rejected completions are generated via Qwen-3 with thinking mode turned off.
Delta-maximized GPT-judged pairs are created by generating synthetic completions from a pool of diverse models (including at least one model that is known to be much worse than the others), scoring them with a GPT-4.1 judge, then choosing the best and worse completion as a preference pair.
Multi-turn preferences are synthetically generated by first prompting an LLM to self-talk or synthetically generate context from an existing prompt to create multi-turn chat data, then sampling a final assistant response for this multi-turn chat via Delta Learning.

Multi-turn preferences only differ in the final assistant response, where chosen and rejected completions use models with a large quality gap (e.g., GPT-3.5 versus GPT-4.1 or Qwen-3-0.6B versus Qwen-3-32B) to generate this final turn. The GPT-judged preference data pipeline is inspired by UltraFeedback [20] but has been updated to use a more modern model pool and LLM judge; see below.

(from [20])

Interestingly, authors in [1] mention that naively applying the UltraFeedback approach with a modern model pool and judge performs poorly—all models in the modern pool perform well and tend to have minimal quality deltas in their output. As a solution, Olmo 3 proposes a “Delta Maximization” approach that i) ensures that at least one model in the pool is of much lower quality than others and ii) always constructs preference pairs from the best and worst completion in the pool.

“Our initial attempts to modernize the Ultrafeedback pipeline from OLMo 2 and Tülu 3 by improving the quality of the LLM judge (GPT-4o → GPT-4.1) and updating our data-generator model pool failed to yield gains and even hurt relative to the OLMo 2 preference dataset baseline.” - from [1]

Ensuring a large delta between preference pairs is found to be essential for model performance. Additionally, we see clear benefits in [1] by combining GPT-judge preference pairs with those from Delta Learning, revealing the benefit of using different preference signals. We also see in [1] that verbosity bias—or the tendency of LLM judges to prefer longer completions—noticeably impacts synthetic preference pipelines. To promote concise responses, chat-based preference pairs are filtered such that chosen and rejected completions do not differ in length by more than 100 tokens17; see below. Length control deteriorates certain benchmark scores but also improves usability, leads to better vibe tests, and is—somewhat counterintuitively—determined to be a superior starting point for RL training.

(from [1])

As shown in the figure above, DPO performance does not improve monotonically with more data and the optimal amount of data is task-dependent. In other words, the total size of the training dataset is a hyperparameter that must be tuned. In [1], the optimal data size and mixture is determined via a combination of:

Ablation experiments that combine different data sources with an 100K base mixture to determine data viability.
Mixing experiments that combine 50K examples from the base mixture with 50K examples from various data sources to test the impact of up-sampling a particular source of preference pairs.
One-off tests of hand-crafted data mixtures determined by expert intuition.

The behavior of DPO training is less predictable, and the final training strategy was determined empirically. Authors manually selected nine different mixtures to compare to a uniform sampling baseline and performed hyperparameter sweeps to determine the optimal amount of training data and learning rate. The final checkpoint is selected via a combination of vibe tests and benchmark scores.

RL. The RL training process for Olmo 3 Instruct is identical to that of Olmo 3 Think aside from a few minor modifications:

Using less challenging datasets (i.e., by removing the most difficult tasks) in the math and coding domains.
Removing the offline difficulty filtering step. This step is unnecessary for Instruct models due to focusing less on complex reasoning.

Olmo 3 Instruct models are trained on a mixture of general chat, math, and code data using the same RL training stack as Olmo 3 Think. However, the maximum response length is capped at 8K tokens to avoid excessively long outputs. The full RL pipeline is applied to multiple DPO models, and the final model is chosen via a combination of “final average performance, length analysis, and vibe-tests.”

The Open LLM Renaissance

AI research has traditionally been very transparent, but the level of openness has decreased during the LLM boom as top labs have focused efforts on proprietary models (e.g., GPT, Gemini, or Claude) with little transparency. Open models have always been a topic of discussion, but the level of interest in open LLM research skyrocketed with the release of DeepSeek-R1 [9]. After this release, a variety of (primarily Chinese) AI labs followed suit by releasing great models like Qwen-3, Kimi-K2, MiniMax M2, GLM-4.5, and more; see below for details.

Interconnects

2025 Open Models Year in Review

Welcome to the first Artifacts Recap, where we highlight the most notable and impactful open model releases of this year. And what a year it has been! Starting into the year, the open model landscape was seen as lagging behind severely, with open models being mostly a choice for those who needed privacy or wanted to fine-tune models for their use cases…

5 months ago · 14 likes · 7 comments · Florian Brand and Nathan Lambert

Despite the boom in open LLM research, open LLM releases were minimal in Western countries aside from models like GPT-OSS and Mistral. Additionally, the models that were released are almost exclusively open-weights, rather than being fully open—i.e., no code or data transparency is provided. These issues inspired the creation of initiatives like the ATOM project and have driven investment into the Olmo model series. As we have seen, Olmo 3 models still lag behind their open-weight counterparts, but we should remember the following points:

Progress between the original Olmo model and Olmo 3 is significant.
No other fully-open model series has neared state-of-the-art performance.
The impact of Olmo 3 goes beyond just the models themselves.

The artifacts released by Olmo 3 are more than a model—they are a starting point for any aspect of open LLM research. Anyone with access to GPUs has the ability to clone and iterate upon the model flows proposed in [1]. Performing this kind of research before Olmo 3 may have required first crafting a functional training recipe, which would (conservatively) require millions of dollars in experiments.

With this in mind, resources from Olmo 3 will fuel open research for the foreseeable future. We are already seeing positive signs in this direction with models like Intellect-3, Trinity, and Mistral 3 being released immediately after Olmo 3.

New to the newsletter?

Subscribe now

Bibliography

[1] OLMo, Team, et al. “Olmo 3” https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf (2025).

[2] Hugging Face Team. “Smol-LLM Training Playbook.” https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook (2025).

[3] OLMo, Team, et al. “2 OLMo 2 Furious.” arXiv preprint arXiv:2501.00656 (2024).

[4] Groeneveld, Dirk, et al. “OLMo: Accelerating the science of language models.” Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 2024.

[5] Liu, Qian, et al. “Regmix: Data mixture as regression for language model pre-training.” arXiv preprint arXiv:2407.01492 (2024).

[6] Li, Yunshui, et al. “Model Merging in Pre-training of Large Language Models.” arXiv preprint arXiv:2505.12082 (2025).

[7] Pham, Chau Minh, Yapei Chang, and Mohit Iyyer. “CLIPPER: Compression enables long-context synthetic data generation.” arXiv preprint arXiv:2502.14854 (2025).

[8] Peng, Bowen, et al. “Yarn: Efficient context window extension of large language models.” arXiv preprint arXiv:2309.00071 (2023).

[9] Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).

[10] Lambert, Nathan, et al. “Tulu 3: Pushing frontiers in open language model post-training.” arXiv preprint arXiv:2411.15124 (2024).

[11] Geng, Scott, et al. “The delta learning hypothesis: Preference tuning on weak data can yield strong gains.” arXiv preprint arXiv:2507.06187 (2025).

[12] Shao, Rulin, et al. “Spurious rewards: Rethinking training signals in rlvr.” arXiv preprint arXiv:2506.10947 (2025).

[13] Wang, Yiping, et al. “Reinforcement learning for reasoning in large language models with one training example.” arXiv preprint arXiv:2504.20571 (2025).

[14] Yue, Yang, et al. “Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?.” arXiv preprint arXiv:2504.13837 (2025).

[15] Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).

[16] Yu, Qiying, et al. “Dapo: An open-source llm reinforcement learning system at scale.” arXiv preprint arXiv:2503.14476 (2025).

[17] Zeng, Aohan, et al. “Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.” arXiv preprint arXiv:2508.06471 (2025).

[18] F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao. Your efficient rl framework secretly brings you off-policy rl training, Aug. 2025. URL https://fengyao.notion.site/off-policy-rl.

[19] Liu, Zichen, et al. “Understanding r1-zero-like training: A critical perspective.” arXiv preprint arXiv:2503.20783 (2025).

[20] Cui, Ganqu, et al. “Ultrafeedback: Boosting language models with scaled ai feedback.” arXiv preprint arXiv:2310.01377 (2023).

[21] Yang, An, et al. “Qwen3 technical report.” arXiv preprint arXiv:2505.09388 (2025).

[22] Wortsman, Mitchell, et al. “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.” International conference on machine learning. PMLR, 2022.

Another common choice for the distributed training of LLMs is the zero redundancy optimizer (ZeRO), which is usually accessed via the deepspeed package.

Here, “sharding” means that we split the data evenly among the GPUs that we have available. For example, if we have an eight-GPU node and want to store 16 parameters in a sharded manner, we would store two parameters on each GPU. Sharding reduces per-GPU memory consumption to 1 / N, where N is the number of GPUs.

Here, we call the architecture dense to clarify that it does not use a sparse architecture variant like a Mixture-of-Experts (MoE).

The technique used to compute SNR of each benchmark is explained here.

Authors in [1] use Ward’s variance-minimization, which iteratively refines task clusters to minimize the variance of evaluation scores between benchmarks in a cluster.

Bits-per-byte and perplexity are common information-theoretic metrics used to measure the performance of pretrained language models. Both of these metrics capture the predictive quality of the model’s next token distribution. These metrics are related in that they both measure the cross-entropy of the model’s next token distribution, but they are normalized differently.

Interestingly, this procedure naturally learns to up-weight STEM data, as well as favor python data within the StackEdu code mixture.

Given a target dataset, we begin with 5B tokens of web data and combine this with 5B tokens of the target dataset. We then anneal (i.e., train a model over the data as the learning rate is decayed to zero) over the combined 10B tokens and evaluate. As a baseline, we simply anneal over 10B tokens of web-only data.

The context window refers to the total number of tokens that an LLM can process at a time. For example, a context window of 4K tokens means that the total length of the model’s input and output cannot exceed 4K, otherwise the model may perform poorly.

The long context extension phase of Olmo 3 trains on 100B tokens for the 32B model and 50B tokens for the 7B model. The exact same proportions of data are used for both the 50B and 100B mixtures.

In particular, these noun phrases are identified in [1] using TF-IDF.

Math uses Open Thoughts 3 and Synthetic 2. Coding uses AceCoder, the code portion of the Llama Nemotron post-training dataset, and Open Code Reasoning. Chat uses WildChat (with a focus on the Tulu-3 [10] subset) and Open Assistant. Precise instruction following uses the same prompts from Tulu-3 with some additional verifiable constraints. There are also a few other datasets included in the SFT mix like TableGPT for transforming data and Aya for multilinguality.

By “properties” of the target distribution, we usually mean some function of the target distribution (e.g., an expectation).

An F1 score can be computed between two sequences of text by tokenizing each sequence and computing precision and recall based upon whether certain tokens appear in each sequence.

Specifically, the discussion section of DeepSeekMath mentions that RL training primarily improves Maj@N capabilities, rather than Pass@N. In other words, the LLM may not learn to solve net new problems, but it becomes much more reliable at solving problems that were already within its scope.

Pass@N is an evaluation technique in which we generate N completions from an LLM and count the model as correct if at least one of these N completions is correct. Larger values of N give the LLM more “shots” at correctly solving an answer.

This exact threshold (also called a length budget) is determined empirically via vibe tests in which researchers tested different values, examined performance metrics, and manually inspected the model’s resulting verbosity.

Group Relative Policy Optimization (GRPO)

Cameron R. Wolfe, Ph.D. — Mon, 24 Nov 2025 10:33:31 GMT

(from [1, 19])

Reinforcement learning (RL) has always played a pivotal role in research on large language models (LLMs), beginning with its use for aligning LLMs to human preferences. More recently, researchers have heavily focused on using RL training to improve LLM reasoning performance. This line of research has led to a rapid expansion of LLM capabilities over the last few years. The objective of RL training (e.g., alignment or reasoning) has changed over time, along with the RL optimizers that are used to achieve these goals. Most early work on RL for LLMs used Proximal Policy Optimization (PPO) as the default RL optimizer, but recent reasoning research relies upon Group Relative Policy Optimization (GRPO).

This overview will provide a deep dive into GRPO, where it comes from, how it works, and the role it has played in creating better large reasoning models (LRMs). As we will learn, RL training—even with GRPO—is a complex process that presents a seemingly endless frontier of open research questions. However, GRPO is a refreshingly simple—and effective—algorithm that is more efficient and approachable than its predecessors. These characteristics allow GRPO to democratize RL research and, in turn, accelerate progress on both:

Building a better collective understanding of RL for LLMs.
Training more powerful reasoning models.

Basics of RL. We will not discuss the basics of RL (e.g., terminology, problem setup, or policy gradients) in this overview. To gain a more comprehensive grasp of the foundational ideas in RL that are useful for understanding GRPO, please see the following excerpts from prior articles:

RL Problem Setup & Terminology [link]
Different RL Formulations for LLMs [link]
Policy Gradient Basics [link]

Reinforcement Learning (RL) for LLMs

(from [19])

To begin our discussion, we will cover some preliminary details on reasoning models and reinforcement learning (RL). Specifically, we will first discuss the two most common RL frameworks used for training LLMs (depicted above):

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a reward model trained on human preferences.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rule-based or deterministic verifiers.

After this discussion, we will provide further details on large reasoning models (LRMs), which are LLMs that have been extensively trained (via RLVR) to hone their complex reasoning capabilities. This discussion is relevant to GRPO, as it is currently the most common RL optimizer—at least for open LLMs—to use for training LRMs with RLVR. In fact, GRPO gained popularity primarily through its use in training open reasoning models like DeepSeek-R1 [8]!

General RL setup. The main difference between RLHF and RLVR lies in how we assign rewards—RLHF uses a learned reward model, while RLVR uses verifiable (or rules-based) rewards. Despite this difference, these are both online RL algorithms that follow a similar training framework; see below.

General framework for online RL

We first sample a batch of prompts and generate a completion—or multiple completions—for each prompt in the batch using our current policy. A reward is computed for each completion, which can then be used to derive a policy update using our RL optimizer of choice—this is where GRPO comes in! GRPO is a generic RL optimizer that is used to compute the policy update (i.e., the update to our LLM’s weights) during RL training. GRPO is usually used for RLVR, while PPO is usually used for RLHF. However, RL optimizers are generic, and technically any RL optimizer can be used to derive the policy update in these frameworks.

Reinforcement Learning from Human Feedback (RLHF)

(from [16])

The first form of RL training to be popularized in the LLM domain was Reinforcement Learning from Human Feedback (RLHF). Early post-ChatGPT LLMs were almost always post-trained using the following three-step alignment procedure (depicted above), as proposed by InstructGPT [16]:

Supervised finetuning (SFT)—a.k.a. instruction finetuning (IFT)—trains the model using next-token prediction over examples of good completions.
A reward model is trained over a human preference dataset.
Reinforcement learning (RL)—usually with PPO—is used to finetune the LLM with the reward model as the reward signal.

The second and third steps of this procedure are collectively referred to as RLHF. This framework actually involves two training procedures: a supervised learning phase for the reward model and an RL training phase for the LLM.

(from [17])

Preference data is the foundation of RLHF. Each element of a preference dataset consists of a prompt, two completions to that prompt, and a preference label—assigned either by human or an AI or LLM judge—indicating which completion is preferred to the other. Specifying an explicit reward for an LLM is very difficult—how do we reliably determine whether a completion is “good” or not when the model has so many diverse capabilities? Instead of answering this question directly, we can instead collect preference data, which captures preferred model behavior via examples of ranked model responses for a particular prompt. A typical interface for collecting preference annotations can be seen in the figure below.

(from [18])

Choosing the better model response is relatively intuitive, though it does require detailed guidelines on alignment criteria to ensure data quality. Preference data is used extensively in LLM post-training because:

We can use it to train our model to produce human-preferable responses.
We just have to select a preferred response (rather than define an explicit reward signal or manually write responses from scratch).

After collecting sufficient preference data, we have many examples of preferred model behavior that can be used to align our LLM to human (or AI-generated) preferences. We can directly train an LLM on this preference data using a direct alignment algorithm like Direct Preference Optimization (DPO), but we usually incorporate this data into RL by first using it to train a reward model.

Reward model architecture

Reward models. A reward model is a specialized LLM—usually a copy of the LLM we are training with an added regression head (depicted above)—that is finetuned to predict a human preference score given a prompt and candidate completion as input. Specifically, the reward model is finetuned on our preference data using a ranking loss function that is derived from the Bradley-Terry model; see below.

Reward model loss function

Put simply, this loss function teaches the reward model to assign a higher score to the preferred response in a preference pair relative to the rejected response. The reward model is trained over paired preference data, but we see above that the model outputs an individual preference score for each completion in the pair. More details on reward models can be found in the overview below.

Input and output structure of a reward model

PPO & RLHF. Once the reward model has been trained over the preference data using this loss, the model learns how to assign a preference score to each model completion; see above. We can directly use this reward model as a reward signal for RL training. For RLHF, we usually use Proximal Policy Optimization (PPO) [12], which we will cover later in more detail, as the underlying RL optimizer.

“Reward models broadly have been used extensively in reinforcement learning research as a proxy for environment rewards.” - RLHF book

Our LLM is indirectly trained on human feedback via the reward model. We begin with a preference dataset, which captures human preference via concrete examples of ranked model outputs. This data is used to train a reward model that can assign accurate preference scores to arbitrary outputs from the LLM. During training with RL, we generate new outputs—or on-policy samples—from our LLM and score them with the reward model. These scores serve as the reward signal, and our RL optimizer updates the model’s weights to maximize rewards. Since the reward here is the output of our reward model, we are maximizing preference scores. In this way, the RL training process guides the LLM to produce outputs that align with human preferences, as estimated by the reward model.

Schematic depiction of RLHF (from [19])

Impact of RLHF. The ability to align an LLM to human preferences is a hugely impactful technology that catalyzed the popular use of LLMs. If we think about the differences between well-known LLMs like ChatGPT and their less widely-recognized predecessors, one of the key enhancements made to ChatGPT was the use of more sophisticated post-training. Specifically, ChatGPT was extensively aligned via SFT and RLHF, which significantly improved the model’s helpfulness. In this way, RL research—and RLHF in particular—played a pivotal role in creating the impressive and capable LLMs that we have today.

Reinforcement Learning from Verifiable Rewards (RLVR)

The reward in RLHF is derived from a reward model. This reward model requires its own training pipeline and validation, which adds costs and complexity to the RL training process. Our policy could also suffer from reward hacking, even when using a high-quality reward model. The policy explores the space of possible completions during RL to maximize rewards. If we continue running RL for long enough, however, the model may learn to maximize rewards via an exploit or hack in our reward model, rather than by generating better completions.

“Reinforcement Learning with Verifiable Rewards (RLVR) can be seen as a simplified form of… RL with execution feedback, in which we simply use answer matching or constraint verification as a binary signal to train the model.” - from [13]

Put simply, reward models—despite their incredible impact through RLHF—have downsides. Reinforcement Learning from Verifiable Rewards (RLVR) chooses to avoid reward models, instead deriving rewards from manually verifiable and deterministic sources (e.g., rules or heuristics). Using verifiable rewards instead of neural reward models reduces the risk of reward hacking and makes extensive, large-scale RL training more feasible by making rewards harder to game.

Schematic depiction of RLVR (from [19])

Verifiable domains and rewards. To train an LLM with RLVR, we must select a domain that is verifiable in nature; e.g., math or coding. In other words, we need to create a dataset that has either i) a known ground truth answer or ii) some rule-based technique that can be used to verify the correctness of an answer for each prompt in our dataset. For coding, we can create a sandbox for running LLM-generated code and use test cases to assess correctness. Similarly, we can evaluate math problems by performing basic string matching between the answer predicted by the LLM and a ground-truth answer for a problem; see below.

Verifying a problem with exact string matching

Usually, we must instruct the LLM to format its output such that the final answer can be easily parsed. Even then, however, string matching is not always sufficient for evaluating correctness. In many cases, we can benefit from crafting validation logic that is more robust (e.g., asking an LLM to tell us if two answers are the same [20]) and that captures variations in format for similar or identical outputs.

“Math verification is determined by an LLM judge given the ground truth solution and DeepSeek-R1 solution attempt. We found that using an LLM judge instead of a stricter parsing engine (Math-Verify) for verification during data generation results in a higher yield and leads to higher performing downstream models.” - from [20]

Applications of RLVR. Beyond substituting a reward model with verifiable rewards, the RL component of RLVR is unchanged. However, RLHF and RLVR differ in their purpose and application:

RLHF is usually implemented with PPO as the underlying RL optimizer, while GRPO is the most common RL optimizer for RLVR.
RLHF focuses on LLM alignment with preference feedback, while RLVR is used to improve the complex reasoning capabilities of an LLM.

Most recent research on LLMs and RL is heavily focused on creating LLMs with better reasoning capabilities, known as large reasoning models (LRMs). The training process for LRMs is centered around performing RLVR on domains like math and coding. In these training setups, GRPO is the most commonly used RL optimizer—at least for open LLMs. As we will see in this overview, several notable results have already been achieved from using RLVR (with GRPO) to train LRMs. However, this area of research is still incredibly active and dynamic. Examples of popular topics being explored in this area include:

Large Reasoning Models (LRMs)

As mentioned before, RLVR and GRPO can be used to improve the reasoning capabilities of LLMs on verifiable tasks, and research on this topic has led to the creation of large reasoning models (LRMs). The key distinction between an LRM and a standard LLM is the ability to dynamically “think” about a prompt prior to providing a final output. By increasing the length of the thinking process, these LRMs can use inference-time scaling—or simply spend more compute on generating a completion—to improve their performance.

“We’ve developed a new series of AI models designed to spend more time thinking before they respond.” - from [4]

One of the first such models to be released was OpenAI’s o1-preview, which was predated by a long series of rumors about OpenAI developing a new series of LLMs with complex reasoning capabilities. This model has since been followed by a massive number of new closed (e.g., o3 / o4 or Gemini 3) and open (Qwen-3, DeepSeek-R1, and Olmo-3) LRMs as the research community continues to iterate on these ideas. Interestingly, the popularization of LRMs has also led to a proliferation of open models—mostly proposed after DeepSeek-R1 [8], which we will discuss later on. Recent open LRM releases like Kimi-K2 [14] have even started to match or exceed the performance of closed models; see below.

(from [14])

How do LRMs work? LRMs and LLMs are identical architecturally1. They are both based upon decoder-only transformers, potentially with a Mixture-of-Experts (MoE) architecture. Their main difference lies in how they generate output. At a high level, LRMs operate by allowing the model to “think” prior to producing a final output. This thinking process occurs in the form of a long, free-text chain-of-thought (CoT)—also called a rationale or reasoning trajectory—that is generated by the LLM. Most closed LRMs hide this reasoning trajectory from the end-user for safety purposes2. The user sees only the model’s final output and (optionally) a truncated summary of the reasoning process.

(from [9])

For open LRMs, we can observe the model’s reasoning process and final output. Concretely, LRMs use special tokens to separate their reasoning process from their actual output. The reasoning trajectory is generated first and is wrapped between tokens. The model ends its reasoning process with a token, then proceeds to generate a final response; see below.

Concrete example of LRM output in Qwen-3 prompt format

Reasoning trajectories. If we look at some examples of reasoning trajectories from open or closed LRMs, we will notice that these models exhibit sophisticated reasoning behaviors in their long CoT:

Thinking through each part of a complex problem.
Decomposing complex problems into smaller, solvable parts.
Critiquing solutions and finding errors.
Exploring many alternative solutions.

In many ways, the model is performing a complex, text-based search process to find a viable solution to a prompt. Such behavior goes beyond any previously-observed behavior with standard LLMs and chain of thought prompting. With this in mind, we might begin to wonder: How does the model learn how to do this?

LRM training. LRMs also differ from standard LLMs in their training methodology. Though exact post-training details may vary significantly between models, both LLMs and LRMs undergo similar pretraining and alignment phases that consist of supervised finetuning (SFT) and RLHF.

However, LRMs extend this standard training process by performing large-scale RLVR on verifiable domains like math and code. Because verifiable reward signals are less prone to reward hacking, we can perform larger-scale RL training (i.e., by running the training process longer) with less risk of training collapse. Several works [8, 9] have shown that LRMs obey a predictable scaling law with respect to the amount of compute used during RL training, meaning that we can achieve better performance by increasing the number of RL training steps.

“We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process.” - from [8]

The complex reasoning behaviors of an LRM are not directly encoded into the model in any way. Rather, this behavior naturally emerges from large-scale RL training. The LRM undergoes an RL-powered self-evolution as it attempts to solve problems and is rewarded for finding correct solutions. From this process, the model learns to properly leverage its reasoning trajectory. We will continue discussing the details of RL training for LRMs throughout the remainder of this post, but the key idea here is to:

Create the correct incentives for RL training—usually a deterministic or rule-based reward signal that is at low risk for reward hacking.
Run large-scale RL training with these reliable reward signals.
Allow sophisticated model behavior to naturally emerge.

Powerful LRMs are a product of large-scale RL with the correct incentives, but there are many practical details involved in properly incentivizing and scaling the RL training process—this is still a very active area of research [15].

Are LRMs a silver bullet? Given the impressive performance of LRMs in complex reasoning domains, we might naively believe that LRMs will outperform standard LLMs at all tasks. However, the story is not this simple—LRMs are not always the best tool to use. Because the training process for LRMs is focused on verifiable domains like math and code, their performance may be biased towards these domains—and away from non-verifiable domains like creative writing.

“Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to overthinking.” - Sebastian Raschka

LRMs may also have deficiencies in alignment (e.g., instruction following or reading-friendly formatting) relative to standard LLMs. However, most of these issues are being solved as we continue to study the interplay between RLHF and RLVR. We should use LRMs for the domains in which they excel but be sure to test their performance in non-verifiable domains. Using a standard LLM may be sufficient—or better—and is usually more efficient in terms of inference-time compute.

GRPO from Idea to Implementation

Now that we understand how RL is used to train LLMs (and LRMs), we will take a deeper look at common RL optimizers used to derive policy updates for RLHF and RLVR. To begin, we will learn about Proximal Policy Optimization (PPO) [12] before moving on to the main topic of this overview—Group Relative Policy Optimization (GRPO) [1]. GRPO is inspired by PPO and shares some of its core ideas. However, GRPO also goes beyond PPO by making several changes to simplify the algorithm while maintaining effectiveness for LLM training.

Proximal Policy Optimization (PPO) [12]

GRPO is heavily based upon the Proximal Policy Optimization (PPO) algorithm [12]. PPO was used in seminal work on RLHF and, as a result, became the default RL optimizer in the LLM domain for some time. Only recently with the advent of LRMs have alternative algorithms like GRPO started to become popular.

(from [12])

The structure of PPO is outlined above. As we can see, each training iteration of PPO performs the following sequence of steps:

Sample a diverse batch of prompts.
Generate a completion from the policy for each prompt.
Compute advantage estimates for each completion.
Perform several policy updates over this sampled data.

Surrogate objective. During PPO, we formulate a surrogate objective3 that is optimized with respect to the parameters of our policy. The PPO surrogate objective is based upon the policy ratio between the current policy and an old model (i.e., the policy as it existed before the first update in a training step). The policy ratio—also called the importance ratio—stabilizes the training process by comparing the new policy’s token probabilities to the old policy and applying a weight (or importance) to training that helps to avoid drastic changes; see below.

Policy or importance ratio

The PPO surrogate objective

In PPO, the surrogate objective is simply the minimum of clipped and unclipped objectives, which makes it a pessimistic (lower bound) estimate for the unclipped objective. The behavior of the surrogate loss’ clipping mechanism changes depending on the sign of the advantage. The possible cases are shown below.

(from [12])

As we can see, taking the minimum of clipped and unclipped terms in the surrogate objective causes clipping to be applied in only one direction. The surrogate objective can be arbitrarily decreased by moving the policy ratio away from one, but clipping prevents the objective from being increased beyond a certain point by limiting the policy ratio. In this way, the clipping mechanism of PPO disincentivizes large policy ratios and, in turn, maintains a trust region by preventing large policy updates that could potentially damage our policy.

“We only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.” - from [12]

KL divergence. When training LLMs with PPO, we usually incorporate the KL divergence between the current policy and a reference policy—like the SFT model—into training. The KL divergence serves as a penalty that encourages similarity between the current and reference policies. We compute the KL divergence by comparing token distributions from the two LLMs for each token in a sequence. The easiest—and most common—way to approximate KL divergence [7] is via the difference in log probabilities between the policy and reference; see here.

After the KL divergence has been computed, there are two primary ways that it can be incorporated into the RL training process:

By directly subtracting the KL divergence from the reward.
By adding the KL divergence to the loss function as a penalty term.

PPO adopts the former option by subtracting the KL divergence directly from the reward signal used in RL training as shown in the equation below.

Adding KL to the reward in PPO

Advantage estimation. The advantage function, a key part of PPO’s surrogate objective, is the difference between the action-value and value function: A(s, a) = Q(s, a) - V(s). The value function in PPO is estimated with a learned model called the value model or critic. This critic is usually a separate copy of our policy, or—for better parameter efficiency—an added value head that shares weights with the policy. The critic takes a completion as input and predicts expected cumulative reward on a per-token basis by using an architecture that is similar to that of a reward model (i.e., a transformer with a regression head)4; see below.

The value function is also on-policy, meaning it depends on the current parameters of our policy. Unlike reward models, which are fixed at the beginning of RL training, the critic is trained alongside the LLM in each policy update to ensure its predictions remain on-policy. This is known as an actor-critic setup. To handle this, we can add an extra mean-squared error (MSE) loss term—between the rewards predicted by the critic and actual rewards—to the surrogate loss for PPO.

The critic can be used to compute the advantage via Generalized Advantage Estimation (GAE) [13]. The details of GAE are beyond the scope of this post. We will only cover GAE at a high level, but a full explanation can be found here. GAE builds upon the concept of a temporal difference (TD) residual; see below.

The TD residual

The TD residual uses per-token value predictions from the critic to form a one-step estimate of the advantage. Put simply, the TD residual is analyzing how much the reward changes after predicting a single token relative to the expected reward. However, the TD residual only uses a small amount of actual reward information (i.e., the reward at step t) to estimate the advantage, which causes the estimate to become biased5. To solve this issue, we can generalize the single-step TD residual to form a series of N-step advantage estimators; see below.

N-step advantage estimators

Similarly to the single-step TD residual, advantage estimators with lower values of N have low variance but high bias. As we increase the value of N, however, we are incorporating more exact reward information into the advantage estimate, thus lowering the bias (and, in turn, increasing variance). GAE tries to find a balance between these two ends of the spectrum by i) using all values of N and ii) taking an average of these advantage estimates. This is accomplished with the mixing parameter λ for GAE, as shown below.

GAE formulation

The value of λ ∈ [0, 1] controls the bias variance tradeoff. We can toggle the value of λ in GAE as needed to stabilize the training process6. For example, if training is unstable, we can decrease λ to yield lower variance policy updates.

Complexity of PPO. As we might infer from the above discussion, PPO is not a simple algorithm—there are many more details to be learned. For a more complete overview of PPO, please see the article linked below. However, we need to briefly discuss the key limitations of PPO to serve as motivation for GRPO.

There are a total of four models included in PPO’s training process: two that are being trained (i.e., the policy and the critic) and two that are used for inference (i.e., the reference and reward model). The fact that the critic must be trained in tandem with the policy complicates the training process, increases compute costs, and consumes a lot of memory. Plus, there are many additional nuances and settings that must be carefully tuned to arrive at a working PPO implementation (e.g., GAE, value model setup, reward model setup, clipping, and more).

“During RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. In the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.” - from [1]

Can we simplify PPO? Much of the complexity of PPO—though not all!—stems from estimating the per-token value function with the critic. Recent work has questioned the need for this critic, arguing that critic-free RL algorithms like REINFORCE can be used instead of PPO to train LLMs with no performance degradation. This argument stems from a few key observations:

Avoiding high-variance policy updates—which is the key benefit of PPO and a limitation of simpler RL optimizers like REINFORCE—is less of a concern for LLMs because we are finetuning models that are extensively pretrained.
LLMs are mostly trained using outcome rewards, which makes estimating advantage on a per-token basis unnecessary. How can we learn an accurate per-token value estimate from outcome rewards only? Modeling the advantage and reward on a completion level should be sufficient for LLMs in this case.

GRPO provides further empirical support for these claims in the LLM domain. Specifically, GRPO forgoes the critic and estimates advantage by averaging rewards for multiple completions to the same prompt. Each token in GRPO receives the same advantage estimate, rather than attempting to assign credit on a per-token basis from a sequence-level (outcome) reward signal.

Group Relative Policy Optimization (GRPO)

(from [1])

Group Relative Policy Optimization (GRPO) [1] builds upon PPO by proposing a simpler technique for estimating the advantage. In particular, GRPO estimates the advantage by sampling multiple completions—or a “group” of completions—for each prompt and using the rewards of these completions to form a baseline. This group-derived baseline replaces the value function, which allows GRPO to forgo training a critic. Avoiding the critic drastically reduces GRPO’s memory consumption and training complexity compared to PPO.

“We introduce the Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.” - from [1]

Advantage estimation in GRPO. Instead of using a learned value model, GRPO estimates the advantage by sampling multiple completions for each prompt in the batch and using the formulation shown below to compute the advantage.

Advantage computation in GRPO

In GRPO, completions to the same prompt form a group, and we calculate the advantage relative to other rewards observed in the group—hence, the name “group relative” policy optimization! More specifically, the advantage for completion i is calculated by first subtracting the mean reward over the group from r_i, then dividing this difference by the standard deviation of rewards over the group. We are still assuming an MDP formulation in this discussion, but the formulation above assigns the same advantage to every token t in the sequence i.

“GRPO is often run with a far higher number of samples per prompt because the advantage is entirely about the relative value of a completion to its peers from that prompt.” - RLHF book

Surrogate loss. Despite estimating the advantage differently, GRPO uses a surrogate loss that is nearly identical to that of PPO. Both of these optimizers make use of the same clipping mechanism for the policy ratio; see below.

GRPO surrogate loss

This expression assumes an MDP formulation and has been modified to explicitly aggregate the loss over multiple completions within a group. In contrast, we previously formulated the loss for PPO as an expectation over completions.

One key difference between PPO and GRPO is the KL divergence term being subtracted as a penalty term from the surrogate loss rather than incorporated into the per-token reward. Additionally, GRPO does not always perform multiple policy updates per batch of data. If we only perform a single policy update per batch, we have π_θ = π_old, which simplifies the clipped objective to the expression shown below7. See here for more discussion on this topic.

Simplification of the clipping term with a single update

Extension to process rewards. Most implementations of GRPO use outcome rewards, as this is the most common setting for an LLM. However, we can extend GRPO to handle process rewards (e.g., after each reasoning step) by:

Normalizing rewards based on the mean and standard deviation of all process rewards observed in the group.
Computing the advantage of each token as the sum of normalized rewards for following steps in the reasoning trajectory.

When using outcome rewards, each token is assigned the same advantage by GRPO, but this approach changes when using process rewards. The advantage is estimated for each token based on rewards observed in following steps of the trajectory, which changes depending on the position of a token. Additionally, we must now consider all rewards—including multiple rewards in each trajectory—when computing the mean and standard deviation metrics for GRPO.

Memory consumption. In PPO, we are training two models—the policy and the critic—in tandem. Additionally, we are running real-time inference for both the reward model and the reference policy, yielding a total of four models that must be managed. The need to train two models drastically increases the memory footprint of PPO. Assuming we use half precision (bf16 or fp16), we can host an LLM using ~2GB of memory for every 1B model parameters; e.g., inference with Qwen-3-32B should require ~60-70GB of memory. Notably, this calculation only accounts for loading the model’s weights into GPU memory, and memory usage can vary quite a bit depending on the maximum context length being used8.

Illustration of memory consumption during training and inference

In contrast, training a model in half precision usually requires ~16GB of memory per 1B model parameters, which varies depending on the details of the training setup9. Similarly to inference, we load the model weights into GPU memory for training, but we must also store other training-related data (e.g., optimizer states and gradients). We also need enough GPU memory to store model activations during training, so memory consumption still increases with context length.

“As LMs are scaled up, computing gradients for backpropagation requires a prohibitive amount of memory—in our test, up to 12× the memory required for inference—because it needs to cache activations during the forward pass, gradients during the backward pass, and, in the case of Adam, store gradient history.” - source

With this in mind, the fact that GRPO does not use a critic not only saves on compute costs relative to PPO, but it drastically reduces memory consumption—we are now training a single model instead of two models. Eliminating a trainable model has a much larger impact on memory consumption compared to removing a model that is only used for inference (e.g., the reward model).

GRPO & reward models. GRPO became popular primarily in the context of LRM training with RLVR. For this reason, GRPO is mostly used in verifiable reward settings without a neural reward model. A common misconception about GRPO is that it eliminates the need for a reward model, but GRPO can be used with or without a reward model. In fact, the original GRPO paper used a reward model instead of verifiable rewards [1]! Removing the reward model is a benefit of verifiable rewards, not an intrinsic benefit of GRPO itself—the primary advantage of GRPO is the elimination of the critic.

Implementing GRPO

To make this discussion more concrete, let’s implement the GRPO loss function in PyTorch pseudocode. This implementation is adapted from the RLHF book 10, which has a fantastic explanation of GRPO and other policy gradient algorithms.

In the code below, B is our batch size, G is the group size, and L is the context length or number of tokens in each completion. We present two options for approximating KL divergence, including a simple KL estimate (kl_div) that is commonly used for LLMs and a slightly more complex variant (kl_div_alt) that matches the approximation used in the original GRPO paper [1]. More details on why this particular KL divergence estimate is used will be provided later on.

import torch
import torch.nn.functional as F

# constants
kl_beta = 0.1
eps = 0.2

# sample G completions for B prompts
# compute outcome reward for each completion
with torch.no_grad():
    completions = LLM.generate(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G)

# create a padding mask from lengths of completions in batch
completion_mask = <... mask out padding tokens ...>

# get policy logprobs for each action
llm_out = LLM(completions)
per_token_logps = F.log_softmax(llm_out, dim=-1)  # (B*G, L)

# get reference logprobs for each action
ref_out = REF(completions)
ref_per_token_logps = F.log_softmax(ref_out, dim=-1)  # (B*G, L)

# compute KL divergence between policy and reference policy
kl_div = per_token_logps - ref_per_token_logps

# alternative KL divergence used by DeepSeekMath [1]
kl_div_alt = (
    torch.exp(ref_per_token_logps - per_token_logps)
    - (ref_per_token_logps - per_token_logps)
    - 1
)

# compute mean and std of grouped rewards
reward_mean = rewards.view(-1, G).mean(dim=1)  # (B,)
reward_std = rewards.view(-1, G).std(dim=1)  # (B,)

# compute advantage for GRPO
advantage = (rewards.view(-1, G) - reward_mean)
advantage /= (reward_std + 1e-8)  # (B, G)
advantage = advantage.view(-1, 1)  # (B*G, 1)

# compute the policy ratio
policy_ratio = torch.exp(
    per_token_logps - old_per_token_logps,
)  # (B*G, L)
clip_policy_ratio = torch.clamp(
    policy_ratio,
    min=1.0 - eps,
    max=1.0 + eps,
)

# compute clipped loss
loss = torch.min(
    advantage * policy_ratio,
    advantage * clip_policy_ratio,
)  # (B*G, L)

# kl divergence added as penalty term to loss
loss = -loss + kl_beta * kl_div

# aggregate the loss across tokens (many options exist here)
loss = ((loss * completion_mask).sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# perform policy gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()

The implementation above relies upon old_per_token_logps to compute the policy ratio. The old policy refers to the initial policy parameters prior to any policy updates being performed for a batch of data. Before the first update for a batch, we must store these log probabilities so that they can be used for several subsequent policy updates over the same batch. The code above only outlines a single policy update, but if this were our first update over a batch of data we could simply set old_per_token_logps = per_token_logps.detach(). Then, we could re-run this code—excluding the part that samples new completions and computes their rewards—to perform several policy updates over the batch.

Key Publications with GRPO

We now understand the key ideas underlying GRPO, which are relatively simple compared to optimizers like PPO. Next, we will build upon this understanding by outlining a few key papers that demonstrate the practical application of GRPO. Specifically, we will review DeepSeekMath [1] and DeepSeek-R1 [8]. The former paper proposed the GRPO algorithm in the context of training specialized LLMs for solving math problems. This work was later extended by DeepSeek-R1, which used GRPO to train a state-of-the-art open LRM using RLVR. As we will see, this was the first open model to nearly match the performance of closed LRMs like OpenAI’s o1 [9], which led to a subsequent explosion of open LRM releases.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models [1]

GRPO was proposed with the release of DeepSeekMath [1], a small and open language model for mathematical reasoning. DeepSeekMath uses a combination of i) continued pretraining on a high-quality, math-focused corpus and ii) further training with RL to surpass the performance of similar open-source LLMs—and nearly match the performance of top proprietary models like GPT-4; see below.

(from [1])

Despite its far-reaching impact, GRPO was first proposed in [1] specifically for training domain-specific LLMs. Authors cite simplicity and memory efficiency as key benefits of GRPO relative to PPO. Additionally, we see in [1] that further RL finetuning via GRPO boosts the mathematical reasoning capabilities of even strong models that have already undergone extensive instruction tuning.

“Our research provides compelling evidence that the publicly accessible Common Crawl data contains valuable information for mathematical purposes…. We successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content.” - from [1]

The DeepSeekMath Corpus is a high-quality corpus of 120B math-focused tokens—mined from CommonCrawl—used for continued pretraining of DeepSeekMath models. The impressive performance of DeepSeekMath is partially attributed to the “meticulously engineered data selection pipeline” that produces this data. The high-level structure of this data selection pipeline is depicted below.

(from [1])

The DeepSeekMath corpus is created iteratively. During the first iteration of data selection, we train a fastText model to identify high-quality math content by using OpenWebMath [2] as a seed corpus. In other words, the OpenWebMath data is used as positive examples of high-quality math content, and we sample 500,000 data points from CommonCrawl to serve as negative examples (i.e., data that are not math-focused). The fastText model is then trained over this data to classify high-quality math content. After deduplicating the web pages in CommonCrawl, we have ~40B web pages that are then ranked by the output of the fastText model—the 40B top-scoring tokens are retained for further refinement.

We further refine this fastText classifier by grouping CommonCrawl into domains with the same base URL. A domain is considered to be “math-related” if more than 10% of the pages in this domain have been identified as math-related by the fastText model. Human annotators manually annotate the URLs in these math-related domains, allowing more math-focused examples to be identified for retraining the fastText model. This process is repeated three times, yielding a total of 120B math-focused tokens. Data collection ends after the fourth iteration because authors found that 98% of the identified data was already collected.

(from [1])

Is the data good? To validate the DeepSeekMath corpus’ quality, pretraining experiments are performed over several different datasets. Models trained on the DeepSeekMath corpus clearly lead on all downstream benchmarks. As shown above, the performance of these models has a steeper learning curve, indicating that the average quality of the DeepSeekMath corpus is higher relative to other math-focused corpora. Additionally, this new corpus is multilingual—primarily English and Chinese—and nearly an order of magnitude larger than alternatives.

DeepSeekMath-Base is the initial base model trained in [1] for mathematical reasoning. It is initialized with the weights of a code model—DeepSeek-Coder-7B-Base-v1.5 in particular—and undergoes continued pretraining on 500B tokens from the DeepSeekMath corpus (and other sources like arXiv papers, Github code, and general language data). DeepSeekMath-7B-Base outperforms other open-source base models on mathematical reasoning—both with and without tool use—and formal theorem proving tasks. Going further, we see in [1] that DeepSeekMath-7B-Base also retains key capabilities in other domains. For example, its performance on coding and general language / reasoning tasks is still strong.

“DeepSeekMath-Base 7B exhibits significant enhancements in performance on MMLU and BBH… illustrating the positive impact of math training on language understanding and reasoning… by including code tokens for continual training, DeepSeekMath-Base 7B effectively maintains the performance of DeepSeek-Coder-Base-v1.5 on the two coding benchmarks.” - from [1]

(from [3, 4, 5])

Instruction tuning. After continued pretraining, DeepSeekMath-Base undergoes an instruction tuning phase in which the model is trained with supervised finetuning (SFT) over a curated dataset for mathematical reasoning. Authors collect a set of math problems in both English and Chinese that span diverse fields and levels of complexity. Solutions to these problems are created using three different formats (depicted above):

Chain of Thought [3]: prompts the model to output intermediate reasoning steps prior to its final answer.
Program of Thoughts [4]: separates reasoning from computation by prompting the model to output its reasoning steps as a structured program that is then solved by an external code interpreter.
Tool-Integrated Reasoning [5]: teaches the model to perform complex mathematical reasoning via a trajectory of interleaved natural language reasoning and tool usage (e.g., computation libraries or symbolic solvers).

The final instruction tuning dataset contains a total of 776K examples and is used to train DeepSeekMath-7B-Instruct, starting from DeepSeekMath-7B-Base. As shown below, the instruction tuned model outperforms all other open-source models—even those that are much larger—on chain of thought and tool-integrated reasoning tasks. The model can perform relatively well with or without tools. DeepSeekMath-7B-Instruct also rivals the performance of proprietary models (e.g., Gemini Pro) in some cases but tends to lag behind top-performing models (e.g., Gemini Ultra and GPT-4), especially in the tool-integrated domain.

(from [1])

RL training with GRPO. The above table also presents the performance of DeepSeekMath-RL, which undergoes one final RL training phase using GRPO as the underlying optimizer. In fact, GRPO was initially proposed in [1], where authors cite the practicality of GRPO—specifically its memory efficiency, compute efficiency, and simplicity relative to PPO—as key design criteria. Although GRPO is usually used in tandem with verifiable rewards, authors in [1] score completions using a reward model. Additionally, an outcome reward setting is used, meaning that rewards are assigned at the end of a full completion.

“The group relative way that GRPO… calculates the advantages aligns well with the comparative nature of rewards models, as reward models are typically trained on datasets of comparisons between outputs on the same question.” - from [1]

More GRPO details. DeepSeekMath-7B-Instruct is further trained using GRPO over a subset of data from the instruction tuning set—some subsets of this data are purposely left out to test generalization capabilities. During training, the objective is regularized via an added KL divergence penalty between the current policy and the SFT model (i.e., DeepSeekMath-7B-Instruct). Interestingly, authors in [1] adopt a modified estimator of the KL divergence, as shown below.

Different techniques for approximating the KL divergence

Both of these expressions are valid estimators for the KL divergence; see [7] for details. The estimator for KL divergence that is typically used when training LLMs (top of the figure above) is unbiased but has high-variance. In fact, this estimator can oftentimes be negative in value, whereas the KL divergence is a non-negative metric. In contrast, the estimator used in [1] (bottom of the figure above) is both unbiased and has lower variance—it is guaranteed to be positive—which makes it a desirable estimator for the KL divergence. Due to its use in DeepSeekMath, this estimator has also been adopted in public implementations of GRPO (e.g., this estimator is used in the TRL GRPO trainer).

Training DeepSeekMath-7B-Instruct with GRPO yields the DeepSeekMath-7B-RL model. During GRPO training, we only perform a single policy update for each batch of data. On the other hand, it is common in PPO to perform 2-4 policy updates over the same batch of data [6]. Additionally, GRPO training uses batch sizes that are quite large—a total batch size of 1,024 with 16 prompts and a group size of 64 completions. Large batch sizes are characteristic of GRPO and tend to be a practical necessity for the training process to be stable. As mentioned previously, many samples per prompt are needed because we estimate the advantage purely based on other rewards that are observed within a group.

Impact of RL. After further RL training, DeepSeekMath-7B-RL is found to outperform all open-source models and the majority of proprietary models. Interestingly, the RL-trained model also outperforms DeepSeekMath-Instruct across all benchmarks, despite the constrained scope of its training data—only a small subset of the instruction tuning data (i.e., 144K of 776K total examples) is used during RL. This finding suggests that RL training generalizes well and tends to enhance both in-domain and out-of-domain performance.

“Does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning.” - from [1]

Code, math and beyond. One interesting aspect of the analysis in [1] is the focus upon understanding the interplay between coding and math. As shown in the table below, two training strategies are tested:

A two-stage pipeline that first trains on either code data or general data, then on math data.
A one-stage pipeline that (optionally) mixes code data into the math dataset.

In the two-stage pipeline, we see that training the model on coding data—as opposed to general data—prior to training on math data benefits the model’s downstream performance on math benchmarks; see below. This insight motivates initializing DeepSeekMath with the weights of a coding model.

(from [1])

In the one-stage pipeline, the impact of including code data is mixed. Including code in the data mixture helps to avoid catastrophic forgetting and retain coding abilities. However, this data mixture actually degrades performance on certain math benchmarks—particularly those that do not permit tool use—compared to just training on math data. However, this negative result may be due to issues in the data mixture. Namely, the one-stage pipeline uses 150B math tokens and 400B code tokens, which can cause coding capabilities to be prioritized over math.

“We observe the math training also improves model capability on MMLU and BBH benchmarks, indicating it does not only enhance the model’s mathematical abilities but also amplifies general reasoning capabilities.” - from [1]

Beyond studying the interplay between code and math, authors in [1] note that math-focused training tends to improve general model capabilities as well. For example, we see that DeepSeekMath models also have improved performance on general benchmarks like MMLU and BBH, as explained in the quote above.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [8]

(from [8])

Although GRPO was proposed in [1], the algorithm was more widely popularized by its use in training DeepSeek-R1 [8]. During the early days of LRMs, nearly all high-quality reasoning models—such as OpenAI’s o-series models [9]—were closed-source11. For this reason, there was a lot of speculation outside of top labs about how these models actually worked. DeepSeek-R1 [8] was the first open LRM to reach o1-level performance in a transparent way. As detailed in the report, this model is finetuned from DeepSeek-V3 [10]—a 671 billion parameter Mixture-of-Experts (MoE) model—using RLVR. The RL training process uses GRPO and is primarily focused on verifiable domains like math and coding.

(from [9])

Prior to the popularization of LRMs, the scale of RL training performed with LLMs was (relatively) small—post-training was a fraction12 of total LLM training cost. However, a new kind of scaling law emerged with LRMs [8, 9]; see above. Model performance was shown to smoothly improve with respect to:

The amount of compute spent on RL training.
The amount of inference-time compute (e.g., by generating multiple outputs or a single output with a longer rationale).

For this reason, the ratio of LLM training cost spent on post-training—and RL in particular—has rapidly increased. In [8], we see exactly this, where DeepSeek-R1 undergoes extensive RL training with GRPO to improve its reasoning abilities.

DeepSeek-R1-Zero is the first model proposed in [8]. This model is initialized with the weights of DeepSeek-V3 [10] and post-trained with large-scale RL. Unlike a standard post-training procedure, no SFT training is used for training R1-Zero—the model is trained purely with GRPO. Interestingly, we see in [8] that R1-Zero naturally learns through RL to leverage its reasoning trajectory to solve complex problems. This was the first open research effort to show that reasoning abilities could be developed in an LLM without supervised training.

“DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors.” - from [1]

This model was created by the same authors of DeepSeekMath [1], so R1-Zero also uses GRPO for RL training. Authors cite familiar reasons for this choice:

Reducing the computational cost of RL training.
Memory savings from eliminating the critic model.

Verifiable rewards. Authors in [8] choose to avoid using neural reward models when training R1-Zero due to issues with reward hacking in larger-scale RL training runs. Put simply, if we train the LLM for long enough, it will eventually figure out an exploit for the reward model. To solve this issue, R1-Zero is trained using RLVR—using only verifiable reward signals makes the RL training process harder to game. More specifically, two types of rewards are used:

Accuracy reward: evaluates whether the model’s response is correct.
Format reward: enforces a desired format on the model’s output.

The accuracy reward is computed using task-specific heuristics. For math problems, the model can provide its answer in a specified format, allowing us to verify via basic string matching. Similarly, coding problems can be verified by executing the code produced by the LLM in a sandbox over predefined test cases. In contrast, the format reward simply rewards the model for formatting its output correctly. As shown below, the output format for R1-Zero just uses special tokens to separate the model’s reasoning process from its final output or answer.

(from [8])

Matching o1. Despite using no SFT, R1-Zero shows clear progress in its reasoning capabilities. The model’s performance on AIME 2024 is plotted below as RL training progresses. Here, we see that performance improves smoothly with the amount of RL training, eventually reaching parity with o1-preview.

(from [8])

A performance comparison between R1-Zero and o1 models from OpenAI is provided below. R1-Zero matches or exceeds the performance of o1-mini in most cases and performs comparably to o1-preview on several tasks. However, R1-Zero is clearly outperformed by o1 models on coding tasks. As we will see, however, this coding issue was fixed in future iterations of the model.

The beauty of RL. We might begin to wonder how R1-Zero develops such impressive reasoning capabilities during RL training. Luckily, the model’s learning process is observable—we can just monitor the reasoning traces produced by the model over time. By doing this, we see (as shown below) that R1-Zero learns to generate progressively longer chains of thought to improve its reasoning process throughout training. In other words, the model naturally learns that using more inference-time compute is useful for solving difficult reasoning problems.

(from [8])

Additionally, R1-Zero learns to do more than just generate a long chain of thought. Authors in [8] also observe several meaningful behaviors that emerge naturally from RL training. For example, the model develops an ability to reflect upon its own solutions by revisiting and evaluating prior components of its reasoning process. Similarly, the model begins to explicitly test out and explore alternative solutions or approaches while trying to solve a problem.

“The self-evolution of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously.” - from [8]

Notably, this behavior is not explicitly programmed into the model. Rather, RL allows the model to explore different strategies for arriving at a correct solution. To steer the training process, we reward the model for producing correct answers with proper formatting. From these rewards alone, R1-Zero uses an RL-based “self-evolution” process to naturally learn how to solve reasoning problems. We simply create the correct incentives that facilitate the model’s learning process.

DeepSeek-R1. Despite the impressive reasoning abilities of DeepSeek-R1-Zero, the fact that the model is trained purely with RL—and thus forgoes common best practices for alignment and post-training—causes it to have some bugs. For example, its readability is poor (e.g., no markdown formatting to make its answers easier to read or parse), and it incorrectly mixes languages together. To solve these issues, authors in [8] train the DeepSeek-R1 model, which uses a multi-stage training process to find a balance between standard LLM capabilities and reasoning.

“To prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor.” - from [1]

Phase One: SFT Cold Start. Prior to RL training, R1 is trained via SFT over a small dataset of long CoT examples, which is referred to as “cold start” data. This data is collected using a few different approaches:

Prompt a model (e.g., DeepSeek-V3) to produce long CoT data, either with few-shot examples or by instructing the model to generate detailed answers with accompanied reflection and verification.
Use the R1-Zero model to generate a large number of long CoT outputs, then ask human annotators to post-process and select the model’s best outputs.

Authors in [1] combine these approaches to collect “thousands of cold-start data” on which DeepSeek-V3 is finetuned directly via SFT. Because we are using long CoT data, this is a reasoning-oriented finetuning process. From this cold start data, the model learns a viable (initial) template for solving reasoning problems. The reasoning-oriented SFT data introduces a human prior into training—we have control over the style and pattern of data used in this phase. For example, authors in [1] structure the data to include summaries of each long CoT, which teaches the model to summarize its reasoning process prior to its final answer. We are simply setting a stronger seed from which to start the RL self-evolution process13.

Stage Two: Reasoning-Oriented RL. After SFT, we repeat the large-scale RL training process with GRPO (i.e., the same RL training setup used for R1-Zero) to enhance R1’s reasoning capabilities. The only change made for R1 is the addition of a language consistency reward—calculated as the portion of the model’s output written in the desired target language—into RLVR. This language consistency reward is shown in [1] to slightly deteriorate the model’s reasoning capabilities. However, language consistency helps to avoid the language mixing observed in R1-Zero, which makes the model’s output more fluent and readable.

Stage Three: Rejection sampling. After the convergence of reasoning-oriented RL, we use the resulting model to collect a large and diverse SFT dataset. Unlike the initial cold start SFT phase, however, we collect both reasoning-focused and general data, allowing the model to learn from a broader set of domains. The reasoning data for this stage is collected by:

Curating a diverse set of reasoning-based prompts.
Generating candidate trajectories using the model from after stage two.
Performing rejection sampling (i.e., filtering and selecting the top trajectories based on quality and correctness).

Interestingly, the SFT dataset from this stage includes a substantial ratio of non-reasoning data (e.g., writing or translation examples) that is sourced from the post-training dataset for DeepSeek-V3. To match the style of data used for training R1, this data is augmented by adding a CoT—generated by another LLM—to explain the outputs of complex prompts. Simpler prompts are left with no rationale.

“We reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.” - from [1]

Unlike reasoning-oriented data, we cannot use rule-based verification for general-purpose data. Instead, authors in [8] use DeepSeek-V3 as a generative reward model or verifier for this data. After data verification and heuristic filtering (e.g., removing language mixing or long paragraphs), we have a set of 600,000 reasoning examples and 200,000 general-purpose examples, yielding a dataset of 800,000 examples over which we further finetune R1 using SFT.

Stage Four: RLVR & RLHF. The final training stage of R1 aligns the model with human preferences while continuing to hone its reasoning abilities. Similarly to the prior stage, we train the model over a combination of reasoning-based data and general-purpose data reused from the training pipeline of DeepSeek-V3. This stage uses RL with two styles of rewards:

Rules-based rewards (same as R1-Zero) for reasoning-based problems.
Neural reward models—trained over human preference pairs, just as in standard RLHF—for general-purpose data.

DeepSeek-R1 is aligned to be more helpful and harmless—two standard alignment criteria for LLMs—on general data. Each criterion is modeled using a separate reward model. For helpfulness, only the final answer (i.e., excluding the long CoT) from the model is passed into the reward model. On the other hand, harmlessness is predicted by passing the entire reasoning trajectory to the reward model. This combination of verifiable and preference-based (neural) rewards allows R1 to be aligned to human preferences while maintaining strong reasoning abilities.

(from [1])

R1 performance. As shown above, R1 matches or surpasses the performance of OpenAI’s o1 model on most reasoning tasks. Unlike R1-Zero, R1 also has strong coding abilities and can handle general-purpose tasks due to its hybrid training pipeline. In general, R1 is a capable model that can handle both traditional and reasoning-oriented tasks. However, we should note that differences exist between LRMs and LLMs—reasoning models are not clearly better in all areas. For example, R1 performs poorly on instruction following benchmarks (e.g., IF-Eval) compared to standard LLMs. However, this trend is likely to be reversed in the future as the balance between standard LLMs and reasoning continues to be refined.

Distilled variants of R1. Given that R1 is a very large model (i.e., 671B parameter MoE), the main R1 model is also distilled to create a series of smaller, dense models. A very simple pipeline is adopted for distillation. Beginning with two base models (i.e., Qwen-2.5 and Llama-3), we simply:

Generate ~800,000 supervised training examples by sampling completions from the full DeepSeek-R1 model.
Finetune the base models using SFT over this data.

This is the simplest form of distillation that can be used, which just trains the student on completions from the teacher using SFT. Such an approach is referred to as off-policy distillation [11]. This off-policy distillation procedure works well for the R1 model. In fact, distilling from R1 actually outperforms direct training of smaller models with RL; see below. However, we can usually achieve better performance via logit distillation (i.e., training the student model on the full log probabilities outputted by the teacher for each token) or on-policy distillation.

(from [8])

Conclusion

The advent of large reasoning models has completely transformed LLM research, especially the domain of reinforcement learning. For years, research on RL has centered around complex algorithms like PPO that require substantial domain knowledge and extensive compute resources. As a result, much of the research in this area has been confined to a handful of top research labs. This trend has recently changed, however, as open reasoning models and simpler RL algorithms like GRPO have become increasingly popular. Today, there are more public resources than ever before for doing useful research at the intersection of RL and LLMs. Hopefully, the details outlined in this post will contribute to further democratizing research on this important and rapidly evolving topic.

New to the newsletter?

Subscribe now

Bibliography

[1] Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).

[2] Paster, Keiran, et al. “Openwebmath: An open dataset of high-quality mathematical web text.” arXiv preprint arXiv:2310.06786 (2023).

[3] Wei, Jason, et al. “Chain-of-thought prompting elicits reasoning in large language models.” Advances in neural information processing systems 35 (2022): 24824-24837.

[4] Chen, Wenhu, et al. “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.” arXiv preprint arXiv:2211.12588 (2022).

[5] Gou, Zhibin, et al. “Tora: A tool-integrated reasoning agent for mathematical problem solving.” arXiv preprint arXiv:2309.17452 (2023).

[6] Lambert, Nathan. “Reinforcement Learning from Human Feedback.” Online (2025).

https://rlhfbook.com

[7] Schulman, John. “Approximating KL Divergence.” Online (2020). http://joschu.net/blog/kl-approx.html.

[8] Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).

[9] OpenAI et al. “Learning to Reason with LLMs.” https://openai.com/index/learning-to-reason-with-llms/ (2024).

[10] Liu, Aixin, et al. “Deepseek-v3 technical report.” arXiv preprint arXiv:2412.19437 (2024).

[11] Lu, Kevin et al. “On-Policy Distillation.” https://thinkingmachines.ai/blog/on-policy-distillation/ (2025).

[12] Schulman, John, et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).

[13] Schulman, John, et al. “High-dimensional continuous control using generalized advantage estimation.” arXiv preprint arXiv:1506.02438 (2015).

[14] Team, Kimi, et al. “Kimi k2: Open agentic intelligence.” arXiv preprint arXiv:2507.20534 (2025).

[15] Khatri, Devvrit, et al. “The art of scaling reinforcement learning compute for llms.” arXiv preprint arXiv:2510.13786 (2025).

[16] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744.

[17] Stiennon, Nisan, et al. “Learning to summarize with human feedback.” Advances in neural information processing systems 33 (2020): 3008-3021.

[18] Bai, Yuntao, et al. “Training a helpful and harmless assistant with reinforcement learning from human feedback.” arXiv preprint arXiv:2204.05862 (2022).

[19] Lambert, Nathan, et al. “Tulu 3: Pushing frontiers in open language model post-training.” arXiv preprint arXiv:2411.15124 (2024).

[20] Bespoke Labs et al. “Scaling up Open Reasoning with OpenThinker-32B.” https://www.bespokelabs.ai/blog/scaling-up-open-reasoning-with-openthinker-32b (2025).

In fact, some researchers argue that the distinction between an LLM and an LRM is an unnecessary gray area—they are still the same types of models.

Frontier labs have argued that the LRM’s chain of thought is a useful artifact for monitoring the model for harmful behavior. To maintain our ability to monitor, the reasoning process is usually kept “unsafe”—we apply no safety post-training to it to ensure that the model does not learn to omit info from its reasoning process for safety purposes. As a result, the reasoning process is potentially unsafe and will be kept that way for monitoring benefits) and cannot be directly exposed to the end user. Alternatively, top labs could be simply omitting the reasoning trajectory to make distilling from their best reasoning models more difficult.

This naming stems from the fact that the surrogate objective is different from the RL training objective. In RL, we aim to maximize cumulative reward. However, directly maximizing this objective can lead to instability. The surrogate is a more stable proxy that can be optimized in place of the true objective.

The critic is very similar to a reward model—both models predict rewards. However, the critic predicts reward per-token, while a reward model usually predicts outcome rewards for an entire completion. Additionally, reward models are usually fixed during RL training while the critic is trained alongside the policy itself.

The bias comes from relying on an approximate value model for this estimate and only using a small amount of exact reward information r_t.

A commonly used setting for λ is ~0.95.

The stop gradient is used here because, when using the GRPO loss function, we are computing the gradient of the loss with respect to our policy. Usually, the policy in the denominator of this expression is the old policy. We consider the output of this policy to be a constant when computing the gradient. When performing only a single policy update per batch of data, the old policy is equal to our current policy, but we still consider this denominator term a constant when computing the gradient. This is accomplished via the stop gradient operation.

For example, hosting Qwen-3-32B in half precision with its full context length (131K tokens) would increase the memory footprint from ~70GB to ~400GB.

This exact number will vary drastically depending on our exact training settings. For example, this calculation assumes that we are using the AdamW optimizer, which maintains three separate optimizer states for every model parameter at full precision (default setting for AdamW parameters and optimizer states). We can reduce memory by using an 8-bit AdamW optimizer. Additionally, we can adopt various sharding (e.g., ZeRO, FSDP, and more) or pipelining strategies if we have multiple GPUs or nodes available for training to reduce per-GPU memory consumption significantly.

The implementation also draws upon code from a prior PPO tutorial, as well as the implementation of GRPO in TRL.

Some open reasoning models like QwQ preceded the release of DeepSeek-R1.

The cost of training an LLM is dominated by pretraining. However, the cost of post-training can still be expensive, especially when human data annotation is considered; see here for more details. Therefore, the ratio of cost spent on post-training varies, but it would generally be <10% of the total LLM training cost.

See here for more info on the role of SFT in training reasoning models.

PPO for LLMs: A Guide for Normal People

Cameron R. Wolfe, Ph.D. — Mon, 27 Oct 2025 09:33:23 GMT

(from [4, 5, 8])

Over the last several years, reinforcement learning (RL) has been one of the most impactful areas of research for large language models (LLMs). Early research used RL to align LLMs to human preferences, and this initial work on applying RL to LLMs relied almost exclusively on Proximal Policy Optimization (PPO) [1]. This choice led PPO to become the default RL algorithm in LLM post-training for years—this is a long reign given the fast pace of LLM research! Only in recent work on LLM reasoning have researchers begun to use alternative algorithms like GRPO.

Despite its importance, PPO is poorly understood outside of top research labs. This lack of understanding is for good reason. Not only is PPO a complicated algorithm packed with nuanced implementation details, but its high compute and memory overhead make experimentation difficult without extensive compute resources. Successfully leveraging PPO requires both a deep understanding of the algorithm and substantial domain knowledge or practical experience.

This overview will begin with basic concepts in RL and develop a detailed understanding of PPO step-by-step. Building on this foundation, we will explain key practical considerations for using PPO, including pseudocode for PPO and its various components. Finally, we will tie all of this knowledge together by examining several seminal works that popularized PPO in the LLM domain.

Reinforcement Learning (RL) Preliminaries

Before learning more about PPO, we need to learn about RL in general. This section will cover basic problem setup and terminology for RL. Additionally, we will derive a simple policy gradient expression, which forms a basis for PPO.

Problem Setup and Terminology

When running RL training, we have an agent that takes actions within some environment; see below.

Basic problem setup for RL

These actions are predicted by a policy—we can think of the policy as the agent’s brain—that is usually parameterized. For example, the policy is the LLM itself in the context of training LLMs. We can model the probability of a given action under our policy as π_θ(a_t | s_t). When the policy outputs an action, the state of the environment will be updated according to a transition function, which is part of the environment. We will denote our transition function as P(s_t+1 | a_t, s_t). However, transition functions are less relevant for LLMs because they are typically a pass-through; i.e., we assume s_t = {x, a_1, a_2, …, a_t}, where x is the prompt.

Finally, each state visited by the agent receives a reward from the environment that may be positive, negative, or zero (i.e., no reward). As shown in the prior figure, our agent acts iteratively and each action (a_t), reward (r_t), and state (s_t) are associated with a time step t. Combining these time steps together yields a trajectory; see below. Here, we assume that the agent takes a total of T steps in the environment for this particular trajectory.

Using the chain rule of probabilities, we can also compute the probability of a full trajectory by combining the probabilities of:

Each action a_t given by our policy π_θ(a_t | s_t).
Each state s_t+1 given by the transition function P(s_t+1 | a_t, s_t).

The full expression for the probability of a trajectory is provided below.

Computing the probability of a trajectory

RL objective. When training a model with RL, our goal is to maximize the cumulative reward over the entire trajectory (i.e., the sum of r_t). However, there are a few variations of this objective that commonly appear. Specifically, the reward that we maximize can either be discounted or non-discounted1; see below. By incorporating a discount factor γ, we reward our policy for achieving rewards sooner rather than later. In other words, money now is better than money later.

Our objective is usually expressed as an expected cumulative reward, where the expectation is taken over the trajectory. Expanding this expectation yields a sum over trajectories weighted by their probabilities. We can formulate this in a continuous or discrete manner; see below.

State, value, and advantage functions. Related to RL objective, we can also define the following set of functions:

Value Function V(s): the expected cumulative reward when you start in state s and act according to your current policy π_θ.
Action-Value Function Q(s, a): the expected cumulative reward when you start in state s, take action a, then act according to your policy π_θ.
Advantage Function A(s, a): the difference between the action-value and value function; i.e., A(s, a) = Q(s, a) - V(s).

Intuitively, the advantage function tells us how useful some action a is by taking the difference between the expected reward after taking action a in state s and the general expected reward from state s. The advantage will be positive if the reward from action a is higher than expected and vice versa. Advantage functions play a huge role in RL research—they are used to compute the gradient for our policy.

“Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.” - Spinning up in Deep RL

RL Formulation for LLMs

RL terminology mapping for LLMs

Now that we understand RL basics, we need to map the terminology that we have learned to the setting of LLM training. We can do this as follows (shown above):

Our policy is the LLM itself.
Our initial state is the prompt.
The LLM’s output—either each token or the entire completion—is an action.
Our state is the combination of our prompt with the LLM’s output.
The entire completion from the LLM forms a trajectory.
The reward comes from a verifier or reward model (more details to follow).

Notably, there is no transition function in this setup because the transition function is completely deterministic. If we start with a prompt x and our LLM predicts tokens t_1 and t_2 given this prompt as input, then our updated state simply becomes s_2 = {x, t_1, t_2}. In other words, our state is just the running completion being generated by the LLM for a given prompt x.

MDP formulation. For LLMs, there are two key ways in which RL can be formulated that differ in how they model actions:

Bandit formulation: the entire completion or response from the LLM is modeled as a single action.
Markov Decision Process (MDP) formulation: each token within the LLM’s output is modeled as an individual action.

We outlined the details for both of these formulations in a prior overview. However, PPO relies upon the MDP formulation, so we will primarily focus upon the MDP formulation here. As we should recall, an LLM generates output via next token prediction; i.e., by generating each token in the output completion sequentially. This autoregressive process is depicted below.

Autoregressive next token prediction with an LLM

Next token prediction maps easily to an RL setup—we can model each token as an action! This setup is called the Markov Decision Process (MDP) formulation. An MDP is a probabilistic framework for modeling decision-making that includes states, actions, transition probabilities and rewards—this is exactly the setup we have discussed so far for RL! The MDP formulation used for RL is shown below.

When modeling RL as an MDP for LLMs, our initial state is the prompt and our policy acts by predicting individual tokens. Our LLM forms a (stochastic) policy that predicts a probability distribution over tokens. During generation, actions are taken by selecting a token from this distribution—each token is its own action. After a token is predicted, it is added to the current state and used by the LLM to predict the next token—this is just autoregressive next token prediction! Eventually, the LLM predicts a stop token (e.g., <|end_of_text|> or ) to complete the generation process, thus yielding a complete trajectory.

Policy Gradient Basics

During RL training, we want to maximize our objective—the cumulative (possibly discounted) reward. To accomplish this, we can just use gradient ascent; see below.

Solving the RL objective with gradient ascent

To put this in the context of LLMs, RL training follows the sequence of steps shown below. We first sample a batch of prompts and generate completions to these prompts with our LLM or policy. Then, we compute the rewards for these completions (more details to follow in later sections) and use these rewards to derive a policy update. This final policy update step is where gradient ascent is used.

Key steps in RL training for LLMs

To be more specific, we use the completions and rewards to estimate the gradient of the RL training objective with respect to the parameters of our policy—this is called the “policy gradient”. If we can compute this gradient, then we can train our policy using gradient ascent. But, the question is: How do we compute this gradient?

“The goal of reinforcement learning is to find an optimal behavior strategy for the agent to obtain optimal rewards. The policy gradient methods target at modeling and optimizing the policy directly.” - Lilian Weng

Policy gradients. Nearly all RL optimizers used for LLM training (e.g., PPO [1], GRPO, and REINFORCE) are policy gradient algorithms, which operate by i) estimating the policy gradient and ii) performing gradient ascent with this estimate. These algorithms use different approaches for estimating the policy gradient, but the high-level idea behind all of them is quite similar—we just tweak small details depending on the exact technique being used. To understand policy gradient algorithms more deeply, we will first derive the simplest form of a policy gradient. Then, we will extend this idea to recover more intricate policy gradient algorithms like Trust Region Policy Optimization (TRPO) [6] and PPO [1].

The Vanilla Policy Gradient (VPG) has been extensively covered by many online resources. Other useful explanations of the VPG include:

Intro to Policy Optimization from OpenAI [link]
RLHF Book from Nathan Lambert [link]
Policy Optimization Algorithms from Lilian Weng [link]
Policy Gradient Algorithms from this blog2 [link]

However, we will again derive some simple forms of the policy gradient here for completeness. As we already know, our goal in RL is to maximize cumulative rewards. If we try to compute the gradient of this objective with respect to the parameters of our policy θ, we can derive the following:

(source)

This derivation starts with the gradient of our RL training objective (cumulative reward) and ends with a basic expression for the policy gradient. The steps used in this derivation are enumerated above. The only complicated steps here are the use of the log-derivative trick and the final step, which leverages our definition for the probability of a trajectory. In the final step, we substitute in our definition for the probability of a trajectory and observe that the gradients of the initial state probability and transition function with respect to the policy parameters are always zero because neither of them depend on the policy; see below.

(source)

Implementing a basic policy gradient. The basic policy gradient expression we have derived so far is theoretical—it involves an expectation. If we want to actually compute this gradient in practice, we must approximate it with a sample mean. In other words, we sample a fixed number of trajectories—or prompts and completions in the case of an LLM—and take an average over the policy gradient expression for each of these trajectories. The basic policy gradient expression contains two key quantities that we already know how to compute:

The reward comes directly from a verifier or reward model.
Log probabilities of actions can be computed with our LLM (i.e., these are just the token probabilities from the LLM’s output).

To make the process of computing the basic policy gradient more concrete, a step-by-step implementation in PyTorch pseudocode has been provided below.

One key detail that we should notice in the above implementation is that we do not compute the policy gradient directly. Rather, we formulate a loss function for which the gradient is equal to the policy gradient then use autodiff in PyTorch to compute the policy gradient—this happens during loss.backward(). The exact loss function used to compute the policy gradient is shown below.

Creating a loss function for the policy gradient

This distinction is important to understand because we will formulate PPO (and TRPO!) via a loss function rather than a direct expression for the policy gradient.

Problems with the basic policy gradient. The basic policy gradient expression is straightforward, but it suffers from several notable issues:

High Variance: The gradient estimates can have high variance, making training unstable.
Unstable Policy Updates: There is no mechanism to prevent large, potentially destabilizing updates to the policy.

Due to the high variance, accurately estimating the policy gradient often requires sampling many trajectories per training iteration, which is computationally expensive. We must generate many completions with the LLM and compute the rewards and token log probabilities for all of these completions.

Additionally, this high variance increases the risk of training instability—large and inaccurate updates could potentially cause significant harm to our policy. To solve these issues, most policy gradient algorithms focus on reducing the variance of policy gradient estimates and enforcing a trust region on policy updates (i.e., limiting how much the policy can change in a single update).

“Taking a step with this gradient pushes up the log-probabilities of each action in proportion to R(𝜏), the sum of all rewards ever obtained.” - Spinning up in Deep RL

Reward-to-go. For example, we see in our basic policy gradient (copied below for reference) that we are increasing the probability of a given action based upon the cumulative reward of a trajectory. Therefore, we may increase the probability of an action due to rewards that were observed before the action even occurred!

Basic policy gradient expression

This simple observation led to the creation of the “reward-to-go” policy gradient; see below. This modified policy gradient expression just replaces the cumulative reward with the sum of rewards observed after an action. Using the EGLP lemma, we can show that this reward-to-go formulation is an unbiased estimator of the policy gradient. Additionally, the reward-to-go policy gradient has provably lower variance compared to the basic policy gradient expression from before.

The reward-to-go policy gradient

Baselines. To further reduce variance, we can also add a baseline to our policy gradient expression; see below. Similarly to the reward-to-go policy gradient, we can use the EGLP lemma to show that a baselined version of our policy gradient is unbiased and has lower variance. Due to the EGLP lemma, this baseline must only depend upon the current state (i.e., otherwise an assumption of the EGLP lemma is violated and the proofs are no longer valid).

Adding a baseline to our policy gradient expression

This expression is nearly identical to the reward-to-go policy gradient—we just subtract an additional baseline from the reward-to-go term. There are many possible choices for baselines that can be used in policy gradient estimates. One common baseline is the value function. Using the value function as a baseline positively reinforces actions that achieve a cumulative reward that is higher than expected.

A common problem with vanilla policy gradient algorithms is the high variance in gradient updates… In order to alleviate this, various techniques are used to normalize the value estimation, called baselines. Baselines accomplish this in multiple ways, effectively normalizing by the value of the state relative to the downstream action (e.g. in the case of Advantage, which is the difference between the Q value and the value). The simplest baselines are averages over the batch of rewards or a moving average. - RLHF book

Generic policy gradient. In [3], the options for computing the policy gradient were summarized with a more generic policy gradient expression; see below.

(from [3])

This expression is nearly identical to expressions we have seen so far. The only difference is that we have changed our reward term R(𝜏) to a generic Ψ_t term, which can be set equal to several different expressions. For example, we can:

Set Ψ_t = R(𝜏) to recover our basic policy gradient expression.
Set Ψ_t equal to rewards received after time t to recover our reward-to-go variant of the policy gradient.
Set Ψ_t equal to a baselined version of the reward; e.g., the difference between cumulative reward R(𝜏) and the value function V(s_t).
Set Ψ_t equal to the state-action (Q) or advantage function (A).

Despite the many possible formulations, PPO—and nearly all of the RL optimizers used in the domain of LLMs—focuses upon setting Ψ_t equal to the advantage function A(s_t, a_t). This setting is referred to as the vanilla policy gradient (VPG); see below. In theory, the VPG yields the lowest-variance gradient estimate.

The vanilla policy gradient

Although the VPG has low variance, there is still no mechanism to enforce a trust region in the policy update—a large and destructive policy update can still destabilize the training process. PPO was created as a solution to this problem. As we will see, PPO resembles the basic policy gradient expressions we have seen but has added mechanisms for enforcing a trust region on the policy update. We will now learn more about PPO and the many practical details involved in its implementation.

Proximal Policy Optimization (PPO)

Now that we understand RL basics, we will spend the next section learning about Proximal Policy Optimization (PPO) [1]. This explanation will build upon the VPG expression that we derived in the last section, beginning with Trust Region Policy Optimization (TRPO) [6]—a predecessor to PPO. TRPO is effective at stabilizing training, but it is also relatively complex. PPO was developed as a more practical alternative with similar benefits. To conclude the section, we will also cover Generalized Advantage Estimation (GAE) [3], which is the most common approach for computing the advantage function in PPO.

Trust Region Policy Optimization (TRPO) [6]

“TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems—or even within a single problem, where the characteristics change over the course of learning.” - from [1]

Prior to learning about PPO, we need to take a look at its predecessor, Trust Region Policy Optimization (TRPO) [6]. The key motivation behind TRPO is creating an algorithm that is data efficient and does not require too much hyperparameter tuning. To do this, authors in [6] propose the constrained objective below, which is guaranteed to monotonically improve our policy. This objective enforces a trust region on the policy update, thus eliminating the risk of large and destructive policy updates that could destabilize training.

Surrogate objective for TRPO (from [1])

Surrogate objective. This objective shown above is called the surrogate objective in TRPO. This naming stems from the fact that the surrogate objective is different from the standard RL training objective. In RL, we aim to maximize cumulative reward, but—as we have seen in our discussion of the VPG—directly maximizing this “true” objective of RL can lead to training instability. TRPO formulates the surrogate objective to maximize in place of the true objective.

There are a few noticeable differences between the above expression for TRPO and the VPG:

Action probabilities in the current policy are normalized by the probability of that action in the old policy (i.e., the policy prior to training)—this forms the policy ratio (also called an importance ratio). We also use probabilities in this formulation instead of log probabilities.
There is a constraint placed on the objective to ensure that the expected KL divergence between the new and old policies is less than a threshold δ.

Otherwise, the TRPO loss function shares a similar structure to that of VPG—it includes the advantage function and a sum over token-level probabilities in a trajectory.

Policy ratio. The centerpiece of the TRPO loss function is the policy ratio, defined as shown below. The policy ratio tells us how much more likely a given action is in our current policy relative to the probability of that action before the training process started—this is denoted as the “old” policy.

The policy (or importance) ratio

This quantity serves the purpose of assigning an importance to different actions within our trajectory. If the new policy assigns a higher probability to an action than the old policy did, this ratio is greater than one, increasing the influence of that action’s advantage in the objective. Conversely, if the new policy assigns a lower probability, the ratio is less than one, reducing the influence of that action. The policy ratio ensures that the policy update emphasizes actions that the new policy is making more likely—especially if those actions have high advantage—while suppressing actions that are becoming less likely under the new policy. By doing this, we ensure that the update is properly weighted according to how the new policy differs from the old, enabling stable and efficient policy improvement.

Solving the surrogate objective. Although this objective yields stable policy updates, solving it can be quite involved. By introducing an explicit constraint into our objective, we eliminate the ability to solve this objective with simple gradient ascent3. Instead, we have to solve this objective via the more complex conjugate gradient algorithm. Alternatively, we could remove this constraint and instead add the KL divergence as a penalty into our loss function; see below. This unconstrained loss is simpler and can again be solved with basic gradient ascent.

The penalty objective for TRPO

From TRPO to PPO. Formulating the constraint from TRPO as a penalty allows us to avoid complicated optimization techniques and rely upon basic gradient ascent. However, a new hyperparameter β is introduced to the optimization process that makes tuning difficult. Properly setting the value of β is essential for this objective to perform well, and finding a single value of β that generalizes to many domains is hard. As a result, both of the above objectives have their issues:

The TRPO surrogate objective is too complex to solve in practice.
The reformulated penalty objective is sensitive to the setting of β.

We want to develop an algorithm that retains the benefits of TRPO—such as stability, data efficiency, and reliability—while avoiding its complexity. Ideally, the algorithm should be broadly applicable and solvable using basic gradient ascent. These goals led to the proposal of PPO, which is largely inspired by TRPO. PPO’s objective is inspired by the TRPO surrogate objective but replaces the hard KL constraint with a clipping mechanism to enforce a trust region in a simpler way.

Proximal Policy Optimization Algorithms [1]

“We propose a new family of policy gradient methods for RL, which alternate between sampling data through interaction with the environment, and optimizing a surrogate objective function using stochastic gradient ascent.” - from [1]

The VPG is simple to compute in practice, but it has poor data efficiency (i.e., the model must be trained over many samples to perform well) and high variance in the policy updates. These problems are largely solved by TRPO but at the cost of significant added complexity. PPO is an algorithm with the data efficiency and reliability benefits of TRPO that is still solvable with gradient ascent. In this way, PPO is a simpler algorithm compared to TRPO. As we will see, however, PPO is still a complex algorithm with many implementation complexities of its own.

Update procedure in PPO (from [1])

Training process. Similarly to TRPO, PPO focuses upon optimizing a surrogate objective, but the objective in PPO has no constraint and has been slightly modified. As shown in the algorithm above, PPO performs more than a single policy update in each step, instead alternating between:

Sampling new data or trajectories from the policy.
Performing several epochs of optimization on the sampled data.

The PPO surrogate objective is again based upon the policy ratio between the current policy and the old model (i.e., the policy before any training is performed). To match notation in [1], we will denote the policy ratio as r_t(θ), which is similar to the r_t notation used for the reward for time step t. However, the policy ratio is unrelated to the reward! To obtain the PPO objective, we start with the surrogate objective being maximized by TRPO with no KL constraint; see below.

The unclipped PPO objective

We will call this formulation the “unclipped” objective. Because it does not have a constraint, this objective can be easily computed to derive the policy gradient by i) estimating the advantage and ii) computing the policy ratio. However, if we try to maximize this unconstrained objective, this will potentially lead to large and destructive policy gradient updates that make the training process unstable. To solve this issue, PPO introduces a novel clipping mechanism into the surrogate objective that helps us with maintaining the trust region; see below.

The PPO surrogate objective

The main term in the objective is unchanged, but there is an added term with a clipped version of the policy ratio—the policy ratio must fall in the range [1 - ε, 1 + ε]4. The clipping term disincentivizes the RL training process from moving the policy ratio away from a value of one. The PPO surrogate objective takes the minimum of clipped and unclipped objectives. In this way, the PPO objective is a pessimistic (lower) bound for the original, unclipped objective5.

(from [1])

Depending upon whether the advantage is positive or negative, the behavior of clipping is slightly different; see above. The use of a minimum in the surrogate objective causes clipping to be applied in only one direction. In particular, we can arbitrarily decrease surrogate objective by moving the policy ratio far away from a value of one, but clipping prevents arbitrarily increasing the objective via the policy ratio. In this way, PPO de-incentivize large policy ratios so that our policy does not deviate too much from the old policy after training updates.

“With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.” - from [1]

To more deeply understand the clipping logic of PPO, we can consider each of the four possible cases that can arise when optimizing the surrogate objective:

Case #1 [A > 0, r_t(θ) ≤ 1 + ε]: advantage is positive—this is an action that we want to reinforce. Our policy ratio is below 1 + ε, so we perform a normal policy gradient update to increase the probability of this action.
Case #2 [A > 0, r_t(θ) > 1 + ε]: advantage is positive again, but our policy ratio is greater than 1 + ε. This means that this action is already more likely in the new policy relative to the old policy. The objective gets clipped, and the gradient with respect to further increases in the policy ratio is zero. This prevents the policy from making the action even more likely
Case #3 [A < 0, r_t(θ) ≥ 1 - ε]: advantage is negative—this is an action we want to negatively reinforce (i.e., decrease probability). Our policy ratio is above 1 - ε, so we perform a normal policy gradient update to decrease the probability of this action.
Case #4 [A < 0, r_t(θ) < 1 - ε]: advantage is negative again, but our policy ratio is less than 1 - ε. This means that this action is already less likely in the new policy relative to the old policy. The objective gets clipped, and the gradient with respect to further decreases in the policy ratio is zero. This prevents the policy from making the action even less likely.

The policy ratio is computed between the current and old policies. The old policy is updated to match the current policy each time new data is sampled in PPO. In the context of LLMs, we perform 2-4 gradient updates (or sometimes more) [2] for each batch of data, so the old model is updated frequently. The clipping operation in PPO, therefore, maintains a trust region for a particular batch of data.

KL divergence. When training LLMs with PPO, we usually incorporate the KL divergence between the current policy and a reference policy—usually some policy from before RL training begins (e.g., the SFT model)—into the training process. This added KL divergence term penalizes the policy from becoming too different from the reference policy, which has a regularizing effect. We compute KL divergence per token by comparing the token probability distributions outputted by the two LLMs for each token within the sequence. Details on how exactly the KL divergence is computed in practice can be found here.

Incorporating KL divergence into the reward

There are two common ways of adding the KL divergence into PPO training. First, we can directly subtract the KL divergence from the reward in RL; see above. Alternatively, we can add the KL divergence as a penalty term to the RL training objective as shown below. In both cases, we simply want to maximize rewards without making our new policy too different from the reference.

Incorporating a KL penalty into the RL training objective

Such a KL divergence term is almost universally used in RL training for LLMs, though the exact implementation varies. Both of the approaches outlined above have been used successfully. However, capturing the KL divergence via a penalty term in the training objective is probably more common (and a bit simpler).

The critic. Recall that the advantage function is defined as the difference between the state-action value function and the value function. In PPO, we estimate the state-action value function—the expected reward for taking a specific action in a given state—by using the actual reward observed for a trajectory. The value function, in contrast, is typically estimated using a learned model; see below.

For example, we can create a separate copy of our policy, or—for better parameter efficiency—add a dedicated value head that shares weights with the policy to predict the value function. This learned value function is often referred to as a value model or critic. Taking a partial response as input, the critic predicts the expected final reward for every token position within the sequence; see below.

Critic versus reward model. In the context of LLMs, we predict the reward with a reward model. Additionally, most LLMs are trained using outcome supervision, meaning that a reward is only assigned after the model has generated a complete response (i.e., after the token has been outputted). The critic and reward model are similar in that they are both learned models—usually another copy of our LLM policy—that predict rewards. However, the critic predicts expected rewards given a partial completion as input, while the reward model typically predicts the reward received by an entire response; see below. Going further, the reward model is fixed throughout RL training, while the critic is continually updated.

Value model versus reward model

Critic training. The value function is on-policy—it is dependent upon the current parameters of our policy. Unlike reward models which are fixed at the beginning of RL training, the critic is trained alongside the LLM in each policy update to ensure that its predictions remain on-policy—this is called an actor-critic setup6. This is accomplished by adding an extra mean-squared error (MSE) loss—between the rewards predicted by the critic and actual rewards—to the surrogate loss.

PPO implementation. To make each of these ideas more complete, we have implemented PPO in PyTorch pseudocode below. In this implementation, we see several of the key ideas we have discussed so far, such as:

Computing the KL divergence between the current policy and a reference model, then directly subtracting this KL divergence from our reward.
Using a learned critic to compute the advantage (and training this critic via an MSE loss alongside the policy itself).
Computing the policy ratio with respect to the old model. The script below performs a single policy update, but PPO usually performs several (i.e., 2-4 in the case of LLMs [2]) policy updates for each batch of data. The “old” model in the policy ratio is the model from before the first update for a batch.
Computing the full (clipped) PPO loss. We take the negative of this loss because PyTorch performs gradient descent (not ascent) by default.
Aggregating or averaging the token-level PPO loss across a batch of sequences. There are many ways to aggregate the loss in a batch, and the approach used can significantly impact results [2]7.

One interesting detail we see here is that—despite the PPO loss using token probabilities and not log probabilities—we choose to work with token log probabilities and exponentiate them instead of using raw probabilities when computing the policy ratio. This is a commonly-used numerical stability trick.

import torch
import torch.nn.functional as F

# constants
kl_beta = 0.1
critic_weight = 0.5
ppo_eps = 0.2

# sample prompt completions and rewards
with torch.no_grad():
    completions = LLM.generate(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G, 1)

# create a padding mask from lengths of completions in batch
completion_mask = <... mask out padding tokens ...>

# compute value function / critic output
values = CRITIC(completions)  # (B*G, L) - predicted reward per token!

# get policy logprobs for each action
llm_out = LLM(completions)
per_token_logps = F.log_softmax(llm_out, dim=-1)  # (B*G, L)

# get reference logprobs for each action
ref_out = REF(completions)
ref_per_token_logps = F.log_softmax(ref_out, dim=-1)  # (B*G, L)

# compute KL divergence between policy and reference policy
kl_div = per_token_logps - ref_per_token_logps

# directly subtract KL divergence from rewards
# NOTE: KL div is per token, so reward becomes per token and reward
# for all tokens (besides last token) is just kl divergence.
# Reward for last token is sum of outcome reward and KL div.
rewards -= kl_beta * kl_div # (B*G, L)

# compute the advantage - simple approach
advantage = rewards - values.detach()  # (B*G, L)

# compute the policy ratio
# NOTE: old_per_token_logps must be persisted during first policy
# update for this batch of data and re-used in each subsequent update
policy_ratio = torch.exp(
    per_token_logps - old_per_token_logps,
)  # (B*G, L)
clip_policy_ratio = torch.clamp(
    policy_ratio,
    min=1.0 - ppo_eps,
    max=1.0 + ppo_eps,
)

# compute the ppo loss
ppo_loss = torch.min(
    advantage * policy_ratio,
    advantage * clip_policy_ratio,
)  # (B*G, L)
ppo_loss = -ppo_loss

# combine ppo loss and critic mse loss
critic_loss = ((rewards - values) ** 2)  # (B*G, L)
loss = ppo_loss + critic_weight * critic_loss

# aggregate the loss across tokens (many options exist here)
loss = ((loss * completion_mask).sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# perform policy gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()

Experiments. The LLM setting is not considered in [1], as PPO was proposed during the heyday of DeepRL—well before the proliferation of LLMs. Understanding the experimental results in [1] is nonetheless useful for gaining intuition on the mechanics of PPO. In these experiments, PPO is used to train fully-connected multi-layer perceptrons (MLPs) from scratch on a variety of robotics and video game tasks. The policy and critic are kept separate (i.e., no parameter sharing).

First, authors use several simulated robotics tasks from the OpenAI Gym to test different formulations of the surrogate loss in PPO:

The clipped objective (standard for PPO).
The unclipped objective.
The unclipped objective with (adaptive8) KL divergence.

Unlike the typical RL training setup for LLMs, these experiments compute the KL divergence between the current policy and the old model, with the goal of testing whether this approach works better than the standard PPO clipping mechanism. Ordinarily, when training LLMs with PPO, the KL divergence is computed between the current policy and a reference model (e.g., the SFT model), not the old model9. However, in these experiments, using a reference model for the KL divergence is not possible because we are training models from scratch—there is no pretrained model to serve as a reference.

The results from testing these different objectives are outlined below—the clipped objective for PPO stabilizes training and clearly outperforms the other options.

(from [1])

PPO is also tested on 49 games in the Atari gameplay domain and compared to strong baseline RL algorithms like A2C and ACER. Performance is measured based on two metrics:

Average reward throughout training (favors faster learning).
Average reward over the last 100 training steps (favors final quality / reward).

For each of these metrics, we compute a “win rate”, which captures the number of times each algorithm achieves the top score across all Atari games. The results of these experiments are shown below, where we see that baseline algorithms like ACER perform similarly to or better than PPO but learn much slower. PPO stabilizes training, performs well, and yields an improvement in sample complexity10.

(from [1])

Generalized Advantage Estimation (GAE) [3]

The advantage tells us how much better a given action is compared to the average action in a given state: A(s_t, a_t) = Q(s_t, a_t) - V(s_t). The value function in this formulation is estimated by our critic, but we have not yet discussed in detail how the advantage function can be computed. In PPO, the advantage function is estimated on a per-token (or action) basis. There are two main approaches that can be used to compute the advantage, and these approaches form the basis for most other techniques.

(1) Monte Carlo (MC). An MC estimate of the advantage relies upon the actual reward observed for the full trajectory. Namely, the advantage is computed as the difference between the cumulative reward for the full trajectory R(s_t)11 and the value function for the current state V(s_t), as predicted by the critic.

So far, our discussions of PPO have assumed an MC approach for estimating the advantage. The MC estimate has low bias because it relies on the actual reward observed for the trajectory (exact information), but MC estimates also have high variance. Therefore, we need to take many samples and make a sufficient number of observations to yield an accurate advantage estimate—this can be expensive.

(2) Temporal Difference (TD). The TD residual uses per-token value predictions from the critic to form a one-step estimate of the advantage, as shown below.

Temporal difference (TD) residual

This TD residual analyzes how much the expected reward changes after predicting a single token and observing the actual reward for that action12. We subtract the value for the current state V(s_t) from the sum of:

The observed reward for the current state r_t.
The (discounted) value of the next state V(s_{t+1}).

Similarly to V(s_t), the sum of these two terms captures the expected return at state s_t. However, the reward for the current state is captured via the actual observed reward r_t rather than being estimated by the critic. Therefore, the difference between these terms is capturing how much better the actual reward observed at state s_t is than expected—this is the advantage!

By using the actual reward r_t, we incorporate some exact information into our advantage estimate—the terms in the estimate come partly from our critic and partly from real rewards. Using such token-level rewards to estimate the advantage lowers the variance of the policy gradient. If our value function were exact, then the TD residual would also form an unbiased advantage estimate. Unfortunately, we do not have access to the ground truth value function, so we train a critic to estimate the value function13. Because accurately anticipating final rewards from a partial response is difficult, the TD residual is biased.

N-step estimators. The TD residual analyzes the difference between actual and expected reward for a single step. However, we can generalize this idea to capture any number of steps. As shown below, an N-step advantage estimator has a similar structure to the TD residual, but it incorporates real rewards for N states, where N can be greater than one.

N-step advantage estimators

Taking this further, we can even recover an MC estimate by setting N equal to the total number of steps in the trajectory! This setting of N simply yields the difference between cumulative reward and the value of the current state V(s_t). Therefore, different settings of N yield different tradeoffs in bias and variance, spanning all the way from the single-step TD residual (high bias, low variance) to an MC estimate (high variance, low bias).

“GAE is an alternate method to compute the advantage for policy gradient algorithms that better balances the bias-variance tradeoff. Traditional single-step advantage estimates can introduce too much bias, while using complete trajectories often suffer from high variance. GAE works by combining two ideas – multi-step prediction and weighted running average (or just one of these).” - from [2]

Generalized Advantage Estimation (GAE), which is the most commonly-used approach for estimating the advantage with PPO, makes use of N-step advantage estimates. Instead of choosing a single value of N, however, GAE uses all values of N by taking an average of N-step advantage estimates with different values of N. This is done by introducing a mixing parameter λ for GAE as shown below14.

GAE formulation

In this formulation, setting λ = 0 yields a single-step TD residual because only the first term in the sum receives a non-zero weight. Additionally, a setting of λ = 1 recovers the MC estimate. To see this, we can expand the definition of each TD residual in the sum, yielding the difference in cumulative discounted rewards and the value function of the current state V(s_t); see below.

The benefit of GAE is that the value of λ ∈ [0, 1] controls the bias variance tradeoff. As we increase the value of λ, more exact reward information is used in the advantage estimate, thus lowering the bias (but increasing variance). Similarly, we can use lower values of λ to reduce variance at the cost of higher bias.

Outcome rewards. When we are working with LLMs, we usually use an outcome reward setup, which simplifies GAE. The reward is always zero15, unless we are at the final step of the trajectory. In this scenario, most of the TD residual terms in our GAE summation are simply the difference in (discounted) value functions between two time steps γV(s_{t + 1}) - V(s_t). The final term in the summation contains the actual outcome reward observed for the trajectory.

GAE implementation. To make the concept of GAE more concrete, let’s examine a real-world example adapted from AI2’s OpenInstruct library. The full PPO training script, available here, is a great resource for learning the details of PPO in a production-grade training setting. The GAE component of this script is shown below with some additional comments for clarity. We can efficiently compute the GAE recursion by iterating through the sequence in reverse order.

import torch

# store advantages in reverse order while iterating thru sequence
advantages_reversed = []

# iterate backward to compute GAE recursion
lastgaelam = 0
gen_length = responses.shape[1]
for t in reversed(range(gen_length)):
    if t < gen_length - 1:
        # get value model prediction for time t + 1
        nextvalues = values[:, t + 1]
    else:
        # no values predicted beyond end of sequence
        nextvalues = 0.0

    # compute TD residual at time t    
    delta = rewards[:, t] + gamma * nextvalues - values[:, t]

    # add to the discounted sum of TD residuals for GAE    
    lastgaelam = delta + gamma * lam * lastgaelam

    # store the advantage for step t in our list
    advantages_reversed.append(lastgaelam)

# put the list of advantages in the correct order
advantages = torch.stack(advantages_reversed[::-1], axis=1)

Using PPO for LLMs

(from [7])

There are two different types of RL training that are commonly used to train LLMs (shown above):

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a human preference reward model.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.

These RL training techniques differ mainly in how they derive the reward for training, but other details of the algorithms are mostly similar. As depicted below, they both operate by generating completions over a set of prompts, computing the reward for these completions, and using the rewards to derive a policy update—or an update to the LLM’s parameters—with an RL optimizer (e.g., PPO).

Visual walkthrough of RL training for LLMs

RLHF was the original form of RL explored by LLMs like InstructGPT [8], the predecessor to ChatGPT. Early research on RLHF for LLMs used PPO as the default RL optimizer, which ultimately made PPO a standard choice for training LLMs with RL. RLVR was introduced more recently, and most works in this space use GRPO as the underlying RL optimizer instead of PPO.

“PPO has been positioned as the canonical method for RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning.” - from [9]

Downsides of PPO. Though it quickly became the default RL optimizer for RLHF, PPO is a complex actor-critic algorithm with high compute and memory overhead, as well as many low-level implementation complexities. The memory overhead of PPO is high because we keep four copies of the LLM in memory:

The policy.
The reference policy.
The critic.
The reward model (if we are using a reward model).

Additionally, we are updating the parameters of our critic alongside the policy itself and running inference for all of these models simultaneously, leading to high compute costs. Beyond memory and compute overhead, there are also many implementation details that we must carefully consider during PPO training:

How do we initialize the critic and reward model? What training settings should we adopt for these models?
What value of ε should we use for clipping in PPO?
Which model should we use as our reference model for the KL divergence?
How many policy updates should we perform for a batch of data?
Do we add the KL divergence as a penalty to the loss or directly incorporate it into the reward function? What scaling factor β should we use?
How should we weight the critic’s loss relative to the main PPO loss?
Should we use GAE? What setting should we use for λ?

Each of these choices may impact the results of RL training! PPO is a sensitive algorithm that is prone to instability—we may spend a lot of compute and time on training a model that ultimately performs poorly due to an incorrect hyperparameter setting. For these reasons, simpler RL algorithms like REINFORCE and GRPO—or even RL-free techniques like DPO—have become popular alternatives to PPO.

PPO for LLMs. In this final section, we will take what we have learned and study PPO specifically in the context of LLM training. We will focus particularly on the foundational works that were the first to use PPO for training LLMs [5, 8]—this research laid the groundwork for the modern LLM boom shortly after. While studying these papers, we will emphasize implementation details and practical lessons that are necessary to obtain a working PPO implementation.

Learning to Summarize from Human Feedback [5]

Abstractive summarization—or using models to create a human-readable, concise summary of a piece of text—has been studied for a long time. Prior to the rise of LLMs and RLHF, most papers on this topic trained language models using a supervised learning approach with human-written reference summaries and evaluated these models using traditional metrics like the ROUGE score.

These approaches can work well, but supervised learning and ROUGE are both proxies for what is actually desired—a model that writes high-quality summaries. In [5], authors solve this problem by replacing supervised learning with RLHF. Such an approach allows us to finetune language models to produce better summaries by directly using human feedback on model outputs as a training signal.

PPO for summarization. Authors in [5] are commonly credited with proposing the first RLHF framework for LLM finetuning. The proposed approach allows us to optimize an LLM based on the quality of its responses, as assessed by human annotators. Beginning with a pretrained LLM, we can iteratively:

Collect human preference data.
Train a reward model over this preference data.
Finetune our LLM with RL using this reward model.

Notably, authors in [5] adopt PPO as their underlying RL optimizer, which led PPO to become the common choice in subsequent RLHF research. With this RL training strategy, we can train an LLM to produce summaries that surpass the quality of human summaries and are even better than those produced by larger LLMs trained with a supervised learning approach; see below.

(from [5])

SFT stage. In [5], the LLM is first trained using supervised finetuning over human reference summaries for a single epoch, producing a supervised baseline that is later finetuned via RLHF. The methodology for RLHF proposed in [5]—as illustrated in the figure shown below—is tailored to the summarization task.

(from [5])

Preferences and reward models. In [5], a preference dataset is constructed by:

Grabbing a textual input to summarize—this is our prompt.
Producing many summaries of the input using several different policies—these are different responses to the same prompt.
Sampling two summaries or responses for the prompt.
Asking a human annotator to identify the better of the two summaries.

Authors in [5] collect this preference data in large batches. Once we have finished collecting a new batch of preference data, we train a reward model on the data such that it accurately predicts human preference scores given an LLM-generated summary. Then, we use this reward model to finetune our policy with PPO.

A KL divergence term is used for PPO in [5] to minimize divergence from the SFT model. Interestingly, authors in [5] were not the first to use this strategy—it was actually adopted from prior work. The KL divergence is directly subtracted from the rewards instead of being added to the PPO loss as a penalty term. We see in [5] that adding the KL divergence into RL training helps to prevent the model’s summaries from becoming too different from those seen during training.

(from [5])

Experiments. In [5], large pretrained models matching the style of GPT-3 with 1.3B to 6.7B parameters are finetuned over the TL;DR dataset. This dataset, which contains over three million posts from Reddit with author-written summaries, is filtered to only 120K high-quality examples; see above. Models are first trained using SFT—these supervised models are also used as baselines across experiments—and then further finetuned with RLHF. Given that summary length can impact the resulting quality score, the authors in [5] constrain generated summaries to 48 tokens and finetune the model accordingly.

Finetuning language models with human feedback outperforms a variety of strong English summarization baselines. Notably, the 1.3B summarization model outperforms a 10× larger model trained with SFT, and the 6.7B summarization model performs even better than the 1.3B model, revealing that summarization quality improves with model scale. Furthermore, we see that summarization models trained via RLHF generalize better to new domains. In particular, the models in [5] are applied to summarizing news articles—a domain outside of the training data—and found to perform well without further finetuning; see below.

(from [5])

From here, summarization models are evaluated in terms of:

Coverage: the summary covers all information from the original post.
Accuracy: statements in the summary are accurate.
Coherence: the summary is easy to read on its own.
Quality: the overall quality of the summary is good.

When evaluated in this manner, we see that summarization models trained via RLHF benefit the most in terms of coverage, while coherence and accuracy are only slightly improved compared to supervised baseline models; see below.

(from [5])

Beyond summarization. Although RLHF was explored only in the context of summarization in [5], the authors of this paper had an incredible amount of foresight about what was to come. The approach proposed in [5] later became a standard part of LLM post-training, as we will soon see with InstructGPT [8].

“The methods we present in this paper are motivated in part by longer-term concerns about the misalignment of AI systems with what humans want them to do. When misaligned summarization models make up facts, their mistakes are fairly low-risk and easy to spot. However, as AI systems become more powerful and are given increasingly important tasks, the mistakes they make will likely become more subtle and safety-critical, making this an important area for further research.” - from [1]

Interestingly, authors in [5] explicitly state their intent to leverage the proposed methodology to better align LLMs to human desires in the long term. This statement was made over two years prior to the proposal of ChatGPT! Work in [5] was a building block for major advancements in AI that were yet to come.

The N+ Implementation Details of RLHF with PPO [4]

(from [4])

There are many moving parts in PPO training, including multiple copies of the LLM (i.e., policy, reference, critic, and reward model) and various hyperparameter settings that must be carefully tuned to ensure stable training. For these reasons—and due to computational expense—reproducing RL training results is difficult.

“It has proven challenging to reproduce OpenAI’s RLHF pipeline… for several reasons: 1) RL and RLHF have many subtle implementation details that can significantly impact training stability, 2) the models are challenging to evaluate… 3) they take a long time to train and iterate.” - from [4]

As a starting point for democratizing understanding of RL, authors in [4] focus on a simple setup—OpenAI’s prior work on RLHF for summarization [5]. Though many details are already provided in the original work, authors in [4] fully reproduce these results while enumerating all implementation details needed to arrive at a working PPO implementation. The TL;DR summarization task is simple relative to most modern RLHF pipelines. However, this study—based on Pythia models [10] with 1B, 2.8B, and 6.8B parameters—provides a clear and comprehensive view of key practical considerations when training an LLM with PPO.

Dataset considerations. Authors in [4] enumerate around 20 practical details needed to obtain a working RLHF pipeline with PPO. Nearly half of these details are not related to PPO—they focus on the training data. For those who have worked with LLMs, this data emphasis should not come as a surprise: data quality is the key determinant of success in all forms of LLM training, including RL.

All experiments in [4] use the TL;DR summarization dataset from OpenAI, which contains both an SFT and preference dataset. Some notable remarks about the data used for PPO in [4] include:

There is a misalignment in completion lengths between the SFT and preference portion of the TL;DR dataset—the preference data tends to have longer completions.
Data must occasionally be truncated to fit within the fixed sequence length used in [4], but the authors choose to truncate at paragraph boundaries—determined by newline characters—instead of performing a hard truncation at the maximum sequence length.
All completions are followed by an token. Authors in [4] emphasize that this token must be different than the padding token used by the LLM. Otherwise, the loss for the token will be masked with the other padding tokens, preventing the model from learning to properly complete each sequence with an token.

Reward model. Several choices exist for initializing the reward model in RLHF. In [4], we initialize with the weights of the SFT model, which matches settings used in [5]. A randomly-initialized linear head that is used to predict the reward is then added to the reward model’s architecture before the model is trained for a single epoch over the available preference data.

An outcome reward setting is used in [4]. To extract the reward, a forward pass is performed on the full sequence, and we extract the reward prediction from the token only. To teach the policy to consistently output sequences of reasonable length with a corresponding token, the EOS trick is used, which assigns a reward of -1 to any sequence with no token.

“If the padding token does not exist, the extracted reward will then be logits corresponding to the last token of the sequence – if that token is not the EOS token, its reward won’t be used for PPO training” - from [4]

After the reward model is trained, authors follow the recommendation in [5] of normalizing rewards outputted by the model. Specifically, the reward model is used to predict rewards for the entire SFT dataset. Then, we compute the mean reward across this dataset and use this mean to center the average reward. In other words, this mean is subtracted as a bias from the reward model’s output, ensuring that rewards predicted over the SFT dataset have an average of zero. Normalizing the reward model’s output benefits training stability for PPO.

Critic settings. We must also choose how to initialize the critic. In [4], the critic is initialized with the weights of the reward model at the beginning of PPO training. After all, the value model is effectively a reward model that predicts the reward on a per-token basis. Authors observe in [4] that the reward model’s predictions are usually negative for all tokens except the token; see below.

(from [4])

Therefore, the value estimated by the critic is negative for nearly every token at the start of PPO training. However, we see in [4] that warm starting the critic in this way helps to improve the initial stability of gradients during training.

Reward and advantage whitening. In addition to normalizing rewards after training the reward model, many PPO implementations perform reward and advantage whitening. An example implementation of the whitening operation is shown below, where the values can be a list of either rewards or advantages.

(from [4])

When whitening rewards, we usually do not shift the mean (i.e., shift_mean = False in the above code) so that we can retain the magnitude and sign of the rewards. However, the mean is usually shifted when whitening advantages. Based on results in [4], whitening rewards and advantages does not seem to have a huge positive or negative performance impact on the resulting policy. However, whitening is a common implementation detail in PPO. Usually, whitening is applied over the set of rewards or advantages within a batch of data.

“Where normalization bounds all the values from the RM to be between 0 and 1, which can help with learning stability, whitening the rewards or the advantage estimates… can provide an even stronger boost to stability.” - from [2]

Beware of dropout. We must also be sure to avoid using dropout in PPO. Dropout adds noise to the model’s forward pass, making the computation of policy ratios and KL divergence unreliable. This implementation detail can cause optimization issues and tends to be impactful—dropout is a perfect example of small but important practical details in PPO. For example, the OpenInstruct PPO script explicitly disables dropout in the policy, critic, reference, and reward models.

Final results. After enumerating various practical choices and hyperparameter settings, the policies in [4] successfully replicate the original results of [5]. PPO models outperform those trained with SFT, and there are clear scaling trends that can be observed (i.e., larger models achieve better performance metrics) for SFT models, reward models, and the final RL policies. Additionally, the preference rate of the RL policies over human reference summaries—as predicted by a GPT-3.5-based LLM judge—scales predictably with model size; see below.

(from [4])

Training language models to follow instructions with human feedback [8]

Going beyond the summarization domain, authors in [8] explore the use of RLHF for language model alignment by directly learning from human feedback. The resulting model, called InstructGPT, is the sister model and predecessor to ChatGPT. Since this model is outlined and explained in detail in [8], the work provides significant insight into how early LLMs at OpenAI were trained.

(from [8])

Following an approach similar to [5], we start with a set of prompts that are either written by human annotators or collected from OpenAI’s API. We can then have annotators write responses to these prompts and finetune a pretrained LLM—GPT-3 in particular—over these examples using SFT. Using this model, we can then collect comparison data by asking humans to select their preferred outputs from the LLM and apply the same RLHF process outlined in [5] for finetuning. As shown above, the resulting model is heavily preferred by humans and much better at following detailed instructions provided within the prompt.

“Making language models bigger does not inherently make them better at following a user’s intent.” - from [8]

The alignment process. Pretrained LLMs have a number of undesirable properties that we want to fix during post-training; e.g., hallucinations or an inability to follow detailed instructions. To fix these issues, we align the LLM in [8] according to the following set of criteria:

Helpful: follows the user’s instructions and infers intention from few-shot prompts or other patterns.
Honest: makes correct factual statements about the world.
Harmless: avoids harmful outputs, such as those that denigrate a protected class or contain sexual/violent content.

Using RLHF, we can teach an LLM to reflect each of these qualities within its output. Specifically, this is done by constructing preference pairs where the preferred responses are chosen based upon adherence to these criteria.

(from [8])

More on RLHF. Authors in [8] curate a team of 40 human annotators, who are screened with a test to judge their annotation quality, to collect preference data for the LLM. The approach for RLHF used in [8] matches the approach used in [5] almost completely. Using a pretrained LLM and a set of prompts for finetuning, the alignment process proceeds according to the following steps:

Collect human demonstrations of responses for each prompt.
Train the model in a supervised fashion over human demonstrations.
Collect preference data.
Train a reward model.
Optimize the underlying LLM or policy with PPO.
Repeat steps 3-5.

The distribution of prompts used for finetuning in [8] is outlined in the table below. For SFT, a dataset of over 13K prompt and response pairs is constructed. The reward model is trained over 33K prompts, while a dataset of size 31K is used for finetuning with PPO. Unlike [5], human annotators are shown 4-9 responses to a prompt (i.e., instead of two) when collecting comparison data, allowing them to quickly rank responses and generate larger amounts of comparison data more efficiently. However, later work on RLHF largely abandoned this approach in favor of binary preferences. The dataset used in [8] is also 96% English.

(from [8])

Similarly to [5], a KL divergence term between the policy and the SFT model is directly subtracted from the reward to avoid drifting too far away from the SFT model. Additionally, extra pretraining updates are “mixed in” to the RLHF optimization process, which authors find to help with maintaining the model’s performance across various benchmarks. These pretraining updates, which use a supervised loss, are simply added to the PPO loss used during RL.

“We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax—an additional cost for aligning the model.” - from [2]

Experimental findings. In [8], authors train three models with 1.3B, 6B, and 175B (i.e., same as GPT-3) parameters. From these experiments, we learn that human annotators prefer InstructGPT outputs over those of GPT-3, even for models with 10× fewer parameters; see below. This result is similar to observations in [5], where finetuning via RLHF enables much smaller models to outperform larger models trained in a supervised manner.

(from [8])

Notably, outputs from InstructGPT-1.3B are preferred to those of GPT-3, which has 100× more parameters. Additionally, we see that InstructGPT-175B produces outputs that are preferred to GPT-3 85% of the time. Going further, InstructGPT models are found to more reliably follow explicit constraints and instructions provided by a human user within the model’s prompt; see below.

(from [8])

Compared to pretrained and supervised models, InstructGPT is also found to be:

More truthful.
Slightly less toxic.
Generalizable to instructions beyond the training dataset.

For example, InstructGPT can answer questions about code and handle prompts written in different languages, despite the finetuning dataset lacking sufficient data within this distribution. Although the model did not receive as much recognition as ChatGPT, InstructGPT was a major step forward in AI that introduced many core concepts used for training modern LLMs.

Conclusion

PPO is one of the most widely used RL algorithms for LLMs that has—through its key role in RLHF pipelines—directly contributed to fundamental advancements in AI. As we learned, research on PPO was an important factor in the creation of models like InstructGPT and ChatGPT. These influential models catalyzed the ongoing boom in LLM research in which we currently find ourselves.

We cannot overstate the impact of PPO on LLM research, and PPO continues to play an important role in LLM post-training pipelines today. However, the barrier to entry for PPO is high due to its memory and compute overhead. Additionally, the results of PPO can vary based on a wide variety of practical implementation details and hyperparameter settings. For these reasons, most research on PPO has been centralized within top frontier labs. Only a small number of groups have sufficient compute resources to empirically tune and obtain a working PPO implementation at scale.

Nonetheless, understanding PPO is essential due to its fundamental role in AI research. The cost and complexity of PPO remains high, but RL researchers have recently expanded and improved upon ideas proposed by PPO. For example, REINFORCE and GRPO are simpler (and more stable) policy gradient algorithms that can be used to train LLMs, which use less memory than PPO by avoiding the critic. A working understanding of PPO makes understanding these new algorithms—or even developing our own—much simpler!

New to the newsletter?

Subscribe now

Bibliography

[1] Schulman, John, et al. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).

[2] Lambert, Nathan. “Reinforcement Learning from Human Feedback.” Online (2025). https://rlhfbook.com

[3] Schulman, John, et al. “High-dimensional continuous control using generalized advantage estimation.” arXiv preprint arXiv:1506.02438 (2015).

[4] Huang, Shengyi, et al. “The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization.” arXiv preprint arXiv:2403.17031 (2024).

[5] Stiennon, Nisan, et al. “Learning to summarize with human feedback.” Advances in neural information processing systems 33 (2020): 3008-3021.

[6] Schulman, John, et al. “Trust region policy optimization.” International conference on machine learning. PMLR, 2015.

[7] Lambert, Nathan, et al. “Tulu 3: Pushing frontiers in open language model post-training.” arXiv preprint arXiv:2411.15124 (2024).

[8] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in neural information processing systems 35 (2022): 27730-27744.

[9] Ahmadian, Arash, et al. “Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.” arXiv preprint arXiv:2402.14740 (2024).

[10] Biderman, Stella, et al. “Pythia: A suite for analyzing large language models across training and scaling.” International Conference on Machine Learning. PMLR, 2023.

As we can see, the discounted reward has an infinite horizon in this case. In other words, the total number of steps in the trajectory is infinite T = ∞. This is known as the infinite-horizon discounted return.

The VPG was also partially covered in my overview of REINFORCE that was released a few weeks ago; see here.

Specifically, if we wanted to solve a constrained optimization problem like this with gradient ascent, we would have to use constrained gradient ascent. However, this method requires that we project our solution into the space of valid solutions that satisfy the constraint after every optimization step, which would be computationally intractable for neural network parameters. The KL divergence is a very complex constraint for which to perform this projection!

More specifically, if the policy ratio is greater than 1 + ε, we set it equal to 1 + ε. If the policy ratio is less than 1 - ε, we set it to 1 - ε. Otherwise, we keep the value of the policy ratio unchanged.

The clipped objective will always be less than or equal to the unclipped objective due to the fact that we are taking the minimum of the unclipped and clipped objectives.

The “actor” refers to the LLM—or the model that is taking actions—and the “critic” refers to the value model. The value model is called a critic due to the fact that it is predicting the reward associated with each action (i.e., effectively critiquing the action).

For more details on loss aggregation in RL, see this section of the RLHF book, which provides concrete examples of different aggregation strategies and their impact.

The adaptive KL divergence is explained in Section 4 of [1]. Instead of setting a fixed scaling factor for the KL divergence, authors propose dynamically adjusting this factor throughout training such that the KL divergence stays close to a target KL divergence d_targ. Put differently, instead of choosing the scaling factor, we specify what we want our KL divergence to be and dynamically adjust the scaling factor throughout training to keep the KL divergence in this range. This approach is not commonly used for recent LLMs, and it is much more common to set a fixed β coefficient for the KL divergence.

The reference and old models are different models in PPO! The reference model is the policy parameters before any RL training is performed. For LLMs, the SFT model is usually the reference model. We usually perform multiple updates over a batch of data in PPO, and the old model is the model before the first update. The old model is updated each time a new batch of data is sampled, whereas the reference model is fixed.

This means that less data is required to achieve a given level of performance (i.e., the learning process is faster).

Specifically, we would use the cumulative reward after state s_t. However, for LLMs this distinction does not usually matter due to the use of outcome rewards.

In fact, this is where the name for the TD residual comes from. We are computing the difference in value between two time steps.

The critic is just a model that imperfectly estimates of the value function. The bias in the TD residual comes from the fact that the critic makes mistakes in estimating the value.

To derive this expression, we begin with the original formula for the GAE showed in the top line, expand the definitions of the N-step advantage estimates, rearrange the terms, then use the geometric series formula to derive the final expression.

This statement assumes that the KL divergence is added to the loss and not directly incorporated into the reward.

REINFORCE: Easy Online RL for LLMs

Cameron R. Wolfe, Ph.D. — Mon, 29 Sep 2025 09:33:23 GMT

Reinforcement learning (RL) is playing an increasingly important role in research on large language models (LLMs). Initially, RL was used to power LLM alignment via approaches like Reinforcement Learning from Human Feedback (RLHF). More recently, it has become foundational for training powerful large reasoning models (LRMs). When training LLMs with RL, online algorithms such as Proximal Policy Optimization (PPO) are often used by default. However, these algorithms are expensive and complex compared to alternatives like supervised finetuning (SFT) or direct preference optimization (DPO):

Four different copies of the LLM must be kept in memory.
The online training process is difficult to orchestrate and can be unstable.
There are many training hyperparameters that must be tuned properly.

The complexity of PPO arises from the need to stabilize the online training process. This algorithm was developed in an earlier generation of research, which focused on training neural networks from scratch to solve tasks like robotic locomotion and Atari gameplay. The RL setting for LLMs is much different—we are fine-tuning pretrained models that already have a powerful prior.

“PPO has been positioned as the canonical method for RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance.” - from [3]

Many practitioners avoid the use of online RL when training LLMs due to cost and complexity. In this overview, we will learn that online RL does not have to be so difficult! Due to the unique properties of the LLM domain, we can use simpler algorithms—like REINFORCE or REINFORCE leave-one-out (RLOO)—and still achieve performance similar to that of PPO. Therefore, instead of avoiding online RL in favor of simpler RL-free or offline alternatives, we can just use algorithms that provide the benefits of online RL without the unnecessary complexity.

Basics of RL for LLMs

We will begin by covering the basics of reinforcement learning (RL). To start, we will explore the problem setup and terminology commonly used in RL, as well as how these formalisms can be translated to the LLM domain. After covering RL fundamentals and how RL is applied in the context of LLMs, we will spend the majority of this section focusing on policy optimization by deriving the standard policy gradient expression frequently used in RL and outlining concrete implementations for the most basic forms of these training algorithms.

Problem Setup and Terminology for RL

When running RL training, we have an agent that takes actions within some environment; see below.

Basic problem setup for RL

These actions are predicted by a policy—we can think of the policy as the agent’s brain—that is usually parameterized (e.g., the policy is the LLM itself in the context of training LLMs). Our policy can either be deterministic or stochastic, but in this overview we will assume the policy is stochastic1. We can model the probability of a given action under our policy as π_θ(a_t | s_t).

When the policy outputs an action, the state of the environment will be updated according to a transition function, which is part of the environment. We will denote our transition function as P(s_t+1 | a_t, s_t). However, transition functions are less relevant for LLMs because they are typically a pass-through; i.e., we assume s_t = {x, a_1, a_2, …, a_t}, where x is the prompt.

Using the chain rule of probabilities, we can also compute the probability of a full trajectory by combining the probabilities of:

Each action a_t given by our policy π_θ(a_t | s_t).
Each state s_t+1 given by the transition function P(s_t+1 | a_t, s_t).

The full expression for the probability of a trajectory is provided below.

Computing the probability of a trajectory

RL objective. When training a model with RL, our goal is to maximize the cumulative reward over the entire trajectory (i.e., the sum of r_t). However, there are a few variations of this objective that commonly appear. Specifically, the reward that we maximize can either be discounted or non-discounted2; see below. By incorporating a discount factor, we reward our policy for achieving rewards sooner rather than later. In other words, money now is better than money later.

Our objective is usually expressed as an expected cumulative reward, where the expectation is taken over the trajectory. Expanding this expectation yields a weighted sum of rewards for each trajectory—the weight is just the trajectory’s probability. We can formulate this in a continuous or discrete manner; see below.

We want to maximize this objective during training, which can be accomplished via gradient ascent 3; see below. Given this setup, the lingering question that we have to answer is: How do we compute this gradient? As we will see, much of the research on RL focuses on answering this question, and many techniques exist.

Solving the RL objective with gradient ascent

State, value, and advantage functions. Related to RL objective, we can also define the following set of functions:

Value Function V(s): the expected cumulative reward when you start in state s and act according to your current policy π_θ.
Action-Value Function Q(s, a): the expected cumulative reward when you start in state s, take action a, then act according to your policy π_θ.
Advantage Function A(s, a): the difference between the action-value and value function; i.e., A(s, a) = Q(s, a) - V(s).

“Sometimes in RL, we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.” - Spinning up in Deep RL

Markov Decision Process (MDP) versus Bandit Formulation

RL terminology mapping for LLMs

Now that we understand RL basics, we need to map the terminology that we have learned to the setting of LLM training. We can do this as follows (shown above):

Our policy is the LLM itself.
Our initial state is the prompt.
The LLM’s output—either each token or the entire completion—is an action.
Our state is the combination of our prompt with the LLM’s output.
The entire completion from the LLM forms a trajectory.

Markov decision process (MDP) formulation. For LLMs, there are two key ways in which RL can be formulated that differ in how they model actions. We should recall that an LLM generates output via next token prediction; i.e., by generating each token in the output completion sequentially. This autoregressive process is depicted below. As we can see, the next token prediction process maps very easily to an RL setup—we can just model each token as an individual action!

The approach of modeling each token in the LLM’s output as an individual action is called the Markov Decision Process (MDP) formulation. An MDP is simply a probabilistic framework for modeling decision-making that includes states, actions, transition probabilities and rewards—this is exactly the setup we have discussed so far for RL! The MDP formulation used for RL is shown below.

When modeling RL as an MDP for LLMs, our initial state is the prompt and our policy acts by predicting individual tokens. Our LLM forms a stochastic policy that predicts a distribution over tokens. During generation, actions are taken by selecting a token from this distribution—each token is its own action. After a token is predicted, it is added to the current state and used by the LLM to predict the next token—this is just autoregressive next token prediction! Eventually, the LLM predicts a stop token (e.g., <|end_of_text|> or ) to complete the generation process, thus yielding a complete trajectory.

Bandit formulation. In the above depiction of an MDP, we assume that a reward is provided for every time step, but the reward mechanism for an LLM is usually a bit different from this. Most LLMs are trained using outcome supervision4, meaning that a reward is only assigned after the model has generated a complete response (i.e., after the token has been outputted).

Outcome versus process supervision for LLMs

In an outcome supervision setting, we may begin to question the utility of modeling each token as its own action. How will we know whether any single action is helpful or not in this scenario? As an alternative, we could model the entire response as a single action that receives an outcome reward. This is the key idea behind the bandit formulation for RL training with LLMs; see below.

This name comes from the idea of a contextual bandit in probability theory. The bandit setup is simple: our agent chooses an action, receives a reward and the episode ends. Our complete trajectory is a single action and reward! For LLMs, our action is the full completion generated for a prompt, which receives an outcome reward.

Which formulation should we use? In the context of LLMs, we already know how to compute the probability of both individual tokens and the full completion for a prompt. Therefore, we have the ability to model RL using either an MDP or bandit formulation. Given that LLMs usually only receive outcome rewards, however, the bandit formulation—despite being very simple—is quite fitting for LLMs. As we will learn, both REINFORCE and RLOO adopt the bandit formulation, while algorithms like PPO use a per-token MDP formulation. In other words, both RL formulations are viable and used for training LLMs.

RL Training for LLMs

Given the terminology and setup explained so far, we can now discuss how RL is actually used to train LLMs. There are two broad categories of RL training that are commonly used for LLMs today:

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a human preference reward model.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.

Visual depiction of RL for LLMs

The last step of this process is a gradient ascent step on the RL objective, just as we saw before. However, the actual objective used in RL training goes beyond maximizing cumulative reward. We try to maximize the reward while minimizing KL divergence of our policy with respect to a reference policy—usually an LLM checkpoint from the start of RL training. We want to maximize reward without making our new model significantly different from the reference; see below.

RL training objective with KL divergence

Computing the gradient of this objective with respect to the policy’s parameters is where most of the complexity lies in understanding RL. In the context of LLMs, we use policy gradient algorithms (e.g., PPO, GRPO, and REINFORCE) to compute this gradient. This overview will primarily focus on REINFORCE and its variants, but to learn how these algorithms work we need to first understand the simplest form of a policy gradient—the vanilla policy gradient (VPG).

Deriving the Vanilla Policy Gradient (VPG)

We will cover the full derivation of the vanilla policy gradient (VPG) here for completeness. However, there are many existing overviews that explain VPG very well. A few great resources for further learning are as follows:

Intro to Policy Optimization from OpenAI [link]
RLHF Book from Nathan Lambert [link]
Policy Optimization Algorithms from Lilian Weng [link]

Additionally, the prior breakdown of VPG and policy optimization from this newsletter is linked below for easy reference. Our discussion in this section will largely be sampled from this more detailed exposition of policy gradients.

A basic policy gradient. Our goal in policy optimization is to compute the policy gradient, or the gradient of our RL objective—here we will assume our objective is cumulative reward—with respect to the parameters of our policy. As a first step in computing the policy gradient, we can perform the derivation shown below.

(source)

This derivation starts with the gradient of our RL training objective (cumulative reward) and ends with a basic expression for the policy gradient. To arrive at the policy gradient, we use mostly simple steps like i) the definition of an expectation over a continuous random variable and ii) the log-derivative trick.

The most complicated step of this derivation is the final step, which transforms the gradient of the log probability of a trajectory into a sum over the gradients of log probabilities of actions. This step uses our prior expression for the probability of a trajectory, converts the product into a sum (i.e., because we are working with log probabilities), and observes that the gradients of the initial state probability and transition function with respect to the policy parameters are always zero because neither of these components depend on the policy; see below.

(source)

Implementing a basic policy gradient. The basic policy gradient expression that we derived above is actually pretty easy to compute. Specifically, this expression contains two key quantities that we already know how to compute:

The reward comes directly from a verifier or reward model.
Log probabilities of actions can be computed with our LLM (i.e., these are just the token probabilities from the LLM’s output).

To make the process of computing the basic policy gradient more concrete, a step-by-step implementation in PyTorch pseudocode has been provided below.

The core intuition behind the structure of this basic policy gradient is that we are increasing the probability of actions from trajectories with high rewards.

“Taking a step with this gradient pushes up the log-probabilities of each action in proportion to R(𝜏), the sum of all rewards ever obtained.” - Spinning up in Deep RL

This form of the policy gradient is simple, but it still appears in practice! For example, Cursor uses this exact expression in their recent blog on online RL. However, the expression in their blog assumes a bandit formulation, which causes the sum in the expression to be removed (i.e., because there is only one action).

(source)

Reducing variance. Our current policy gradient expression is simple, but it suffers from a few notable issues:

The gradients can have high variance.
There is no protection against large, unstable policy updates.

Most subsequent policy gradient algorithms aim to solve these problems by reducing variance of the policy gradient and enforcing a trust region on policy updates—or, in other words, restricting how much we can change the model in a single update. To do this, we usually replace the reward term in our policy gradient with a slightly different term; see below for some of the common options.

(from [4])

As we can see, this expression is nearly identical to what we saw before. The only difference is that we have switched R(𝜏) with the generic Ψ_t term, which can be set equal to a couple of different things. For example, we can:

Set Ψ_t = R(𝜏) to recover our basic policy gradient expression.
Set Ψ_t equal to rewards received after time t (i.e., the reward-to-go policy gradient) to avoid crediting actions with rewards that came before them.
Set Ψ_t to a baselined version of the reward.
Set Ψ_t equal to the state-action (Q) or advantage function (A).

A full overview of these choices and how they are derived can be found here. A common theme among these algorithms is the use of baselines, or extra terms—which must only depend on the state s_t—that we subtract from the reward as shown below. Baselines serve the purpose of normalizing the reward (or value) for a state and can be shown to reduce the variance of policy gradients5.

Adding a baseline to rewards in the policy gradient

A common problem with vanilla policy gradient algorithms is the high variance in gradient updates… In order to alleviate this, various techniques are used to normalize the value estimation, called baselines. Baselines accomplish this in multiple ways, effectively normalizing by the value of the state relative to the downstream action (e.g. in the case of Advantage, which is the difference between the Q value and the value). The simplest baselines are averages over the batch of rewards or a moving average. - RLHF book

Most of the algorithms we will see focus on setting Ψ_t equal to the advantage function—this is known as the vanilla policy gradient (VPG) algorithm. The advantage function is commonly used because it yields the lowest-variance policy gradient.

The vanilla policy gradient

Actor-critic. We should recall that the advantage function is the difference between the state-action value function and the value function. In other words, the VPG algorithm effectively uses the value function as a baseline in the policy gradient. The value function is on-policy, meaning that it depends on the exact parameters of our policy in the current training iteration. Usually, we estimate the value function with a neural network. For LLMs, the value function is approximated with a separate value head6 (or model) that is initialized from the weights of the LLM and trained to predict the value function.

The LLM used to estimate the value function is referred to as a value model or critic. The critic predicts the value function—or the expected reward starting from a given token or state—for every token within a sequence. During RL training, the critic is actively updated alongside the LLM for each policy update—this is referred to as an actor-critic setup7. Unlike reward models which are fixed at the beginning of RL training, the critic is dependent upon the current parameters of the policy. Therefore, to remain on-policy and avoid its predictions becoming stale, the critic must be updated along with the LLM itself. PPO is a notable example of a policy gradient algorithm that adopts such an actor-critic setup.

The critic is usually updated using a mean-squared error (MSE) loss between the predicted and actual rewards. A pseudocode implementation of an actor-critic algorithm is provided below. Although this is a common setup, the use of a value model can be quite expensive—this requires keeping an entire additional copy of the LLM in memory! In fact, using a critic is part of the reason why PPO has high computational overhead. Next, we will learn about algorithms that adopt simpler and more efficient approaches for estimating the value function.

import torch
import torch.nn.functional as F

# sample prompt completions and rewards
with torch.no_grad():
    completions = LLM(prompts)  # (B*G, L)
    rewards = RM(completions)  # (B*G, 1)

# compute value function / critic output
values = CRITIC(completions)  # (B*G, L) - per token!
advantage = rewards - values.detach()

# get logprobs for each action
completion_mask = <... mask out padding tokens ...>
llm_out = LLM(completions)
token_logp = F.log_softmax(llm_out, dim=-1)

# loss includes a weighted combination of the policy gradient
# loss and the MSE loss for the critic
loss = (- token_logp * advantage) * completion_mask
loss += _beta * (0.5 * (values - rewards)**2)

# aggregate the loss (many options exist here)
loss = (loss.sum(axis=-1) /
        completion_mask.sum(axis=-1)).mean()

# gradient update
optimizer.zero_grad()
loss.backward()
optimizer.step()

REINFORCE and RLOO for LLMs

So far, we have learned about basic concepts in policy optimization and RL for LLMs. The basic policy gradient that we derived is easy to compute practically, but such a formulation leads to high-variance policy gradients and unstable training. To reduce variance, we need an RL optimizer that incorporates an advantage estimate into the policy gradient. However, popular algorithms like PPO accomplish this with a complicated actor-critic framework that introduces substantial overhead. Given this added complexity, we might wonder: Should we just avoid online RL techniques altogether when training LLMs?

“Recent works propose RL-free methods such as DPO or iterative fine-tuning approaches to LLM preference training. However, these works fail to question whether a simpler solution within an RL paradigm exists.” - from [3]

Although many offline and RL-free training alternatives exist, there are also simple online RL algorithms that can be used to train LLMs. In this section, we will learn about REINFORCE and a slightly modified version of this algorithm called REINFORCE leave one out (RLOO). These online RL algorithms eliminate the need for a critic by estimating the value function with the average of rewards observed throughout training. In theory, such an approach yields higher-variance policy gradients compared to actor-critic algorithms like PPO. However, recent research [3, 5] has found that this increase in variance does not impact LLM training, yielding easy-to-use and highly-performant options for online RL training.

REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility (REINFORCE) [1]

REINFORCE is a particular implementation of VPG that has low overhead, is simple to understand, and tends to be effective for training LLMs. The structure of the policy gradient used by REINFORCE is similar to the baselined policy gradient estimate we covered before. However, REINFORCE specifically uses the average of rewards observed during RL training as a baseline. This average can be computed in a few different ways; e.g., a moving average of rewards throughout training or an average of rewards present in the current batch.

The expression for the policy gradient in REINFORCE is shown above. To compute a gradient update over a batch, we perform the following steps:

Generate a completion for each prompt using the current policy.
Store the log probabilities for the tokens in each completion
Assign a reward to each completion (usually with a reward model).
Obtain a baseline by taking an average of rewards.
Compute the advantage by subtracting the baseline from the reward.
Compute the sum of log probabilities multiplied by the advantage for each completion, then average over the batch to form a Monte Carlo estimate.

What does the acronym mean? The REINFORCE acronym is composed of three key components:

Reward Increment.
Non-negative factor.
Offset reinforcement.
Characteristic eligibility.

The first component is simply our update—or increment—to the policy’s parameters (i.e, the policy gradient), which is a product of the three other components. The manner in which these components are combined to form the policy gradient is shown below (top term). To clarify the meaning of each term, we also map the components of REINFORCE to the more familiar expression for a policy gradient. As we can see, these are the same terms we have learned about before (e.g., log probabilities, reward, and baseline)! Additionally, REINFORCE includes the learning rate—a “non-negative factor” because we are performing gradient ascent and trying to maximize rewards—within its expression.

Mapping REINFORCE components to a familiar policy gradient expression

The term “offset reinforcement” is straightforward to understand. The baseline is directly subtracted from the reward in our policy gradient expression. In other words, the baseline is used to offset the reward, which is the reinforcement signal in RL (i.e., the reward determines whether actions are good or bad). The baseline is, therefore, an offset to the reinforcement signal. Unpacking the term “characteristic eligibility” requires a slightly deeper understanding of RL terminology.

“Characteristic Eligibility: This is how the learning becomes attributed per token. It can be a general value, per parameter, but is often log probabilities of the policy in modern equations.” - RLHF book

“Eligibility” is a jargon term in RL related to the credit assignment problem—or the problem of determining which specific actions contributed to the reward received by the policy. Specifically, eligibility refers to whether a particular action taken by the LLM is actually responsible for a given reward. In the policy gradient expression, credit assignment is handled by the log probabilities of actions under the policy.

Incorporating KL divergence. As with most other RL training algorithms, we also incorporate the Kullback-Leibler (KL) Divergence with respect to a reference policy—usually a prior SFT-trained checkpoint of our model—into REINFORCE. We have several different approaches for approximating KL divergence. A common approach is to approximate KL divergence as the difference in log probabilities between the policy and reference policy. Once we’ve made this approximation, the KL divergence is directly incorporated into the reward as shown below.

This approach of subtracting the KL penalty from the reward varies depending on the RL training algorithm or implementation. For ex ample, GRPO incorporates the KL divergence into the loss function rather than directly into the reward. Adding the KL divergence into RL regularizes the training process and allows us to ensure that our policy does not deviate significantly from the reference policy.

Efficiency & overhead. Compared to algorithms like PPO, REINFORCE has reduced overhead, as it does not require the use of a value (or critic) model to compute the advantage estimate—the average of rewards is used in place of the critic. Therefore, there are only three LLM involved in the training process (i.e., policy, reference policy, and reward model), rather than four; see below. The downside of estimating the advantage in this way is higher variance. As we will see, however, the high variance of REINFORCE is not always a problem in the domain of finetuning LLMs—this simple algorithm is actually quite effective in practice.

Key models involved in training with REINFORCE

Modeling full completions. There is one final detail missing from the image above: How do we aggregate the log probabilities, KL divergences, and rewards to form the policy gradient update? One of the key distinguishing aspects of REINFORCE is that is uses a bandit formulation. The policy is trained by considering the full completion, rather than each token in the completion, as a single action.

“[REINFORCE] treats the entire model completion as a single action, whereas regular PPO treats each completion token as individual actions. Typically, only the EOS token gets a true reward, which is very sparse. Regular PPO would attribute a reward to the EOS token, whereas [REINFORCE] would attribute that EOS reward to the entire completion.” - from [5]

As we’ve learned, most LLMs are trained using an outcome reward setting, meaning that only the final token generated by the LLM is assigned a reward. However, the KL divergence is computed on the per-token basis, and—as mentioned before—the KL divergence is directly subtracted from the reward in REINFORCE. Therefore, we end up with a setup where the reward for all tokens in the completion is just the KL divergence, but the final token in the completion receives an additional reward from the reward model; see below.

Bandit formulation in REINFORCE

We create a completion-level (bandit formulation) reward by summing per-token KL divergences and rewards over the sequence. Similarly, we can sum token-level log probabilities to get the log probability of the completion (or trajectory)8. As shown above, we can then use these completion-level components to compute the policy gradient similarly to before:

Subtract the baseline (average reward) from the completion-level reward.
Multiply this difference by the completion log probability.
Run a backward pass to compute the final policy gradient9.

This process computes the policy gradient for a single prompt and completion pair, but we generally average this gradient over a batch of completions.

Pseudocode. As a final step, we will make this discussion more concrete by implementing the computation of the policy gradient for REINFORCE in basic PyTorch10. We will assume that the baseline is computed by taking an average of rewards in the batch (i.e., rather than by using a moving average) so that the entire gradient update can be outlined within a single script; see below.

import torch

# constants
kl_beta = 0.1

# batch of two completions with three tokens each
per_token_logprobs = torch.tensor(
    [
        [-12.3, -8.3, -2.3],
        [-10.0, -7.0, -3.0],
    ],
    requires_grad=True,
)
reference_per_token_logprobs = torch.tensor([
    [-11.3, -8.4, -2.0],
    [-9.5, -7.2, -2.8],
])

# compute KL divergence approximation
kl_div = per_token_logprobs - reference_per_token_logprobs
kl_div = -kl_beta * kl_div

# get reward for each completion (e.g., from reward model)
score_from_rm = torch.tensor([1.0, 0.5])

# reward is attributed to final  token
per_token_reward = kl_div.clone()
per_token_reward[range(per_token_reward.size(0)), -1] += score_from_rm

# compute REINFORCE update over full sequence
entire_completion_reward = per_token_reward.sum(dim=1)
baseline = entire_completion_reward.mean().detach()

# compute advantage
advantage = entire_completion_reward - baseline

# compute loss and gradient update
reinforce_loss = -per_token_logprobs.sum(dim=1) * advantage
reinforce_loss.mean().backward()

REINFORCE Leave One Out (RLOO) [2]

In REINFORCE, we generate a single on-policy completion per prompt during training and use the rewards from these completions to form our baseline via a moving average or an average of rewards in the batch. REINFORCE leave-one-out (RLOO) [2] changes this approach by:

Sampling multiple (K) completions per prompt.
Using these multiple completions to compute the reward average separately for each individual prompt.

Given K completions {y_1, y_2, …, y_K} to a prompt x, RLOO defines the baseline for completion y_i as shown below, which is simply an average over all rewards for completions to prompt x excluding the completion itself y_i. We “leave out” the reward of the completion for which the policy gradient is being computed and average over the rewards of other completions to the same prompt.

Computing the baseline for RLOO

From here, we can compute the advantage estimate in RLOO by i) computing this baseline for every completion in the batch and ii) subtracting the baseline from the reward received by the completion; see below (first equation). To efficiently compute the baseline for RLOO, we can first compute a fixed average reward over the K completions and reformulate the advantage as in the second equation below. This approach allows us to compute the average reward once and avoid re-computing the leave one out average for all K completions to the prompt x.

Advantage estimate in RLOO

This modified advantage estimate can be plugged into the same policy gradient expression used by REINFORCE. Similarly to REINFORCE, RLOO uses a per-completion—as opposed to per-token—loss, and we have no learned value model. However, the leave one out baseline used by RLOO lowers variance relative to the standard REINFORCE algorithm by using multiple samples per prompt to derive the policy gradient estimate. Compared to a single-sample approach, taking multiple samples per prompt benefits training stability, speed, and performance.

“The common case of sampling one prediction per datapoint is data-inefficient. We show that by drawing multiple samples per datapoint, we can learn with significantly less data, as we freely obtain a REINFORCE baseline to reduce variance.” - from [2]

Practical usage. After the popularization of RLOO for LLMs, a great blog on this topic was published by HuggingFace [5] exploring the implementation and practical performance of RLOO. This analysis extends authors’ prior work on correctly implementing and tuning PPO-based RLHF on summarization tasks [6]—OpenAI’s TL;DR summarization dataset in particular. In [5], these results are extended by training Pythia 1B and 6.9B models with RLOO, starting from the same SFT checkpoints and reward models from [6]. Models are evaluated by comparing their output to a reference summary with a GPT-4 judge; see below.

(from [5])

As we can see, RLOO uses 50-70% less memory than PPO and runs 2-3× faster. These savings increase with the size of the model. In addition to these gains in efficiency, RLOO performs competitively to PPO and consistently outperforms offline algorithms like DPO. These results demonstrate the key value proposition of RLOO (and REINFORCE)—these algorithms maintain the performance benefits of online RL algorithms while being simpler to implement and less costly to run.

Pseudocode. To implement RLOO, we can modify our original REINFORCE example as shown below. Here, we assume that three completions are sampled per prompt (i.e., K = 3) and that our batch is composed of three prompts. For more production-ready code, both REINFORCE and RLOO are also supported within the volcano engine reinforcement learning (verl) library [7]; see here.

import torch

# constants
K = 3  # completions per prompt
kl_beta = 0.1

# batch of three prompts with three completions each
per_token_logprobs = torch.tensor(
    [
        # prompt 1
        [
            [-12.3, -8.3, -2.3], # completion 1
            [-10.0, -7.0, -3.0], # completion 2
            [-10.5, -12.2, -9.1], # completion 3
        ],

        # prompt 2
        [
            [-11.0, -10.3, -1.3],
            [-11.1, -11.1, -0.8],   
            [-8.2, -11.9, -0.1],        

        ],
        
        # prompt 3
        [
            [-1.8, -2.1, -0.2],
            [-0.7, -3.5, -0.1],
            [-1.0, -2.2, -1.1],
        ],
    ],
    requires_grad=True,
)
reference_per_token_logprobs = torch.tensor([
    [
        [-11.8, -8.4, -2.3], 
        [-10.1, -7.2, -3.1],
        [-10.3, -12.9, -9.1],
    ],
    [
        [-11.8, -9.7, -1.3],
        [-12.3, -11.9, -0.2],
        [-8.1, -12.0, -0.5],
    ],
    [
        [-2.7, -2.0, -1.2],
        [-0.7, -3.6, -0.2],
        [-0.7, -1.2, -0.9],
    ],
])

# compute KL divergence approximation
kl_div = per_token_logprobs - reference_per_token_logprobs
kl_div = -kl_beta * kl_div

# reward for each completion (grouped by prompt)
score_from_rm = torch.tensor([
    [1, 2, 3], # rewards for completions to prompt 1
    [2, 3, 4], # rewards for completions to prompt 2
    [3, 4, 5], # rewards for completions to prompt 3
]).float()

# reward attributed to final  token
per_token_reward = kl_div.clone()
per_token_reward[:, :, -1] += score_from_rm

# compute full sequence reward
entire_completion_reward = per_token_reward.sum(dim=-1)

# compute RLOO baseline in vectorized fashion
baseline = (
    entire_completion_reward.sum(dim=-1)[:, None]
    - entire_completion_reward
) / (K - 1)
baseline = baseline.detach()

# compute advantage and loss
advantage = entire_completion_reward - baseline
rloo_loss = -per_token_logprobs.sum(dim=-1) * advantage
rloo_loss.mean().backward()

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs [3]

“We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance.” - from [3]

Although PPO is the de facto RL optimizer for RLHF, authors in [3] argue that the original motivations for PPO (i.e., avoiding large and unstable policy updates) are less relevant in the context of LLMs. Instead, we can use simpler RL optimizers—REINFORCE in particular—to save on compute and memory costs without sacrificing performance. In particular, we learn that aligning LLMS with a basic REINFORCE algorithm can achieve results that match or exceed those of PPO-based RLHF, as well as other algorithms like DPO and RAFT. This paper was a key contribution that popularized the use of simpler RL optimizers for LLMs.

LLMs versus DeepRL. The crux of the argument in [3] revolves around the fact that LLM finetuning is a unique setting for RL that differs significantly from the traditional DeepRL setting in which algorithms like PPO were proposed. The most notable difference between these two settings is that LLMs are not trained with RL from scratch. Rather, we are finetuning an LLM that has already undergone extensive pretraining. This difference has two key implications:

The risk of policy updates with catastrophically large variance is lower in LLM finetuning relative to the traditional DeepRL setting.
The LLM finetuning setting has less of a need for regularizing the learning process relative to the traditional DeepRL setting.

We can concretely test this hypothesis by tweaking the settings of PPO. Namely, most implementations of PPO use Generalized Advantage Estimation (GAE) [4] to estimate the advantage function. The details of GAE are beyond the scope of this post. However, GAE contains the λ ∈ [0.0, 1.0] hyperparameter that can be used to control the tradeoff between bias and variance in the advantage estimate.

(from [3])

Lowering λ reduces variance at the cost of increased bias, but this is a worthwhile tradeoff for domains—like DeepRL—with excessive variance in policy updates. As shown above, optimal performance in LLM alignment is achieved with a setting of λ = 1.0, which induces maximum possible variance in the policy gradient. Such a finding indicates that the level of variance in policy updates observed for LLM alignment is not detrimental to the LLM’s learning process.

“Large off-policy updates in our optimization regime are rare and do not have catastrophic effects on learning as they do in traditional DeepRL.” - from [3]

Effective action space. In addition to high variance, one complicating factor of RL training is the presence of a large action space. If there are many possible actions for the policy to take and rewards from these actions are noisy, learning a high-quality policy is difficult. Theoretically, the action space of an LLM is very large—it includes all completions that the LLM can generate for a given prompt.

(from [3])

Practically speaking, however, the effective action space of an LLM—the set of completions that the model is likely to generate—is actually quite small. When an LLM is performing generation, this process is conditioned upon the prompt provided to the LLM, which is shown in [3] to be a strong conditioning.

(from [3])

More specifically, we see in the figure above that probability mass in an LLM’s completions is highly-concentrated amongst a small number of tokens after the first step of the generation process (i.e., the first token that is outputted). Such an observation demonstrates that an LLM’s prompt provides strong conditioning for the generation process, which makes the mode’s effective action space quite small.

From PPO to REINFORCE. Given that variance is less of a concern for LLMs, authors in [3] perform RLHF experiments that use much simpler REINFORCE and RLOO algorithms as the RL optimizer in place of PPO. REINFORCE and RLOO make significant changes to the RL formulation used in PPO. Namely, PPO uses a per-token MDP formulation, while both REINFORCE and RLOO adopt a bandit formulation—the entire completion is modeled as a single action.

“We show that the modeling of partial sequences is unnecessary in this setting where rewards are only attributed to full generations… it is more appropriate and efficient to model the entire generation as a single action with the initial state determined by the prompt.” - from [3]

In addition to being simpler than the MDP formulation, modeling the full generation as a single action preserves the LLM’s performance and even speeds up learning, indicating that formulating each token as its own action is an unnecessary complexity in an outcome reward setting.

Experimental setup. Experiments in [3] are conducted on the TL;DR summarize and Anthropic HH datasets with Pythia-6.9b and Llama-7b models. Both reward models and policies are initialized using a model checkpoint obtained by running SFT on a curated dataset of high-quality completions for each respective dataset. During RL, training prompts are sampled from the SFT dataset. For evaluation, authors report each model’s average reward—from the fixed reward model used for RL training—on a hold out test set, as well as win-rates against GPT-4 using the AlpacaFarm framework (i.e., open-ended evaluation on chat-style prompts).

(from [3])

Is REINFORCE effective? As shown above, both REINFORCE and RLOO—in addition to being less memory intensive due to their lack of a learned critic model—consistently outperform PPO, confirming that modeling partial sequences is unnecessary for the RLHF setting in [3]. RLOO is also found to be more sample efficient than the RAFT algorithm [9]—given the same number of on-policy samples generated during training, RLOO tends to achieve better performance; see below.

(from [4])

This finding holds true for all models and data tested in [3]. The superior sample efficiency of RLOO makes intuitive sense given that all samples—even those with poor or negative reward—are used during training. In contrast, RAFT filters samples based on their reward and only trains on those with the best rewards.

When we evaluate models in terms of simulated win-rates on AlpacaFarm, many of the results above continue to be true, but we can compare the performance of each technique in a more human-understandable manner. As shown below, the best performance is consistently achieved with RLOO, and both REINFORCE and RLOO consistently outperform PPO. Notably, RLOO—with four on-policy samples per prompt—outperforms PPO by an absolute increase in win-rate of 10.4% and 14.5% for TL;DR and HH datasets. When used to align Llama, RLOO sees an even larger absolute win-rate improvement of 32.1% over PPO.

(from [3])

Improved robustness. Authors in [3] conclude by studying the robustness of RLOO relative to RAFT in two areas:

How does increasing the β term for KL divergence impact performance?
How does adding noise to the reward estimate impact performance?

Interestingly, RLOO is found to be noticeably more robust to noise relative to RAFT; see below. When increasing β, RAFT performs worse than RLOO and produces a policy with a larger KL divergence relative to the reference policy. Additionally, the performance of RAFT sees a larger negative impact from noisy reward estimates relative to RLOO. Such degraded robustness to noise is caused by the fact that RAFT only trains on the highest-reward completions, leading any perturbation to reward estimates to significantly impact training.

(from [3])

Conclusion

We now have a foundational understanding of RL for LLMs that spans from basic terminology to functional implementations of online RL algorithms. Most work on RL training for LLMs uses actor-critic algorithms like PPO as the underlying optimizer. But, these algorithms introduce complexity and overhead to reduce the variance of policy gradients. In the context of LLMs, we have learned that much simpler online RL algorithms are available! REINFORCE and RLOO adopt a completion-level bandit setup for RL and normalize rewards using either:

The average of rewards during training (for REINFORCE), or
The average of rewards for other completions to a prompt (for RLOO).

Because they estimate the value function in this way, neither REINFORCE or RLOO require a learned critic, which reduces memory overhead and speeds up the training process. If we want to avoid the complexity of algorithms like PPO, these simpler online RL algorithms offer an effective alternative, rather than immediately turning to approaches that are completely offline or RL-free.

New to the newsletter?

Subscribe now

Bibliography

[1] Williams, Ronald J. “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine learning 8.3 (1992): 229-256.

[2] Kool, Wouter, Herke van Hoof, and Max Welling. “Buy 4 reinforce samples, get a baseline for free!.” (2019).

[3] Ahmadian, Arash, et al. “Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.” arXiv preprint arXiv:2402.14740 (2024).

[4] Schulman, John, et al. “High-dimensional continuous control using generalized advantage estimation.” arXiv preprint arXiv:1506.02438 (2015).

[5] Costa Huang, Shengyi, et al. “Putting RL back in RLHF” https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo (2024).

[6] Huang, Shengyi, et al. “The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization.” arXiv preprint arXiv:2403.17031 (2024).
[7] Sheng, Guangming, et al. “Hybridflow: A flexible and efficient rlhf framework.” Proceedings of the Twentieth European Conference on Computer Systems. 2025.

[8] Lightman, Hunter, et al. “Let’s verify step by step.” The Twelfth International Conference on Learning Representations. 2023.

[9] Dong, Hanze, et al. “Raft: Reward ranked finetuning for generative foundation model alignment.” arXiv preprint arXiv:2304.06767 (2023).

In other words, the output of our policy is not just a discrete action. Rather, it is a probability distribution over a set of possible actions. For example, LLMs output a probability distribution over the set of potential next tokens.

Additionally, we can have a finite or infinite-horizon setup in this return. However, in the context of LLMs, we usually assume a finite-horizon setup (i.e., the LLM does not continue generating tokens forever).

Here, we use gradient ascent (as opposed to descent) because we are trying to maximize a function. However, gradient ascent and descent are nearly identical. The only change is whether we subtract—if minimizing a function in gradient descent—or add—if maximizing a function in gradient ascent—the gradient to our model’s parameters.

Process supervision is possible and has been explored in research on large reasoning models (LRMs), but it is less common than the outcome reward setting.

Additionally, adding baselines to the policy gradient does not bias our gradient estimate. This fact can be proven by using the EGLP lemma, which also mandates that the baseline must only depend on the state s_t.

By “head”, we mean an extra small layer added to the end of the LLM that is trainable.

This stems from basic concepts in language modeling. Namely, we can take the product of probabilities for all tokens in a completion (or the sum of log probabilities) to get the probability of the full completion.

Our policy gradient term contains the gradient of log probabilities, but we have access to log probabilities (not the gradient of log probabilities) in our example. So, we need to take the gradient of these log probabilities—usually by running loss.backward() in PyTorch—to get the final policy gradient.

This implementation, as well as our later implementation of RLOO, is just a modified version of the code from this blog post.

Online versus Offline RL for LLMs

Cameron R. Wolfe, Ph.D. — Mon, 08 Sep 2025 09:33:21 GMT

(from [2, 5, 7, 9, 10])

The alignment process teaches large language models (LLMs) how to generate completions that receive high human preference scores. The traditional strategy for alignment includes supervised finetuning and proximal policy optimization (PPO)-based reinforcement learning from human feedback (RLHF). Although this approach works well, PPO-based RLHF is an online RL training algorithm that is complex to implement for a variety of reasons:

PPO actively runs inference to generate samples with the current LLM—known as “on-policy” samples—during training. The real-time generation of on-policy data is what makes PPO an online algorithm.
Online RL training is difficult to efficiently orchestrate—especially in synchronous training setups—and often suffers from stability issues
PPO requires storing multiple copies of the LLM during training, leading to significant memory overhead and high hardware requirements.
PPO involves a wide range of training settings and design decisions that must be managed for successful training [21].

We can try to avoid the complexities of online RL by i) using lower-overhead online RL algorithms, ii) developing offline algorithms, or even iii) eliminating RL from the alignment process altogether. However, online RL is highly performant, and simpler alignment algorithms tend to come at cost in performance.

“Some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient.” - from [5]

In this overview, we will explore alternatives to online, PPO-based reinforcement learning from human feedback for LLM alignment. In particular, our focus will be on analyzing the performance gap between online algorithms that perform on-policy sampling and offline algorithms that train the LLM over a fixed dataset. By studying papers in this area, we will answer the following questions:

Is reinforcement learning needed for high-quality LLM alignment?
Is sampling on-policy training data important for alignment?

As we will see, on-policy sampling provides a clear performance advantage, creating a gap between online and offline alignment algorithms. However, offline (or RL-free) approaches can still be effective despite this online-offline gap. In particular, enhancing offline algorithms with on-policy data can form semi-online algorithms that are effective and easier to implement relative to full online RL.

Alignment Algorithms for LLMs

To begin, we will quickly delve into the role of alignment in LLM training and outline the many variants of online and offline alignment algorithms that currently exist. Modern LLMs are trained in several stages, as depicted in the figure above. The key training stages for an LLM are as follows:

Pretraining is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a next token prediction training objective; see here.
Supervised finetuning (SFT) or instruction finetuning (IFT) also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate; see here.
Reinforcement learning from human feedback (RLHF) or preference finetuning (PreFT) uses reinforcement learning (RL) to train the LLM over human preference data; see here.
Reinforcement learning from verifiable rewards (RLVR) or reinforcement finetuning (RFT) trains the LLM with RL on verifiable tasks, where a reward can be derived deterministically from rules or heuristics.

We can group the training strategies outlined above into distinct stages; see below. The pretraining (and midtraining) process focuses on building the core knowledge base of the LLM, while alignment teaches the LLM correct formatting and style for maximizing human preference scores. Reasoning training is a final step that yields an additional boost in performance on verifiable tasks.

Grouping training steps into distinct stages

This overview focuses on LLM alignment and the many algorithms—including SFT and many forms of RL-based and RL-free RLHF—that have been proposed. We will focus especially on the role and necessity of online RL—as opposed to using simpler, offline alignment algorithms—in the RLHF training process. In this section, we will kick off this discussion by explaining the many options that exist for alignment algorithms, including both online and offline algorithms.

Supervised Finetuning (SFT)

Next-token prediction training objective

One of the simplest LLM alignment strategies is supervised finetuning (SFT), which adopts the same next token prediction training objective used during pretraining. We train the LLM to predict the next token in a sequence given all prior tokens as context (shown above)—this is a self-supervised training objective that can be applied efficiently to large volumes of raw text data. A basic implementation of the next token prediction training objective is provided below for reference.

import torch
import torch.nn.functional as F

# token_indices: (batch_size, seq_length)
logits = LLM(token_indices)  # (batch_size, seq_length, vocab_size)

# shift to predict next token at each position
logits = logits[:, :-1, :]  # (batch_size, seq_length - 1, vocab_size)
targets = token_indices[:, 1:]  # (batch_size, seq_length - 1)

# resize tensors for cross-entropy loss
logits = logits.reshape(-1, logits.size(-1))
targets = targets.reshape(-1)

# compute cross-entropy loss
loss = F.cross_entropy(logits, targets)

During pretraining, this training objective is applied over a massive corpus of text scraped from the internet. In contrast, SFT focuses upon curating a smaller set of high-quality prompt-response pairs for aligning the LLM. For example, LIMA is a popular paper that aligned an LLM using SFT with a curated dataset of only 1K examples. Recent LLMs use a larger number of samples in the SFT dataset; e.g., Tulu-3 is trained with ~1M SFT examples. Put simply, SFT aligns an LLM by training the model over concrete demonstrations of preferable responses.

In most cases, we can achieve better performance by using a completion-only loss in SFT, meaning that the cross-entropy loss is masked for all prompt tokens and only applied to tokens within the response or completion1. For a more detailed exposition of SFT, please see my prior overview on this topic linked below.

Rejection sampling is an online variant of SFT that is an extremely effective and easy to use. The standard formulation for SFT is offline—we train the model over a fixed dataset of prompt-response pairs. Rejection sampling changes this setup by:

Starting with a dataset of prompts.
Generating completions for each prompt with the current LLM.
Scoring all of these completions using a reward model or LLM judge.
Selecting (or filtering) the top-scoring prompt-completion pairs2.
Performing SFT over these top examples.

The rejection sampling process is depicted below. This approach trains the LLM in a similar fashion to SFT, but the difference lies in the data. We are using the LLM itself to sample SFT training data in a semi-online fashion. The reward model is used to ensure that we are training over the highest-quality completions.

(from RLHF book, license)

We typically perform rejection sampling iteratively. For example, the Llama-2 alignment process uses four rounds of rejection sampling before RL-based RLHF.

In the discussion above, we described rejection sampling as a variant of SFT, since both use the same training objective. However, rejection sampling is actually a preference tuning technique and is most often used as a simpler alternative to RLHF—not as an alternative to SFT. In practice, rejection sampling is usually applied after SFT, rather than in place of it.

(from [13])

SFT variants. Beyond rejection sampling (also called Best-of-N sampling), there are several online or iterative variants of SFT that have been proposed. Some notable examples that we will encounter in this overview include:

Supervised Iterative Learning from Human Feedback (SuperHF) [13] is an online learning technique that samples a batch of on-policy outputs from a model, filters these outputs with a reward model, and optimizes the model using a supervised objective under a KL divergence constraint; see above.
Reinforced Self-Training (ReST) [14] uses the rejection sampling formulation outlined above, in which we iteratively sample on-policy data from the LLM, score each sample with a reward model, and train on the best samples.
Reward-Weighted Regression (RWR) [15] similarly uses the LLM to generate on-policy samples that are scored with a reward model. But, these scores are used to weight each sample in the training loss instead of for filtering.
Reward Ranked Finetuning (RAFT) [16] again adopts the standard rejection sampling setup that samples online completions from the LLM and filters these completions for use in SFT with scores from a reward model.

Reinforcement Learning (RL) Training

(from [16])

There are two different types of reinforcement learning (RL) training that are commonly used to train LLMs (shown above):

Reinforcement Learning from Human Feedback (RLHF) trains the LLM using RL with rewards derived from a human preference reward model.
Reinforcement Learning with Verifiable Rewards (RLVR) trains the LLM using RL with rewards derived from rules-based or deterministic verifiers.

Visual walkthrough of RL training for LLMs

When we are optimizing an LLM with RL, we are trying to solve the objective shown below. This objective maximizes the reward received by the LLM’s completions while minimizing the KL divergence of the model with respect to a reference model—usually an LLM checkpoint from the start of RL training. Put simply, this means that we want to maximize reward without making our new model significantly different from the original (reference) model.

RL training objective

On-policy sampling. As shown above, we perform on-policy sampling when training an LLM with RL. By “on-policy” sampling, we mean that completions used to train our LLM in the core RL training loop are generated in real-time by the LLM itself—the completions are not generated by another model or stored in an offline, pre-computed dataset. In the context of LLMs, training algorithms that use on-policy sampling are typically referred to as “online” training algorithms. On-policy sampling is not only used within the context of RL training; e.g., we learned about several online variants of SFT in the prior section.

More on RLHF. This overview is focused upon LLM alignment, so we will mostly encounter RLHF-style training. Early approaches to LLM alignment used the three-stage technique (shown below) that combines SFT with RLHF.

(from [7])

In RLHF, we begin by collecting a dataset of preference pairs, where each preference pair contains:

A prompt.
A chosen (or winning) completion.
A rejected (or losing) completion.

We then train a reward model over the preference dataset and optimize our LLM with the RL training loop described above. The completions in this preference dataset can come from a variety of sources; e.g., the reference model, prior model checkpoints, or even completely different models. The preference annotation—or selection of the chosen and rejected completion in the pair—is usually provided either by a human annotator or LLM judge (i.e., AI feedback). Notably, the preference data and reward model are fixed at the beginning of RL training. Making this a bit more formal, LLMs are trained with a variant of offline model-based RL.

RL optimizers. There is one detail missing from the above explanation of RL training: how do we compute the policy update? We will briefly address this question here, but interested readers should see this in-depth overview for full details. Usually, a policy gradient-based RL optimizer (e.g., REINFORCE, PPO, or GRPO) is used. PPO-based RLHF has been the de facto choice in the past, but PPO is computationally expensive due to estimating the value function with an LLM. In fact, PPO-based RLHF stores four different copies of the LLM during training (i.e., the policy, reference policy, value model, and reward model).

To reduce overhead, REINFORCE derives a monte carlo estimate of the policy gradient by approximating the value function with an average of rewards received by the model throughout training (i.e., instead of with an LLM). In a similar vein, GRPO approximates the value function with an average of rewards from multiple completions to the same prompt—referred to as a group. Because GRPO is the most common RL optimizer for RLVR, it is also commonly used without a reward model. In this case, we only store two copies of the LLM—the policy and reference policy—for RL training. However, the lack of a reward model is a byproduct of RLVR (i.e., GRPO can be used with or without a reward model).

Direct Alignment Techniques

(from [18])

Because online RL training is so expensive, researchers have also proposed offline alignment techniques like direct preference optimization (DPO) [18]. Compared to PPO-based RLHF, DPO avoids training an explicit reward model and instead derives a reward signal implicitly from the LLM itself. Using this implicit reward, the LLM is trained with the contrastive learning objective shown below, which can be optimized using standard gradient descent (i.e., without any RL training).

DPO training loss (from [18])

Intuitively, this contrastive loss increases the probability margin between chosen and rejected responses in a preference dataset. The LLM is trained on a fixed preference dataset—the same data that is used to train the reward model in RLHF. For this reason, DPO is characterized as an offline—meaning the training data is fixed and there is no on-policy sampling—direct alignment algorithm. Compared to RL-based alignment algorithms, DPO requires much less computational overhead, is easier to tune, and still tends to perform well; see below for more details.

Variants of DPO. Because DPO was so much simpler to use relative to PPO-based RLHF, this technique quickly became popular within LLM research. As a result, many variants of DPO were proposed, such as Identity Preference Optimization (IPO) [8], Kahneman-Tversky Optimization (KTO) [19], or Contrastive Preference Optimization (CPO) [20]. Many of these techniques make slight modifications to DPO that yield a mild boost in performance, but the core idea behind them—in terms of using direct alignment with a contrastive objective—is similar. Some of these techniques, however, are meaningfully different from DPO; e.g., KTO formulates a DPO-style loss than can be applied to a single completion with a binary (good or bad) rating as opposed to a preference pair.

Online or iterative DPO. In its standard formulation, DPO is a completely offline alignment algorithm. The preference dataset is fixed throughout DPO training, but we can create online (or semi-online) DPO variants by introducing on-policy samples into the training process. As depicted below, one example of this idea is self-rewarding language models [10]. In this framework, we periodically sample fresh data for DPO training as follows:

Start with a set of prompts.
Sample multiple completions to these prompts with the current LLM.
Rank these completions (e.g., using an LLM judge or a reward model) to create a preference dataset.
Train the LLM over this data using DPO.
Return to step one and repeat for several rounds.

In this process, we iteratively train the model with DPO, but the training data is periodically re-sampled from the current policy—this is a semi-online training setup. We can make this approach more on-policy by sampling completions from the current policy more regularly. In fact, we can even create a fully-online DPO variant by sampling on-policy completions for every batch of training data!

(from [10])

The Online-Offline Performance Gap

Although PPO-based RLHF was the standard choice for LLM alignment for some time, this approach is expensive, complex, and difficult to replicate outside of top LLM labs. As a result, researchers have developed a variety of simpler alignment algorithms based on offline and RL-free training strategies. In this section, we aim to answer the following question: Does using offline alignment techniques come at a cost in performance? To address this, we will review a range of papers that study the impact of offline training, the use of on-policy samples, contrastive training objectives, and other factors on LLM performance.

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [6]

“Experiment results demonstrate that PPO is able to surpass other alignment methods in all… Particularly, in the most challenging code competition tasks, PPO achieves state-of-the-art results.” - from [6]

We see several different avenues of comparing PPO-based RLHF and (offline) DPO in [6], including theoretical analysis, synthetic experiments and practical training of LLMs. The goal of this work is to find and explain the limitations of DPO in LLM alignment. First, authors confirm there is a performance gap between DPO and PPO-based RLHF. Then, they provide analysis that uncovers the key reason for this trend—the performance of DPO is significantly impacted by the presence of out-of-distribution examples in its underlying preference dataset.

Reward hacking. When training an LLM with PPO-based RLHF, we generate completions to prompts in our prompt dataset in an online fashion and score them with a reward model. Given that our reward model is an LLM that is trained over a fixed (and biased) preference dataset, this model is an imperfect proxy for the actual, ground-truth reward—it can make mistakes in the scores that it provides! Going further, the LLM being trained by PPO can also learn to exploit these mistakes by finding a way to erroneously maximize rewards provided by the reward model without actually meeting human preference expectations.

This phenomenon—commonly referred to as “reward hacking”—has a long history of study within the RL literature. However, we see in [6] that similar issues can occur even when using RL-free, offline alignment algorithms like DPO. In particular, authors make the statement quoted below, which tells us that:

Any solution found by PPO also minimizes the training objective for DPO (i.e., the set of solutions to PPO is a subset of the solutions to DPO).
It is possible for PPO to find erroneous (or reward-hacked) solutions.
Therefore, the same erroneous solutions can also be discovered with DPO.

Given a ground-truth reward r and a preference dataset D, let Π_PPO be the class of policies induced by training reward model R_Φ over D and running PPO. Let Π_DPO be the class of policies induced by running DPO. We have the following conclusion: Π_PPO is a proper subset of Π_DPO. - from [6]

Due to not using an explicit reward model, DPO cannot be reward hacked in a similar manner to PPO. However, DPO still suffers from similar issues with out-of-distribution data in a different manner. Specifically, DPO learns a bias towards unseen—or out-of-distribution—completions as explained below.

“DPO can develop a biased distribution favoring unseen responses, directly impacting quality of the learned policy… DPO is prone to generating a biased policy that favors out-of-distribution responses, leading to unpredictable behaviors.” - from [6]

This bias is most pronounced when there is a large distribution shift between the reference model used in DPO and the model used to generate completions within the preference dataset. Ideally, these completions should be generated with the reference model used in DPO. While online algorithms like PPO generate on-policy completions during training, offline algorithms like DPO are trained over a fixed preference dataset, where completions can come from an arbitrary LLM.

Synthetic example. To validate DPO’s issues with out-of-distribution data, a simple synthetic training example is constructed in [6]. In this setup, the policy is a basic multi-layer perceptron that takes a one-hot vector as input (i.e., the prompt) and produces an eight-dimensional categorical distribution as output3. We assume that the optimal policy is diagonal as illustrated in the plots below.

(from [6])

Using this toy setup, we can create synthetic preference datasets that purposely omit certain preference pairs from the training data, thus testing the behavior of both DPO and PPO in handling out-of-distribution data. As shown above, PPO handles this coverage issue correctly and recovers the optimal policy. In contrast, DPO incorrectly learns to assign high probability to data that is out-of-distribution, which validates—at least at a small scale—the argument in [6] that DPO develops an erroneous bias towards out-of-distribution data in the preference dataset.

Practical experiments. Following this synthetic test, larger-scale preference tuning experiments are performed with various Llama-2-derived LLMs on the SafeRLHF dataset. Experiments begin with an SFT model trained on the Alpaca dataset, creating a distribution shift between the SFT model and the preference data—completions in the Alpaca dataset are much different than those of SafeRLHF.

(from [6])

As shown above, using the Alpaca SFT model directly as the starting point for DPO performs poorly, but performance improves drastically when we first finetune the Alpaca SFT model over preferred completions in the SafeRLHF dataset prior to performing DPO training. These results indicate that a distribution shift between the reference model and preference data in DPO is indeed detrimental to LLM performance in practical alignment scenarios. Notably, the approach of running additional SFT over preferred completions in the preference dataset prior to DPO was also recommended in the original DPO paper [1]!

“We generate new responses with SFT (Safe) and use a learned reward model for preference labeling. We further repeat this process and iteratively set the reference model as the latest DPO model in the last iteration.” - from [6]

A new approach for avoiding out-of-distribution data via iterative DPO is also proposed in [6]. We can run several rounds of DPO, where at each round we use the current reference policy to generate fresh completions that are automatically scored by reward model to create a preference dataset. After each round, our current policy becomes the new reference policy, and we repeat this process, thus ensuring there is no distribution shift between our reference policy and the preference dataset. Using this approach, we can train a model with comparable safety (but not helpfulness) ratings to those obtained with PPO, thus narrowing the performance gap between online and offline alignment algorithms.

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [7]

By conducting a comprehensive study that covers nearly every possible alignment strategy for an LLM, authors in [7] discover two key characteristics that create a successful alignment algorithm:

The use of on-policy sampling.
The presence of a “negative gradient” that decreases probability of bad responses; i.e., instead of only increasing the probability of good responses.

For example, SFT purely trains the LLM using a maximum likelihood objective over a set of high-quality completions, while DPO leverages a contrastive objective that both i) increases the probability of the chosen response and ii) decreases the probability of the rejected response. However, the training data is fixed for both of these strategies—they perform no on-policy sampling. We can fix these issues by using an online RL algorithm like PPO or adopting an iterative DPO strategy that periodically samples new data from the current policy.

“Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses outperform offline and maximum likelihood objectives.” - from [7]

We also learn in [7] that on-policy sampling and negative gradients are most useful in difficult alignment cases, where the responses that receive high rewards are unlikely within the reference policy. In such cases, the alignment process must train the LLM by “moving” probability mass away from low-reward responses and toward high-reward responses. Offline and purely supervised alignment methods perform especially poor in these complex scenarios.

Alignment algorithms. Authors in [5] begin by characterizing a wide set of potential alignment algorithms (shown below) based on their use of on-policy sampling, negative gradients, and sample reuse (i.e., performing multiple gradient updates over the same data). As a concrete example of sample reuse, PPO executes two to four sequential gradient updates over each batch of training data, while GRPO4 and REINFORCE typically avoid such sample reuse [9].

(from [7])

All SFT and rejection sampling variants lack the negative gradient that is present in RL-based and direct alignment methods, where we explicitly decrease the probability of responses that are either rejected (for direct alignment) or receive a low reward (for RL). Finally, on-policy sampling may or may not be used by most techniques depending on the training setup. Direct alignment methods like DPO or IPO run contrastive training on a fixed preference dataset with no on-policy sampling, but we can create an online version of an offline algorithm by periodically sampling new training data from the current policy. However, some algorithms like PPO and REINFORCE are naturally based upon on-policy sampling.

Unified alignment algorithm. To capture the scope of possible alignment algorithms, authors in [7] create the framework shown below. This framework enables the systematic study of different settings within the underlying alignment algorithm. For example, steps one and two can be performed either:

With on-policy data collection (i.e., by generating responses from the current policy and automatically scoring them with a reward model).
By directly using offline preference data without any on-policy sampling (e.g., as in standard DPO).

Going further, we can vary the extent of on-policy sampling by changing the total number of samples B or varying total gradient steps T performed on a set of samples. Notably, increasing T introduces sample reuse while increasing B does not, thus allowing us to isolate the impact of reusing on-policy samples.

(from [7])

Notably, this unified algorithm does not capture any of the maximum likelihood alignment algorithms, though these algorithms are still considered in [7].

Training setup. The properties of these different alignment algorithms are analyzed using several experimental setups including:

Small-scale (didactic) bandit problems.
Synthetic LLM problems.
Full-scale LLM alignment.

In the synthetic alignment scenario, we use hand-crafted rewards based on the length of the LLM’s response. Specifically, two reward settings are considered—minimizing the response length and matching the average response length; see below. These reward scenarios test cases in which high-reward responses lie both within and outside of the region of probable completions for the reference policy5.

(from [7])

The didactic bandit problems also test multiple reward setups that change the optimum of the reward function. By changing the reward setup, we test each algorithm’s ability to assign probability to high-reward responses, even if these responses have low probability in the original reference policy; see above.

“The optimum of the reward function R1 is located in low likelihood regions of the reference policy, whereas the optimum of R2 is roughly aligned with the mode of the reference policy. We hypothesize that on-policy sampling will be crucial to optimize reward function R1, whereas offline or maximum likelihood methods could be sufficient for the optimization of R2.” - Bandit problem description from [7]

The full-scale alignment scenario uses public preference data from AlpacaFarm, UltraChat and UltraFeedback to align smaller-scale LLMs like Pythia-1.4B and Mistral-7B. This training setup is a more standard LLM alignment scenario, and models are evaluated using a golden human preference reward model.

The role of on-policy sampling. We learn from experiments in [7] that sampling on-policy data more frequently and in smaller batches—the most strictly on-policy setup possible—leads to the best performance. The impact of on-policy sampling is most noticeable in complex alignment scenarios, where high-reward responses do not already lie within the probable region of the reference policy.

“[We] observe strong and clear trends supporting that on-policy sampling with a smaller but frequently sampled batch results in better performance… The performance degradation with more off-policy updates is substantially milder for 𝑅2, indicating that when the peak in the reward function lies in the likely regions of the reference policy, a higher degree of off-policy updates is tolerable.” - from [7]

In simper alignment cases where responses that receive high rewards are already probable within the reference policy, the model can better tolerate the use of offline training algorithms. This phenomenon is confirmed in both synthetic and didactic problem setups. Additionally, we observe the same trend in full-scale LLM alignment experiments, where the highest reward comes from decreasing the batch size B to make the training process more on-policy; see below.

(from [7])

The negative gradient. Similarly to on-policy sampling, the use of a negative gradient is found to benefit alignment. Algorithms that employ a negative gradient have a noticeable boost in performance relative to those that do not, especially in difficult alignment cases where we must increase the probability of responses that were originally assigned low probability by the reference policy. As shown below (top figure), algorithms that employ a negative gradient increase the probability margin between chosen and rejected responses during training. Such a trend is not observed for algorithms that lack a negative gradient.

(from [7])

Interestingly, however, we see above (bottom plot) that the absolute probability of both chosen and rejected responses actually decreases during training despite an increasing margin. This same trend has also been observed in other papers [8].

(from [7])

On-policy sampling and negative gradients yield compounding benefits when used in tandem. For example, on-policy IPO and DPO have faster convergence and better performance compared to offline variants in both didactic bandit and synthetic LLM experiments; see above. In full-scale LLM experiments, online versions of contrastive alignment algorithms outperform PPO in some cases despite having lower computational costs and wall-clock training time.

Is sample reuse detrimental? Substantially increasing the value of T would trivially degrade performance due to the introduction of off-policy data into the training process. However, moderate settings of T could allow the model to incorporate off-policy updates into the training process without causing a large drop in performance. For example, the synthetic LLM setting with PPO has no noticeable degradation in performance when increasing T from 1 to 8; see below.

(from [7])

Maximum likelihood training objectives like rejection sampling (called Best-of-N in the figure above) are more sensitive to sample reuse6 but can still achieve good results with more moderate settings of T. Put simply, these results show that off-policy updates from sample reuse do not seem to hurt an LLM’s performance.

(from [7])

The key takeaways from alignment experiments in [7] are depicted in the figure above and can be summarized as follows:

On-policy sampling is crucial for high-quality alignment, especially if responses with optimal reward are not likely in the reference policy.
Moderate amounts of sample reuse can introduce off-policy updates without causing a noticeable deterioration in alignment quality.
The use of negative gradients leads to faster convergence and has a complementary benefit to on-policy sampling.
For simple alignment cases where the peak in rewards is already likely in the reference policy, fully offline or supervised methods—which use no on-policy sampling or negative gradient—can still perform well.

Each of these key points are also captured by the practical alignment takeaways presented in [7], which have been copied below for easier reference.

(from [7])

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [2]

(from [2])

In [2], authors perform an empirical comparison between online and offline RL algorithms—PPO-based RLHF and DPO in particular—for aligning medium-scale LLMs. This analysis tries to maximize the performance of a single LLM across a wide set of benchmarks spanning several domains by varying:

The type, source or scale of preference data being used.
The style of training algorithm (i.e., offline or online).

Additionally, several hyperparameter settings and training setups are considered for improving the performance of PPO-based RLHF, providing useful intuition for maximizing results with online RL. From this analysis, we learn that:

The choice of preference data has the greatest impact on LLM quality—data quality and composition are the key determinants of success in alignment.
Online RL algorithms consistently outperform offline algorithms like DPO.

(from [2])

The experimental setup in [2] adopts a standard approach for both PPO-based RLHF and DPO; see above. All experiments use Tulu-2-13B [3] as the starting model for both DPO and PPO. After preference tuning, models are evaluated over a wide set of benchmarks that measure performance in the following domains:

Factuality (e.g., MMLU)
Reasoning (e.g., GSM8K)
Truthfulness (e.g., TruthfulQA)
Coding (e.g., HumanEval+)
Safety (e.g., ToxiGen)
Instruction following (e.g., IFEval)

From these diverse benchmarks, we can observe the performance of models in individual domains, as well as their general performance across domains.

Data selection. Building upon recent work that leverages synthetic preferences for LLM alignment [4], we can derive preference data from three sources:

Human preferences.
Web scraping7.
Synthetic preferences.

Interestingly, we learn in [2] that synthetic preference datasets—and the UltraFeedback dataset in particular—yield the best results, even compared to human-annotated preference data. Going further, authors in [2] specifically mention the following important considerations for curating preference data:

The quality of preferences (i.e., the choice of chosen or rejected completion within a preference pair) is actually more important than the quality of the completions themselves.
Collecting per-aspect preference feedback yields a clear performance benefit—models trained on aggregated, per-aspect preferences outperform those trained on 15x the amount of standard preference data.
With the data that was considered in [1], preference tuning has the biggest impact on improving chat capabilities and output style, but the model does not seem to learn new facts or information.

Per-aspect preference feedback is collected by asking a human or model to score each aspect of the data (e.g., helpfulness and harmlessness) independently, then aggregating these per-aspect scores to yield a final, aggregated preference score. Compared to just asking annotators for a single overall preference score, such an approach is found to improve the quality of preference feedback, which in turn improves the quality of resulting models after preference tuning. Authors in [2] consider various factors that impact the quality of post-training, but the source and quality of preference data are found to have the most significant impact.

“PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness.” - from [2]

PPO vs. DPO. When directly comparing models trained with an online or offline approach, we see in [2] that online training algorithms have a clear edge. In fact, nearly all models trained with PPO-based RLHF across all datasets are found to outperform those trained with DPO using identical settings. Results in [1] provide clear evidence that online RL benefits preference tuning for LLMs; see below.

(from [2])

Why is online training so beneficial? The answer to this questions is complex and multi-faceted, but authors in [2] make an interesting observation regarding the difference between models trained with DPO and PPO. Namely, PPO models are far more likely to perform chain-of-thought reasoning for solving complex problems, even without being provided any examples of this behavior.

“Models trained with PPO are far more likely than DPO-trained models to perform chain-of-thought reasoning… even when not given in-context examples using chain-of-thought. This suggests that reasoning improvements from PPO may be due to increased chain-of-thought abilities.” - from [2]

Such behavior would be impossible for an LLM to learn with offline algorithms like DPO, as the completions from which the model learns are fixed within the preference dataset. On the other hand, PPO is able to learn such new behaviors because completions are sampled online during training, allowing the model to explore—and learn from—new behaviors like chain-of-though reasoning.

(from [2])

Other factors in online RL. Beyond the analysis of offline and online algorithms in [2], authors perform various other ablations to determine key factors to success in PPO-based RLHF. For example, increasing the size of the reward model—and the size of the preference dataset over which the reward model is trained—is found to improve the quality of the reward model. However, the impact of a better reward model on downstream evaluation benchmarks (i.e., after training the LLM with PPO-based RLHF) is less clear. The main performance benefits are observed in more complex domains like reasoning. Seemingly, a more powerful reward model is only impactful in challenging domains that actually require a better reward model.

“If we’re using a bigger reward model, we need to have data that is actually challenging the reward model.” - source

We can also boost the performance of the LLM in specific domains by curating a targeted prompt dataset for PPO that focuses on that domain—this is a unique benefit that can be exploited by PPO but is not possible in offline algorithms like DPO. However, such an approach does not yield performance improvements in general—it is only useful for tailoring the LLM to specific domains like math; see below.

(from [2])

The best training recipe. To conclude their analysis, authors in [2] emphasize the following aspects of LLM alignment:

The importance of preference data quality.
The superiority of online RL.
The benefit of better reward models in complex domains like reasoning.
The ability of targeted prompt datasets for PPO to curate an LLM’s performance to a particular domain.

The optimal approach for performing LLM alignment—as discovered by experiments performed in [2]—is summarized by the quote below.

“We take a high-quality, synthetic preference dataset, a large reward model, and train it using PPO. If we additionally wish to focus on a specific domain, we can additionally collect domain-specific prompts for policy training.” - from [2]

Understanding the performance gap between online and offline alignment algorithms [5]

“We show that on a suite of open source datasets, online algorithms generally outperform offline algorithms at the same optimization budget of KL divergence against the SFT policy” - from [5]

Authors in [5] analyze the importance of on-policy samples for aligning LLMs with RLHF. To begin, a clear gap in performance is demonstrated between online and offline alignment algorithms. Several intuitive explanations are proposed for this performance gap and investigated one-by-one via targeted data ablations. We learn from these experiments that the use of on-policy sampling seems to be the key performance differentiator for online alignment algorithms.

IPO loss function (from [8])

Experimental setup. All experiments in [5] evaluate models based on their win rate against a fixed policy and use the Identity Preference Optimization (IPO) algorithm, which uses the contrastive loss function shown above, for training. This algorithm is similar in nature to DPO. It can be used to align LLMs in an online or offline manner depending on how the training data is sampled.

(from [5])

Specifically, we can use IPO in an online fashion by sampling on-policy data from the current policy during training, automatically scoring these completions with a reward model, and training the model over these online samples via the IPO training objective outlined above. A depiction of the differences between online and offline IPO is provided above. Online IPO is used as the online alignment technique in [5] instead of PPO-based RLHF for a few different reasons:

Implementing PPO is complex and expensive due to the requirement of an additional value function.
There is no clear way to formulate the PPO optimization process in an offline manner (though DPO was derived as an offline equivalent of PPO).
As discussed above, formulating IPO in either an online or offline fashion is relatively straightforward.

Given that PPO-based RLHF is the most widely-used online alignment algorithm, this choice to purely rely upon contrastive learning objectives is a clear deviation from mainstream alignment research. Additionally, analysis in [5] is performed over smaller (i.e., <1B parameter) models. Despite these issues, the learnings from this work still provide useful intuition that helps us to better understand the key distinctions between online and offline alignment algorithms.

Relative to offline algorithms, online alignment algorithms perform inference during training and require an additional training procedure for the reward model. For these reasons, we cannot compare online and offline algorithms based on their total compute budget—offline alignment in general will always be much cheaper. Instead, authors in [5] choose to compare policies in terms of their KL divergence from the SFT model, capturing how much the model changes during the alignment process (i.e., an optimization “budget”) in a compute-agnostic manner.

“Online algorithms tend to be more computationally intensive than offline algorithms, due to sampling and training an extra reward model… we do not prioritize compute as a main factor during comparison, and instead adopt the KL divergence between the RLHF policy and reference SFT policy as a measure of budget.”- from [8]

Comparing online and offline RL. To begin their analysis, authors present the results of online and offline alignment depicted in the figure below. Here, we see that there is a clear gap between the performance of models trained with online and offline alignment algorithms across all levels of possible KL divergence. These results are consistent across several different open alignment datasets8.

(from [5])

Based on the observed superiority of online alignment, authors in [5] propose the following potential explanations for the existence of this performance gap:

Data coverage: online algorithms outperform offline algorithms simply because they have more diverse data than the offline algorithm.
Sub-optimal data: offline algorithms perform worse because the completions in their dataset are generated by the SFT policy and are, therefore, of lower quality compared to on-policy samples generated during alignment9.
Better classification: offline algorithms train the policy to classify preferred completions in a preference pair, while online algorithms accomplish this via an explicit reward model. The performance gap may be due to the online algorithm’s explicit reward model performing this classification more accurately relative to the offline policy.
Contrastive loss: the contrastive objective used by offline algorithms like IPO and DPO—not their lack of on-policy sampling—may lead to the performance gap with online algorithms.
Scaling laws: the performance gap could potentially disappear as we scale up the size of the underlying policy.

Next, each of these hypotheses is studied in a series of ablation experiments that deeply analyze the difference between online and offline algorithms.

Data coverage. To study the impact of data coverage on alignment quality, we can collect all of the completions generated via on-policy sampling during online training to form a dataset for offline alignment. If we preserve the exact order in which this data was sampled, then online and offline alignment are identical—the models see the same data in the same order and, therefore, receive the same parameter updates. If we shuffle this data and use it for offline alignment, however, we see in [5] that this new data does not yield noticeably better results. As shown in the figure below, the offline algorithm performs similarly using an offline dataset and the shuffled dataset generated via on-policy sampling during online training.

(from [5])

These results show that improving data coverage is not enough to overcome the performance limitations of offline alignment—data ordering is also important. However, this ordering need not be perfect. As we gradually increase the amount of shuffling in the on-policy samples, model performance remains stable up to a point, then rapidly deteriorates to the level observed with offline alignment.

“Offline algorithms, even when augmented with the same data coverage as the online algorithm, cannot obtain the same level of performance. This alludes to the importance of the exact sampling order obtained via on-policy sampling by a constantly evolving policy.” - from [5]

Sub-optimal data. We can easily test the impact of data quality on offline alignment algorithms by generating a preference dataset using a policy that is known to be high-quality. In [5], authors generate an offline training dataset using the final policy obtained via online alignment. When policies are trained over this dataset, there is only a slight improvement in quality; see below. Such a result indicates that the limitations of offline alignment algorithms are not purely due to the presence of lower-quality completions in their preference datasets.

(from [5])

Classification accuracy. Authors in [5] demonstrate that explicit reward models used by online alignment algorithms achieve higher preference classification accuracy compared to the implicit reward estimate of an offline policy. However, little correlation is found between preference classification accuracy and model performance; in fact, the only observed correlation is slightly negative. Based on these findings, the authors conclude that the superior preference classification accuracy of online algorithms' explicit reward models is unlikely to be the primary factor behind the improved performance of online alignment methods.

Contrastive objective. To study whether the sub-par performance of offline alignment algorithms stems from their use of a contrastive loss function, authors derive a non-contrastive loss for offline alignment called Best-of-2. Put simply, the Best-of-2 training algorithm takes chosen completions for each preference pair in a dataset and runs SFT over these completions. When we train a model using the Best-of-2 loss over our standard offline preference dataset, there is no noticeable change in performance. However, adding online samples to Best-of-2 training—even when these samples are shuffled to remove the ordering from online alignment—nearly closes the performance gap with online techniques; see below.

(from [5])

Such a result clearly demonstrates that data coverage is the key indicator of success for SFT, which motivates the inclusion of on-policy samples in SFT (i.e., rejection sampling). We can achieve impressive alignment results by simply including some level of on-policy data in offline training algorithms, forming practically effective LLM alignment baselines that are easy to implement.

Does scaling up help? Authors end their analysis in [5] by studying the impact of model scale on the gap between online and offline alignment algorithms. In these experiments, we see that the gap between offline and online algorithms:

Decreases at larger scales.
Is more heavily related to data coverage at large scales.

More specifically, training larger models over a shuffled dataset of on-policy samples nearly closes the online-offline performance gap; see below. Such a finding did not hold in data coverage experiments with smaller models.

(from [5])

Key takeaway. The detailed alignment analysis in [5] leaves us with one key finding: on-policy sampling is important for high-quality alignment. There are many alternative explanations for the superiority of online alignment algorithms (e.g., data coverage or quality). However, these theories are debunked—at least at a smaller scale—by the many data ablations in [5], revealing that on-policy samples are the key contributor to the online-offline performance gap. This finding is very powerful, as it allows us to rethink the data sampling process used for offline alignment algorithms—we can improve the performance of offline techniques by incorporating (semi-)online data samples as described below!

“The dichotomy of online vs. offline is often inaccurate in practice, since an offline algorithm with a repeatedly updated data stream is effectively an online algorithm. As a result, offline learning can be made less likely to suffer from the shortcomings identified in this work, by being more careful with the data generation process in general.” - from [5]

Bridging Offline and Online Reinforcement Learning for LLMs [9]

“We study offline, semi-online, and online configurations, across both verifiable and non-verifiable tasks. By examining the transition from offline to online training (i.e., by altering the speed of periodic model syncing), we aim to understand how these methods can be optimized for improved performance and efficiency.” - from [9]

To granularly study the relationship between online and offline RL, authors in [9] finetune LLMs while smoothly transitioning the training process from an offline to an online setting. In other words, we bridge the gap between online and offline RL by testing training techniques that fall in the middle. By performing such tests over both verifiable (e.g., math) and non-verifiable domains (e.g., chat or instruction-following), we can gain an understanding of how on-policy sampling impacts the RL training process. More specifically, when we compare an on-policy GRPO setup to offline, semi-online, and on-policy variants of DPO, we learn that:

Online and semi-online techniques significantly outperform offline training.
Semi-online DPO nearly matches the performance of online DPO.

Put simply, we learn in [9] that online training is beneficial to model performance, but we can reap much of this benefit with a more efficient, semi-online approach.

Online, semi-online, and offline. For experiments in [9], authors train the Llama-3.1-8b-Instruct model using both on-policy GRPO and several variants of DPO. Specifically, we can create variants of DPO with varying degrees of on-policy sampling by defining a period s such that the policy being trained is used to generate fresh on-policy samples for DPO every s training iterations. In other words, we sync the parameters of the policy being trained and the policy used to sample completions for our preference data every s parameter updates; see below.

(from [9])

Notably, iterative forms of DPO—where we generate a new set of completions for training with the current model at each iteration—have been explored by prior work [10, 11]. However, these methods usually perform rough iterations, where new completions are sampled relatively infrequently. By varying the setting of s, we can explore arbitrary granularities of semi-online DPO, even including a fully on-policy DPO setting where s = 1. Put simply, we can bridge the gap between offline, semi-online, and online DPO by slowly decreasing s from ∞ to 1.

Experimental setup. Experiments are conducted in two possible domains:

A non-verifiable domain where training data is drawn from WildChat-1M 10 and models are evaluated via LLM judges in terms of their chat capabilities (e.g., using AlpacaEval and Arena-Hard).
A math-focused, verifiable domain where training data is drawn from the NuminaMath dataset and evaluation is performed on several verifiable math benchmarks (e.g., Math500 and AMC23).

In the verifiable domain, the reward signal is obtained using the Math-Verify toolkit rather than exact string matching, which makes the reward more robust to variations in answer format11. The non-verifiable reward is derived from an off-the-shelf human preference reward model—Athene-RM-8b in particular—that is fixed throughout all experiments. To apply DPO in the verifiable domain, we simply generate several responses to each question, then choose a single correct and incorrect answer for each question to form preference pairs.

(from [9])

Is semi-online enough? The results of these experiments on both verifiable and non-verifiable tasks are shown above. Immediately, we see that training with an online or semi-online setup provides substantial gains over offline DPO in both domains. There is a clear performance gap between offline and online methods. But, the gap between online and semi-online settings is much less pronounced. In fact, online and semi-online DPO even outperform on-policy GRPO in some cases! These findings hold true even with relatively large values of s; e.g., in the verifiable domain s is increased to 100 with very promising results12.

“The efficiency gains of the semi-online variants opens up an interesting question of whether fully online RL is the only approach for post-training LLMs.” - from [9]

Such findings have interesting implications for the online-offline performance gap in RL. We see in [9] that there is a clear benefit to online sampling. However, we can potentially approximate this sampling more efficiently via a semi-online setup that intermittently collects fresh data instead of strict on-policy sampling.

Verifiable versus non-verifiable. Experiments are also performed in [9] to explore the interplay between verifiable and non-verifiable rewards, showing that the curriculum (or order) of rewards during RL training is important. If we compare settings in which the LLM is first trained on non-verifiable rewards then on verifiable rewards (NV → V) or vice versa (V → NV), we get better performance by first training on non-verifiable rewards (i.e., NV → V » V → NV).

Training on non-verifiable rewards after the LLM has been trained on verifiable rewards leads to a noticeable performance deterioration in verifiable domains. In contrast, further training on verifiable rewards actually improves the performance of the LLM, even in non-verifiable domains; see above. If we combined both non-verifiable and verifiable rewards within a single training run (V + NV) the model also performs well, revealing that the simplest approach may be just mixing the disparate reward signals into a single, unified training run!

Conclusion

There are many alignment algorithms for LLMs, each varying in complexity and performance. Online algorithms have a clear performance benefit over offline alignment algorithms. In this overview, we have learned that this gap in performance primarily arises due to the use of on-policy sampling in online alignment algorithms, as well as other—arguably less significant—factors like negative gradients. Interestingly, however, we have also learned that much simpler and equally effective alignment algorithms can be derived by including on-policy samples in the training dataset used for offline alignment, forming semi-online algorithms that are practically effective and easy to implement.

New to the newsletter?

Subscribe now

Bibliography

[1] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in neural information processing systems 36 (2023): 53728-53741.
[2] Ivison, Hamish, et al. "Unpacking dpo and ppo: Disentangling best practices for learning from preference feedback." Advances in neural information processing systems 37 (2024): 36602-36633.

[3] Ivison, Hamish, et al. "Camels in a changing climate: Enhancing lm adaptation with tulu 2." arXiv preprint arXiv:2311.10702 (2023).

[4] Tunstall, Lewis, et al. "Zephyr: Direct distillation of lm alignment." arXiv preprint arXiv:2310.16944 (2023).

[5] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." arXiv preprint arXiv:2405.08448 (2024).

[6] Xu, Shusheng, et al. "Is dpo superior to ppo for llm alignment? a comprehensive study." arXiv preprint arXiv:2404.10719 (2024).

[7] Tajwar, Fahim, et al. "Preference fine-tuning of llms should leverage suboptimal, on-policy data." arXiv preprint arXiv:2404.14367 (2024).

[8] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

[9] Lanchantin, Jack, et al. "Bridging Offline and Online Reinforcement Learning for LLMs." arXiv preprint arXiv:2506.21495 (2025).

[10] Yuan, Weizhe, et al. "Self-rewarding language models." arXiv preprint arXiv:2401.10020 3 (2024).

[11] Pang, Richard Yuanzhe, et al. "Iterative reasoning preference optimization." Advances in Neural Information Processing Systems 37 (2024): 116617-116637.

[12] Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).

[13] Mukobi, Gabriel, et al. "Superhf: Supervised iterative learning from human feedback." arXiv preprint arXiv:2310.16763 (2023).

[14] Gulcehre, Caglar, et al. "Reinforced self-training (rest) for language modeling." arXiv preprint arXiv:2308.08998 (2023).

[15] Hu, Jian, et al. "Aligning language models with offline learning from human feedback." arXiv preprint arXiv:2308.12050 (2023).

[16] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training." arXiv preprint arXiv:2411.15124 (2024).

[17] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in neural information processing systems 35 (2022): 27730-27744.

[18] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in neural information processing systems 36 (2023): 53728-53741.

[19] Ethayarajh, Kawin, et al. "Kto: Model alignment as prospect theoretic optimization." arXiv preprint arXiv:2402.01306 (2024).

[20] Xu, Haoran, et al. "Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation." arXiv preprint arXiv:2401.08417 (2024).

[21] Huang, Shengyi, et al. "The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization." arXiv preprint arXiv:2403.17031 (2024).

This can be accomplished using a completion-only loss collator.

There are a few different ways this selection can be performed. For example, we can select the top completion for each prompt, or we can select the top-scoring completions across all prompts; see here for details.

In other words, the output is a vector of size eight to which a softmax function has been applied to form a probability distribution over these eight possible outcomes.

GRPO is not listed in this table due to the fact that both [7] and the GRPO paper [12] were published at very similar times.

The LLM before alignment already generates completions that are near the average length. In contrast, the LLM does not generate minimum (or zero) length completions, so learning to generate such responses requires probability mass to be moved into a new region that was previously unlikely.

This sensitivity is due to the fact that maximum likelihood algorithms do not have any explicit mechanism to protect against off-policy sampling, whereas PPO has the clipping operation and KL divergence that help to maintain the trust region.

As an example of how we can obtain preference data via web scraping, the Stack Exchange Preferences dataset takes questions from Stack Overflow with at least two answers and ranks answers based on implicit feedback (e.g., likes or upvotes).

Specifically, this work uses the OpenAI summarization, Anthropic Helpful and Harmless (hh-rlhf), and the Chatbot arena preference dataset.

We should note that one can make a similar argument against online algorithms! The reward model used in online algorithms is also trained over a fixed dataset, which can lead to similar limitations in the performance of online algorithms.

This is a general chat and instruction-following benchmark that is comprised of ~1M user interactions with ChatGPT.

For example, an LLM could provide an answer of 0.5 or 1/2 to a math question. Both of these answers would be correct, but one of them would likely be marked as wrong if we are verifying our reward via exact string match. For this reason, using a more robust validation system for mathematical expressions is helpful.

The value of s is much larger in the verifiable domain compared to the non-verifiable domain. Authors in [9] make this choice because the non-verifiable dataset is small and a setting of s = 32 spans a full epoch over the data. Therefore, the training process is not stable with larger values of s in the non-verifiable domain.

GPT-oss from the Ground Up

Cameron R. Wolfe, Ph.D. — Mon, 18 Aug 2025 09:33:11 GMT

(from [18, 20, 21])

Recently, OpenAI released GPT-oss [1, 2]—their first open LLM release since GPT-2 [13] over five years ago. In the time between GPT-2 and GPT-oss, LLM research has undergone a continuous transformation. Many of the key breakthroughs in LLM research during this time have come from OpenAI, but their research is almost always kept internal. GPT-oss provides a rare peek into LLM research at OpenAI. In this overview, we will take advantage of this infrequent opportunity by:

Exhaustively outlining every single technical detail revealed about GPT-oss in the report(s) provided by OpenAI.
Explaining how each of these details work from the ground up1.

This overview is long (probably too long), and it covers a wide variety of loosely related topics in LLM research. However, by taking the time to work through each of these topics, we will gain a deep understanding of how GPT-oss works and, in turn, form a better perspective on the state of LLM research at OpenAI.

GPT-oss at a Glance

“They were trained using a mix of reinforcement learning and techniques informed by OpenAI’s most advanced internal models, including o3 and other frontier systems.” - from [1]

The GPT-oss release includes two different models—GPT-oss-20b and GPT-oss-120b—that are both released with a permissive Apache-2.0 license. These are Mixture-of-Experts (MoE)-based reasoning models that are text-only and trained primarily on English data. Due to their MoE architecture and use of quantization-aware training, these models are compute and memory efficient. The 20b and 120b models have 5b and 3.5b active parameters respectively. Using MXFP4 (~4-bit) precision, the larger model can be hosted on a single 80Gb GPU, while GPT-oss-20b needs only ~16Gb of memory for hosting. These models are extensively post-trained to optimize their chain of thought (CoT) reasoning and safety.

Emphasis on agents. Both GPT-oss models are optimized for agentic workflows with a (reasonably) long context window of 131k tokens, as well as strong tool use, reasoning and instruction-following capabilities. To handle patterns from agentic workflows (e.g., function calling, tool use, reasoning, structured outputs, and more) more seamlessly, OpenAI released the new Harmony prompt format—a flexible, hierarchical chat template capable of capturing diverse LLM interaction patterns—for training and interacting with GPT-oss. The GPT-oss models also provide the ability to adjust their reasoning effort (i.e., to low, medium or high effort levels) by explicitly specifying an effort level in their system message.

(from [1])

Internal evaluations. Evaluations released by OpenAI reveal that GPT-oss-120b performs comparably to o4-mini, while GPT-oss-20b performs similarly to o3-mini; see above. Additionally, OpenAI heavily emphasized the strong capabilities of these models on health-related tasks—based on evaluations from their newly-released HealthBench—during the release; see below. However, GPT-oss models still fall short of the performance of the full o3 model on this benchmark.

(from [1])

As should be expected, OpenAI also highlights that the GPT-oss models obey the usual inference-time scaling laws with respect to their reasoning effort. Model performance improves as the models generate progressively longer reasoning traces—and therefore consume more compute—during inference; see below.

(from [1])

Public reception. After making their way around the open LLM community, the GPT-oss models have received mixed feedback. For example, some users have pointed out that these models have a high hallucination rate, while others say that the models are actually pretty good after initial hiccups related to model setup were fixed. Other common criticisms of the GPT-oss models include over-refusal of prompts, difficulty with properly setting up model quantization, and the Harmony prompt format being overly complex or hard to use. Put simply, the perception seemed poor at first, but slowly improved as lingering issues in common tools like ollama, llama.cpp, and unsloth and were resolved.

The reality of GPT-oss is somewhere in the middle of the polarizing and clickbaity reactions online. These are (obviously) not the best models ever, but they are open weights models released by one of the top LLM labs in the world. Given that few of the top American LLM labs (other than AI2, Cohere and Meta) are actively releasing open weights models, we would be foolish to not try out these models and gain a deep understanding of how they work. So, let’s start diving into the relevant technical details provided by OpenAI on GPT-oss.

Model Architecture

“The GPT-oss models are autoregressive Mixture-of-Experts (MoE) transformers that build upon the GPT-2 and GPT-3 architectures.” - from [1]

We will first cover the model architecture of the GPT-oss models. This discussion will start with a basic understanding of the transformer architecture2. From here, we will outline each unique component of the GPT-oss architecture with a from-scratch explanation. For further reading on this topic and comparison to other open models, see the great overview from Sebastian Raschka below.

Ahead of AI

From GPT-2 to gpt-oss: Analyzing the Architectural Advances

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later…

9 months ago · 169 likes · 17 comments · Sebastian Raschka, PhD

Transformer Structure

Decoder-only transformer architecture

A depiction of a standard, decoder-only transformer architecture is provided above. This architecture is used almost universally by modern GPT-style LLMs.

Embedding dimension. The input to this model is a sequence of token vectors, produced by tokenizing and embedding our textual input (or prompt). In the case of the GPT-oss models, these vectors have a fixed dimension of 2,880, and this same embedding dimension is maintained through every layer of the LLM.

Block structure. The decoder-only architecture is comprised of repeated decoder blocks—GPT-oss models contain either 24 (GPT-oss-20b) or 36 (GPT-oss-120b) of these blocks. As we can see above, each decoder block has the same key components: normalization, masked multi-headed self-attention, feed-forward transformation, and residual connections. The GPT-oss models adopt a pre-normalization structure, which is the most common choice in current LLM architectures3. This means that the normalization layers in the decoder block are placed before both the attention and feed-forward layers, yielding the following structure:

Decoder Block Input → Normalization → Masked Self-Attention → Residual Connection → Normalization → Feed-Forward Network → Residual Connection → Decoder Block Output

Although a pre-normalization structure is most common, there is no clear answer in terms of whether pre or post-normalization is superior. In fact, recent work has even shown that post-normalization benefits training stability [3]; see below.

(from [3])

Normalization. Initial transformers used layer normalization as the standard choice of normalization layer. More recently, many LLMs have replaced layer normalization with root mean square layer normalization (or RMSNorm for short) [4], which is a simpler—and more computationally efficient—version of layer normalization that has fewer trainable parameters and performs similarly. GPT-oss models adopt this choice by using RMSNorm in all decoder blocks. See here for an explanation of RMSNorm (and a comparison to layer normalization).

Attention Implementation

Depiction of masked self-attention with a single attention head

Masked self-attention. A masked self-attention operation is depicted above; see here for more details. Most LLMs—including GPT-oss—use multi-headed masked self-attention, meaning that there are multiple self-attention operations running in parallel for each self-attention layer. In the case of GPT-oss models, each self-attention layer has 64 parallel attention heads. Each of these attention heads use vectors with a dimension of 64, meaning that the key, query and value projections (shown above) transform embedding vectors from a size of 2,880 to 64.

(from [6])

Multi and grouped-query attention. Expanding on multi-headed self-attention, prior work has proposed both multi-query [5] and grouped-query attention [6]. As depicted above, instead of having unique keys and values for each attention head, these techniques share the keys and values (but not queries!) between multiple attention heads. For example, multi-query attention has a single set of keys and values that are re-used for all attention heads, while grouped-query attention shares keys and values between fixed-sized groups of attention heads.

“The memory bandwidth from loading keys and values can be sharply reduced through multi-query attention, which uses multiple query heads but single key and value heads. However, multi-query attention (MQA) can lead to quality degradation and training instability.” - from [6]

Sharing keys and queries across multiple attention heads benefits both parameter and compute efficiency, but the biggest benefit of grouped-query attention comes at inference time. There is a reduction in memory bandwidth usage at inference because there are fewer keys and values that we need to be retrieved from the model’s KV cache. Given that memory bandwidth can be a key bottleneck to transformer inference speed, this architectural change drastically speeds up the inference process.

However, we cannot be too extreme with the sharing of keys and values—we see in [6] that having all attention heads share the same key and value vectors degrades performance. Grouped-query attention balances performance with efficiency by sharing keys and values among smaller groups, thus finding a tradeoff between standard multi-headed attention and multi-query attention. Specifically, GPT-oss uses group sizes of eight—meaning that keys and values are shared among groups of eight attention heads—for grouped-query attention in both model sizes.

Sparse attention. Within the decoder blocks of GPT-oss models, we alternate between using dense and locally-banded sparse attention [7] within each block. In masked self-attention, we compute the attention matrix as shown below, where a causal mask is applied that sets all masked values in the attention matrix—those that come after each token in the sequence—to be negative infinity4. This ensures that tokens that should not be considered by the self-attention operation are given a probability of zero after the softmax transformation is applied.

Masking in causal self-attention

Computing self-attention has quadratic—or O(S^2) where S is the sequence length—complexity. Put simply, this means that self-attention becomes computationally expensive when applied to long sequences. When we look at the masking pattern above, however, we might wonder: Does the LLM actually need to look at the entire sequence preceding each token? As proposed by the Longformer [7], we can save compute costs by limiting the window over which self-attention is computed.

Masked versus sliding window attention

This idea (depicted above) is called sliding window attention5 and has been successfully adopted by several LLMs like Mistral and Gemma. We modify our masking matrix to limit the range of preceding tokens that are considered by the self-attention operation. Previously, we only masked tokens that come after each token. Now, we are also masking tokens that are sufficiently far in the past. This idea is referred to as “locally banded sparse attention” in the GPT-oss models [1, 2].

The GPT-oss models replace every other masked self-attention module (i.e., a 1:1 ratio) with sliding window attention. The first attention layer uses dense self-attention, the second layer uses sliding window attention and so on. By adopting sliding window attention in a subset of layers, we improve the efficiency of the model architecture by avoiding the quadratic complexity of self-attention with a smaller, fixed window size. Ideally, this efficiency gain comes without causing a corresponding deterioration in model quality, though this may depend on the exact settings adopted (e.g., the window size or layer ratio).

The window size used in GPT-oss is 128 tokens, which is small compared to other models; e.g., Gemma-2 and 3 use window sizes of 4K and 1K tokens, respectively. However, the 1:1 ratio of dense and sparse attention layers is a conservative choice. In fact, other models have successfully explored significantly higher sparsity ratios. For example, Gemma-3 adopts a 5:1 ratio, meaning that there is one dense attention layer for every five sliding window attention layers.

Attention sinks. As we might recall, the attention matrix within self-attention is computed as shown above. We take the product of the query and (transposed) key matrix. This operation yields an S x S matrix, where S is the length of the sequence over which we are computing self-attention. After masking and dividing the values of this matrix by the square root of the embedding dimension6, we apply a row-wise softmax, forming—for each token in the sequence (or row in the matrix)—a probability distribution over all other tokens in the sequence.

We finish the self-attention operation by multiplying this attention matrix by the value matrix. Practically, this takes a weighted sum of the value vectors for each token, where the weights are given by the attention scores; see below.

Although self-attention works incredibly well in its natural form, there is an interesting problem that arises due to the internal softmax used by self-attention. Namely, the attention scores are forced to form a valid probability distribution—meaning that the attention scores must all be positive and sum to one—over the set of tokens. Therefore, at least one token in the sequence must receive some weight—it is impossible for the model to not pay attention to any tokens.

This property of self-attention can lead to some interesting behaviors from LLMs in practice. For example, prior work [8] has found that LLMs tend to assign high attention scores to semantically meaningless tokens in a sequence. These tokens that spuriously receive a high weight—usually the first token in the sequence—are commonly referred to as “attention sinks”. This empirical observation stems from the LLM’s inability to pay attention to no tokens in a sequence. Additionally, the very high scores assigned by LLMs to attention sinks can lead to practical issues; e.g., such outlier attention values make quantization more difficult.

“We find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task… We term these tokens attention sinks. Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.” - from [8]

To solve this issue in the GPT-oss models, the authors use an approach that is very similar to (though not exactly the same as) the technique described in this blog post from Evan Miller. For each attention head, we create an extra learnable bias that is learned similarly to any other model parameter. This bias appears only in the denominator of the internal softmax operation in self-attention. By setting a high value for this bias in some attention head, the LLM can choose to pay attention to no tokens in a sequence, solving known issues with attention sinks. This approach is explained in the quote below from the GPT-oss model card.

“Each attention head has a learned bias in the denominator of the softmax, similar to off-by-one attention and attention sinks, which enables the attention mechanism to pay no attention to any tokens.” - from [2]

Mixture-of-Experts (MoE)

Both GPT-oss models use a Mixture-of-Experts (MoE) architecture. Compared to the decoder-only architecture, MoEs modify the feed-forward module in each decoder block. The standard architecture has one feed-forward neural network—usually made up of two diamond-shaped7 feed-forward layers with a non-linear activation (i.e., GPT-oss models use the SwiGLU activation in particular [2]) in between—through which every token is passed individually; see below.

Instead of having a single feed-forward network in the feed-forward component of the block, an MoE creates several feed-forward networks, each with their own independent weights. We refer to each of these networks as an “expert”. Starting with a standard decoder-only transformer, the MoE converts the transformer’s feed-forward modules into MoE (or expert) layers, having several independent copies of the original feed-forward network from that layer; see below.

(from [9])

Usually, we do not convert every feed-forward layer in the model to an MoE layer for efficiency reasons. Instead, we interleave the MoE layers by using a stride of P—every P-th layer in the transformer is converted into an MoE layer.

Routing. The primary benefit of MoEs is their efficiency, but using experts alone does not improve efficiency! In fact, the total parameters and compute becomes much larger because we have multiple copies of each feed-forward module. To get an efficiency benefit, we need to add sparsity to this architecture. Let’s consider a single token—represented by a d-dimensional token vector. Our goal is to select a subset of experts (of size k) that will perform a forward pass on this token. In other words, this token will be “routed” to these experts.

The standard way to perform this routing operation is via a linear layer that takes the token vector as input and predicts a vector of size N (i.e., the total number of experts). We can apply a softmax operation to form a probability distribution over the set of experts for each token. Then, this probability distribution can be used to select the top-K experts to which each token is routed, as shown below. Despite its simplicity, this linear routing operation is exactly the approach adopted by OpenAI for the GPT-oss models (from [2]): “each MoE block consists of… a standard linear router projection that maps residual activations to scores for each expert.”

Each token is then sent to its respective expert and we compute the forward pass for each expert over the batch of tokens that have been routed to it. To aggregate the output of each expert, we simply take a weighted average of outputs across all experts, where the weight is given by the probability assigned to each expert by the router. This exact process is used by the GPT-oss models, as described below.

“For both models, we select the top-4 experts for each token given by the router, and weight the output of each expert by the softmax of the router projection over only the selected experts.” - from [2]

Active parameters. Because we select a subset of experts for each token, only part of the model’s parameters are used for processing a given token in the forward pass—some of the parameters are active, while others are inactive. In the case of GPT-oss, the 20b and 120b models have 32 and 128 total experts within each of their MoE layers. However, only four of these experts are active for each token, leading the models to have 3.6b and 5.1b active parameters, respectively. A more detailed breakdown of parameter counts for these models is provided in the table below.

(from [2])

Compared to other notable MoEs, the GPT-oss models are quite sparse; e.g., the 109b parameter Llama-4 model has 17b active parameters. However, this high sparsity level of GPT-oss is common among the best open-source LLMs:

DeepSeek-R1 [10] has 671b total parameters and 37b active parameters.
Qwen-3 [11] MoE models have 30b total parameters and 3b active parameters or 235b total and 22b active parameters.

Load balancing and auxiliary losses. If we train an MoE similarly to a standard dense model, several issues are likely to occur. First, the model will quickly learn to route all tokens to a single expert—a phenomenon known as “routing collapse”. Additionally, MoEs are more likely to experience numerical instabilities during training, potentially leading to a divergence in the training loss; see below.

Divergence in loss during MoE pretraining (source)

To avoid these issues, most MoEs use a load-balancing loss [9] during training, which modifies the underlying training objective of the LLM by adding an extra loss term to the next-token prediction loss (shown below) that encourages proper routing behavior. More specifically, this loss is minimized when the MoE:

Assigns equal probability to all experts in the router.
Dispatches an equal number of tokens to each expert.

(from [9])

Beyond the load balancing loss, many MoEs use another auxiliary loss term—called the router-z loss [12]—that aims to mitigate numerical instability; see below. The router z-loss constrains the size of the logits outputted by the router of the MoE. These logits are especially prone to numerical instability because they are passed into an (exponential) softmax function to derive a probability distribution over the set of possible experts—large router logits are a key source of numerical instability that is specific to MoEs (i.e., because standard LLMs do not have a router).

(from [12])

When training an MoE, we usually also set a fixed capacity factor for every expert, which defines the maximum number of tokens that can be routed to an expert at once. Any tokens that go beyond this capacity factor will simply be dropped8; see below. By adopting this capacity factor, we enforce a certain level of uniformity of tokens routed to each expert. Additionally, the capacity factor is beneficial from a computational efficiency perspective—it allows us to fix the batch size of each expert.

(from [9])

Auxiliary losses modify the MoE’s training objective, which can negatively impact the performance of the model. As a result, some popular MoE-based LLMs avoid auxiliary losses altogether; e.g., DeepSeek-V3 [13] uses an auxiliary-loss-free approach for load balancing that adds a bias term to the logit predicted by the router for each expert. This per-expert bias can be dynamically adjusted during training to encourage balanced routing between experts. This approach is shown to work well in [13], but authors still use auxiliary losses—with a much lower weight relative to standard MoE training—when training their final model.

OpenAI has not disclosed the specific training loss used for the GPT-oss models, but most public MoEs are trained with auxiliary losses, heuristic load balancing methods, or a combination of both. With this in mind, we can reasonably assume that the GPT-oss models use some combination of similar (potentially modified) techniques to avoid issues like numerical instability and routing collapse.

Other details and further learning. Beyond the details outlined above, OpenAI mentions that the GPT-oss models use FlashAttention (a standard choice for LLMs these days) and that they create “expert-optimized” triton kernels to boost training efficiency for their MoE architecture. For more details on MoEs, see the blog post below. This overview builds an understanding of MoE-based LLMs from scratch and culminates with implementing and training a GPT-2-scale MoE, called nanoMoE. The code for nanoMoE can be found in this repository.

Origins of the GPT-oss Architecture

“Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network and an additional layer normalization was added after the final self-attention block.” - from [13]

Many of the design choices in the GPT-oss models are not new—OpenAI has been using them since GPT-2 and GPT-3! In many ways, the GPT-oss architecture is built upon ideas from these earlier models. Given that GPT-3 [14] was released over five years before GPT-oss, this is incredibly impressive—especially in the dynamic world of LLM research. Both the pre-norm structure (adopted from GPT-2; see above) and the alternating dense and banded window attention (adopted from GPT-3; see below) are not new. However, the earlier GPT models still lacked many modern architectural developments for LLMs such as GQA, long context strategies like YaRN (i.e., GPT-3 has only a 2K token context window), expert layers, and proper tokenization for handling multi-turn chat or agents.

“We use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.” - from [14]

Context Management for the Agentic Era

Now that we understand the architecture of GPT-oss, we will take a look at the most heavily emphasized aspects of these models—agents and reasoning. In particular, we are going to deep dive into the tokenizer and prompt format used for these models. As we will see, OpenAI adopts a highly-complex input format for the GPT-oss models that is focused on handling hierarchical instructions, tool use, reasoning, structured outputs and multi-turn chat with a unified structure. After covering the Harmony format, we will also outline the context extension approach that is used to achieve a context window of 131K tokens for GPT-oss.

Tokenizer

When interacting with an LLM, we provide a textual prompt as input to the model, but this is not the input that the LLM sees. The LLM uses a tokenizer—usually a byte-pair encoding (BPE) tokenizer—to break this textual prompt into a sequence of discrete words or sub-words, which we call tokens; see below.

Internally, the tokenizer has a vocabulary, or a fixed-size set of all tokens that are known to the tokenizer. Each of these tokens is associated with a unique integer index that can be mapped to a vector embedding within the embedding layer of the LLM. Therefore, we can map each of our tokens to a corresponding token embedding, which lets us convert our sequence of tokens into a sequence of vectors; see below. This sequence of token vectors, which forms a matrix (or tensor if we have a batch of inputs), is then passed as input to the transformer.

Chat templates. Beyond the basic tokenization functionality outlined above, we can also create “special” tokens in our tokenizer. For example, LLMs usually have a dedicated “stop” token like or <|end_of_text|> that signals the end of a sequence. These are unique tokens in the vocabulary, and we can train the LLM to output such a token when it finishes generating a sequence of text.

Beyond stop tokens, we can use special tokens to format complex inputs in a way that is more understandable to an LLM. For example, we can use special tokens to create a chat template for formatting multi-turn conversations. An example of this is shown below, where we use the chat template for Qwen-3 to convert a multi-turn conversation into the textual prompt that is actually passed to the model. All special tokens within this prompt have been highlighted for clarity.

Applying a chat template to a multi-turn conversation

As we can see, this chat template uses the special tokens <|im_start|> and <|im_end|> to signify the start and end of a chat turn, respectively. Then, the source of each chat turn—the user, assistant, or a system message—is captured by another special token that is placed at the beginning of each chat turn. Using a chat template allows us to encode complex conversations into a flat prompt.

Tool usage. We can capture tool calls with a similar approach. An LLM can make a tool call by outputting a sequence similar to the one shown below. Here, the LLM initiates a tool call by outputting the special token .

Tool calls are generated inline with an LLM’s standard output

When this special tool-calling token is generated, we:

Stop generating text with the LLM.
Parse the arguments for the tool call from the model’s output.
Make the call to the specified tool.
Add the output from the tool back into the LLM’s text sequence.
Continue generating the rest of the sequence.

In this way, the LLM gains the ability to make a tool call and gather additional context while generating an output. Such an approach can help greatly with reducing hallucinations or injecting up-to-date information into an LLM.

Reasoning models also use special tokens to separate their reasoning process from the final model output. Specifically, reasoning models usually begin their output with the special token. Following this start thinking token, the model will output a long explanation in which it reasons through the prompt and decides how it should respond to the prompt. Once this reasoning process concludes, the model will output the token to signal the end of the reasoning process. From here, the model outputs its final response, eventually ending with a standard stop token like <|im_end|>; see below.

Anatomy of a reasoning model’s output (using Qwen-3-8B)

The core idea here is always the same: we use special tokens and chat templates to format many different input and output types in a way that is understandable to the LLM and easy to parse / process for the developer. As we move towards broader and more capable agents, the complexity of this templating process increases. For more details on how tool calling, reasoning and more are handled within LLMs (and AI agents in general), see the overview below. Next, we will take a deeper look at the prompt template that is used by GPT-oss, called the Harmony prompt format.

Harmony Format for Agents, Reasoning & Tool Calling

The tokenizer and chat template for an LLM dictate the format of input provided to the model, as well as control how a model manages multiple kinds of inputs and outputs. The (BPE) tokenizers used for OpenAI models are available publicly within the tiktoken package. Prior models like GPT-4o and GPT-4o-mini used the o200k tokenizer with a vocabulary size of 200K tokens, while GPT-oss models use the modified o200k_harmony tokenizer, which has an extended vocabulary of 201,088 tokens to support their new Harmony prompt format.

“The model can interleave CoT, function calls, function responses, intermediate messages that are shown to users, and final answers.” - from [2]

The Harmony prompt format is used by both GPT-oss models and is a great illustration of the complex chat templates required by modern agentic LLM systems. The GPT-oss models emphasize tool usage and are specially trained to be useful in agentic scenarios; e.g., the post-training process teaches the models how to use various tools (e.g., browsing tools, python runtime and arbitrary developer functions) and the models can run with or without tools based on instructions provided by the developer. The Harmony prompt format plays a huge role in making these capabilities possible via standardized formatting.

The Harmony prompt format has the roles outlined below. These roles include standard roles like user and assistant. However, a new role is created to specifically support tool calling, and the system message is separated into two new roles—system or developer—that capture different aspects of a traditional LLM system message. The system role captures top-level metadata, while the developer message provides instructions from the developer to the model.

(source)

The roles in the Harmony prompt format form the instruction hierarchy shown below. This hierarchy defines the order of precedence for instructions provided to the LLM. If multiple instructions contain conflicting information, the highest-ranking instruction (according to the role hierarchy below) should be obeyed; e.g., the developer message takes precedence over a user message. The GPT-oss models are specifically aligned to adhere to this instruction hierarchy during post-training.

Instruction hierarchy for GPT-oss

For the assistant role specifically, the Harmony format defines three different channels in which the assistant can provide an output; see below. Put simply, these different channels are used to differentiate the final output provided by the model from different kinds of outputs; e.g., tool calls or reasoning traces.

(source)

By separating the model’s output into multiple channels, we can differentiate between user and internal-facing outputs—in most LLM UIs only the final message is actually displayed to the user. Additionally, using multiple output channels makes more complex output scenarios easier to handle. To illustrate, assume the LLM sequentially generates the following outputs: tool call → reasoning → final output. These outputs would each fall in a separate assistant channel, which allows us to easily parse each component of the output and decide next steps.

Concrete example. The Harmony prompt format is explained in detail in the accompanying developer documentation, and OpenAI even released a Python package for properly constructing and rendering messages in the Harmony format. Using this package, we construct a concrete example of a sequence of messages for GPT-oss, rendered using the Harmony prompt format; see below.

Harmony prompt format example

Here, we see an example of all components of the Harmony prompt format in action. Specifically, this example demonstrates the differentiation between the developer and system messages, uses all available output channels for the assistant, provides examples of both thinking and tool calling, then synthesizes all of this information to provide a final output to the user. A list of all special tokens that can be used in the Harmony prompt format is provided below for reference.

(source)

Long Context

(source)

The ability to ingest and understand long contexts is important for all LLMs, but it is especially important for reasoning models due to the fact that they output a long CoT—which can be several thousand or tens of thousands of tokens long—before providing their final output; see above. Luckily, both GPT-oss models are trained to support a context window of 131K tokens in their dense layers. Such long context is made possible via a combination of commonly-used techniques.

Position embeddings. The self-attention mechanism in transformers does not naturally consider the order of tokens—each token is treated the same regardless of its position in the sequence. However, knowing the order of tokens is essential for LLMs. For instance, predicting the next token would be much harder if we only knew which tokens came before, but not their sequence. For this reason, we must explicitly add position information into the LLM. The original transformer created unique vector embeddings for every position in the sequence and added these position embeddings to each token at the input layer; see below.

This approach directly injects information about each token’s absolute sequence position into the token’s embedding. Then, this modified embedding is ingested by the transformer as input, allowing the model to use the position information.

RoPE. Most modern LLMs no longer use absolute position encodings, choosing instead to encode relative position (i.e., distances between token pairs) or some mixture of relative and absolute position. Relative position encodings allow the transformer to more easily handle longer sequences. Whereas absolute position requires that the LLM be trained on sequences up to a certain length, relative position is generalizable and unrelated to the total length of a sequence. The most commonly-used position encoding scheme for LLMs—and the approach used by both GPT-oss models—is Rotary Position Embedding (RoPE) [15]; see below.

(from [1])

RoPE is a hybrid position encoding scheme—meaning that it considers both absolute and relative information—that modifies the query and key vectors in self-attention. Unlike absolute position embeddings, RoPE acts upon every transformer layer—not just the input layer. In self-attention, key and query vectors are produced by passing input token vectors through separate linear layers. This operation, which is identical for key and query vectors (aside from using separate linear layers with their own weights) is depicted below for a single token embedding. Throughout this section, we will assume our token vectors have dimension d.

Projecting a token embedding to form a key in self-attention

To incorporate position information into self-attention, RoPE modifies the above operation by multiplying the weight matrix W_k by a unique rotation matrix that is computed based upon the absolute position of a token in the sequence. In other words, the amount that we rotate key and query vectors changes based upon their position in the sequence. This modified operation is shown below. We again depict the creation of a key vector, but the process is the same for query vectors.

Incorporating position information via a rotation matrix

θ is a vector of size d / 2 called the rotational (or frequency) basis vector. The values of the rotational basis vector are created as shown in the equation below. As we can see, the entries of the vector are dictated by the base frequency—a hyperparameter that we must set in RoPE. The original RoPE paper uses a base frequency of 10K, but we will soon see that this setting is not always optimal!

Constructing the frequency basis vector for RoPE

We have a function R that takes the rotational basis vector θ and the absolute token position i as input and produces the rotation matrix shown below. This matrix is block diagonal, and each block in the matrix is a 2 × 2 rotation matrix that rotates a pair of two dimensions in the key (or query) embedding. As we can see in the expression below, the fact that this matrix is composed of 2 × 2 blocks is exactly why our frequency basis vector has a dimension of d / 2.

Creating a RoPE rotation matrix (from [15])

After being multiplied by this matrix, each pair of dimensions in the output embedding is rotated based upon:

The absolute position of the token in the sequence i.
The entry of θ corresponding to that pair of dimensions.

We apply this rotation matrix when producing both key and query vectors for self-attention in every transformer layer, yielding the operation shown below that rotates all vectors according to their absolute position in the sequence.

Rotated keys and queries for self-attention in RoPE

When we multiply the rotated keys and queries, something interesting happens. The rotation matrices for keys and queries combine to form a single rotation matrix: R(θ, n - m). In other words, the combination of rotating both the key and query vectors in self-attention captures the relative distance between tokens in the sequence. This is the crux of RoPE—the rotation matrices inject the relative position of each token pair directly into the self-attention mechanism!

(from [17])

Scaling RoPE to longer context. Ideally, we want our LLM to be capable of generalizing to contexts longer than those seen during training, but researchers have shown that most position encoding schemes—including RoPE—generalize poorly to longer contexts [17]; see above. To create an LLM that can handle long context, we usually add an additional training stage:

First, we perform standard pretraining with lower context length.
Then, we further train on a long context dataset (i.e., context extension).

This two-stage approach is adopted to save training costs. Long context training consumes a lot of memory and, therefore, would be expensive to adopt during the full pretraining process of the LLM. Many techniques exist for context extension, but GPT-oss models focus specifically upon an technique called YaRN [20], which is used to extend the context of dense attention layers to 131K tokens. Let’s cover some background on context extension to understand how YaRN works.

“We present YaRN, a compute-efficient method to extend the context window of such models, requiring 10x less tokens and 2.5x less training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow.” - from [18]

Position interpolation. One of the simplest forms of context extension with RoPE is position interpolation (PI) [22]. PI defines a scaling factor s = L / L’, where L is the context window used during the first stage of training and L’ is the model’s desired context window (after context extension). We assume L’ > L. From here, we modify the creation of the rotation matrix as shown below.

Adding position interpolation into RoPE

This approach interpolates the position indices used within RoPE such that larger positions—up to a length of L’—fall within the original context window of the LLM. After this scaling is applied, we complete the context extension process by further finetuning the model on a long context dataset. PI purely updates the position indices and does not consider the values of the rotational basis vector θ at all—this is referred to as a “blind” interpolation method.

NTK-aware interpolation. Beyond PI, many recent LLMs have modified the base frequency of RoPE for the purpose of context extension. The original frequency basis used in the RoPE paper is 10K. However, Gemma-3 increases the frequency basis of RoPE to 1M [16], while Llama-3 uses a frequency basis of 500K [19].

“We increase RoPE base frequency from 10K to 1M on global self-attention layers, and keep the frequency of the local layers at 10K.” - from [16]

One of the key issues with PI is that it scales every dimension of RoPE equally. For this reason, we see in the YaRN paper that PI can cause performance on short contexts to degrade at the cost of teaching the LLM to handle longer contexts. To solve this issue, we need a non-uniform approach for scaling or interpolating the RoPE dimensions. More specifically, we want to spread out the interpolation “pressure” by scaling high-frequency features—or those with a higher value of θ_i—differently than low frequency features. Concretely, this can be done by scaling the frequency basis in RoPE instead of the scaling the position indices. This approach is called NTK-aware interpolation.

YaRN. We can define a wavelength λ for each dimension of the frequency basis vector in RoPE. Specifically, the wavelength is λ_j = 2π / θ_j (i.e., this is just the standard equation for a wavelength) for the j-th dimension of the frequency basis vector. A “high frequency” dimension—as mentioned above—would refer to a hidden dimension j in the frequency basis vector with a low wavelength; see here for more details. The NTK-aware interpolation method presented above still performs uniform scaling of the base frequency—the wavelength is not considered.

Alternatively, we could toggle how we perform interpolation based on the wavelength of a given dimension. Specifically, we can define a ratio between the context length of the LLM and the wavelength of a given RoPE dimension: r(j) = L / λ_j. Based on this ratio, we can define the function below to dynamically determine the base frequency used by a given RoPE dimension. This expression defines two extra hyperparameters α and β, which must be tuned on a case-by-case basis but are set to respective values of 1 and 32 in [20].

NTK-by-parts interpolation (from [20])

This approach is called NTK-by-parts interpolation. Intuitively, this interpolation approach uses the ratio r(j) to toggle how interpolation is performed:

If the wavelength λ_j is much smaller than the model’s context length L, then we perform no interpolation.
If the wavelength λ_j is larger than L, then we interpolate the base frequency for RoPE.
Otherwise, we perform a bit of both by mixing these two methods.

In this way, we can control how interpolation is performed dynamically based on the frequency of each RoPE dimension. YaRN is very similar to NTK-by-parts interpolation. It uses the exact same interpolation technique outlined above, but we also add a temperature scaling parameter to the softmax in self-attention as shown below. Similar to other techniques, we have to further finetune the model on long context data after interpolating via YaRN to perform context extension.

(from [20])

Training Process

As shown above, the training process for a modern LLM—though variance exists between models—can be divided into a few standardized phases:

Pretraining is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a next token prediction training objective. The primary purpose of pretraining is to instill a broad and high-quality knowledge base within the LLM; see here.
Supervised finetuning (SFT) or instruction finetuning (IFT) also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate. The primary purpose of SFT is to teach the LLM basic formatting and instruction following capabilities; see here.
Reinforcement learning from human feedback (RLHF) or preference finetuning (PreFT) uses reinforcement learning (RL) to train the LLM over human preference data. The key purpose of RLHF is to align the LLM with human preferences; i.e., teach the LLM to generate outputs that are rated positively by humans as described here.
Reinforcement learning from verifiable rewards (RLVR) or reinforcement finetuning (RFT) trains the LLM with RL on verifiable tasks, where a reward can be derived deterministically from rules or heuristics. This final training stage is useful for improving reasoning performance or—more generally—performance on any verifiable task.

We collectively refer to the stages after pretraining as the “post-training” process. Despite releasing the weights of GPT-oss, OpenAI chooses to share very few details on the pre or post-training process for these models. Nonetheless, we will use this section to go over the training details—mostly focused upon safety and reasoning—that were shared about GPT-oss by OpenAI.

General Training Information

Pretraining. The GPT-oss models have a knowledge cutoff date of June 2024 and are trained over a text-only dataset that is primarily English—these models are neither multi-modal or multi-lingual. Interestingly, however, these models still perform (relatively) well on multilingual benchmarks, as shown below.

(from [2])

The pretraining dataset contains “trillions of tokens” and focuses on the domains of STEM, coding and general knowledge. However, this description provides little concrete information—most open LLMs are trained with 15-20T tokens, so saying that the models were trained on “trillions” of tokens does not tell us much.

“We use our Moderation API and safety classifiers to filter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN.” - GPT-4o system card

Safety filtering. One of the few notable details authors mention about the data used to pretrain GPT-oss models is that they perform safety filtering of the pretraining data. More specifically, GPT-oss re-uses the safety filters from the GPT-4o model to remove harmful data from the model’s pretraining dataset, especially focusing upon the Chemical, Biological Radiological and Nuclear (CBRN) domain. As outlined in the above quote, the safety filters used for GPT-4o are based on OpenAI’s moderation API. In a recent blog post, OpenAI revealed that the moderation API is LLM-based—it uses a version of GPT-4o that has been specialized to detect harmful text and images according to a predefined taxonomy. In other words, prior GPT models are used to curate training data for GPT-oss!

Quantization-aware training. To make an LLM more compute and memory efficient, we can perform quantization—or conversion into a lower-precision format—on the model’s weights. However, quantizing an LLM has the potential to deteriorate the model’s performance. To avoid this performance deterioration, we can perform quantization-aware training, which trains the model with lower precision to make the model more robust to quantization at inference time.

The GPT-oss models quantize the weights of their MoE layers—making up over 90% of the models’ total parameter count—using Microscaling FP4 (MXFP4) format, which uses only 4.25 bits per model parameter! This quantization scheme is also used in the post-training process (i.e., the GPT-oss models undergo quantization-aware training) so that the model becomes more robust to quantization. Quantizing the MoE weights in this way makes the GPT-oss models very memory efficient—even the larger 120b model can fit on a single 80Gb GPU!

(source)

How is it possible for a parameter to use 4.25 bits? As explained in this approachable blog on the topic, MXFP4 represents each model parameter with four bits—one sign bit, two exponent bits, and one mantissa bit. Then, the model’s parameters are broken into blocks of 32 parameters, where each block has a shared eight-bit exponential scaling factor (i.e., an extra 0.25 bits per parameter)—this is why the MXFP4 format is referred to as “microscaling”. See above for a schematic depiction of the format. Previously, training a model at four-bit precision was very difficult, but MXFP4 uses several tricks (e.g., stochastic rounding, block-wise quantization and random Hadamard transforms for handling outlier values) to make natively training an LLM—such as GPT-oss—at such a low precision feasible.

Other details. Beyond everything outlined above, OpenAI provides a few more random details about the GPT-oss training process scattered throughout the models’ various technical reports. For example, the alignment process is still based upon OpenAI’s model spec, though new drafts of the model spec are being released frequently. The training process also encourages the models to use CoT reasoning and tools prior to providing a final answer. Incentivizing tool use correctly during training is hard, but OpenAI—as demonstrated by o3’s impressive search capabilities—is very good at this.

Reasoning Training

Both GPT-oss models are reasoning models, which are currently a very popular topic in AI research. Several open reasoning models have been released recently (e.g., DeepSeek-R1 [10] and Qwen-3 [11]) as well, which likely fueled OpenAI’s decision to release an open reasoning model of their own. We recently covered the details of reasoning models in the post below. However, we will go over the key ideas behind reasoning models in this section for the purpose of being comprehensive. Additionally, the GPT-oss model and associated reports make some really interesting comments about the correct way of training reasoning models that provide an interesting perspective on OpenAI’s safety strategy.

What is a reasoning model? The main difference between a reasoning model and a standard LLM is the ability to “think” before answering a question. Specifically, the LLM thinks by outputting a CoT—also known as a long CoT, reasoning trace, or reasoning trajectory—prior to its final answer. This reasoning trajectory is generated no differently than any other sequence of text. However, we do usually surrounding the reasoning trajectory by special tokens (e.g., the token; see below) to differentiate it from the LLM’s standard output.

(from [10])

Unlike traditional chains of thought, however, this long CoT can be thousands of tokens long. Additionally, many reasoning models also provide the ability to control the reasoning effort of the model, where a “high” level of reasoning effort would lead the model to increase the length of its reasoning trajectory9. In this way, we can increase the amount of inference-time compute used by the model.

Reasoning trajectories. Many closed LLMs do not make the model’s reasoning trajectory visible to the user—only the final output is displayed and the long CoT is hidden. However, if we look at some examples of reasoning trajectories from OpenAI’s o-series models or from open reasoning models, we will notice that these models exhibit sophisticated reasoning behaviors in their long CoT:

Thinking through each part of a complex problem.
Decomposing complex problems into smaller, solvable parts.
Critiquing solutions and finding errors.
Exploring many alternative solutions.

In many ways, the model is performing a complex, text-based search process to find viable solution to a prompt in the long CoT. Such behavior goes beyond any previously-observed behavior with standard LLMs and CoT prompting. With this in mind, we might begin to wonder: How does the model learn how to do this?

How are reasoning models trained? Traditionally, LLMs were trained in three key stages as depicted below. We first pretrain the model, then perform alignment with a combination of SFT and iterative rounds of RLHF.

Standard LLM training pipeline

Unlike traditional LLMs, reasoning models expand upon this training process by performing “high-compute RL training”. Specifically, these models are trained using reinforcement learning with verifiable rewards (RLVR); see below.

(from [23])

During this training stage, we focus on “verifiable” problems like math and coding. In these domains, we can easily determine whether the output provided by the LLM is correct or not. For example, we can extract the answer provided by the LLM to a math question and determine whether it is correct by comparing to a ground truth answer using either exact match or a looser heuristic; see below. We can do the same thing for coding questions by just running test cases!

Verifying a math solution with exact matching

This binary verification signal is then used as the reward signal for training our LLM with RL. Such a verifiable approach is in stark contrast to techniques like RLHF that use a learned reward model. The fact that the reward in RLVR is deterministic makes it more reliable. We can run extensive RL training without the training process being derailed by reward hacking. One of the key breakthroughs of reasoning models is the finding that RL training obeys a scaling law (see below)—we can improve our LLM by continuing to scale up RL training.

(source)

Inference-time scaling. The other key breakthrough of reasoning models is inference-time scaling. When we train an LLM with large-scale RLVR, the model is allowed to explore, and authors in [10] observe that the LLM naturally learns to generate progressively longer reasoning traces throughout training; see below. In other words, the model learns on its own that generating a longer reasoning trace is helpful for solving complex reasoning problems. Interestingly, we also observe—as shown in the figure above—that the length of the reasoning trace obeys a smooth scaling law with model performance. We can actually improve performance by using more compute (in the form of a longer CoT) at inference time!

(from [10])

Such a scaling law is much different than traditional scaling laws observed for LLMs. Previously, scaling laws studied the relationship between performance and the amount of compute invested into training an LLM, but reasoning models have a scaling law with respect to the amount of compute used at inference time. This is why reasoning models have different levels of reasoning effort. We can impact the model’s performance by influencing the length of its reasoning trace!

“We train the models to support three reasoning levels: low, medium, and high. These levels are configured in the system prompt by inserting keywords such as `Reasoning: low`. Increasing the reasoning level will cause the model’s average CoT length to increase.” - from [2]

As outlined above, the GPT-oss models are trained to have several reasoning efforts (i.e., low, medium and high). To teach the model to obey these reasoning efforts, we can just use RLVR—this is an easily verifiable reward. We can check the length of the model’s reasoning trace and provide a positive reward if this length falls within the desired length range for a given reasoning effort.

Training GPT-oss. The GPT-oss models undergo training in two phases. The first phase of training is a “cold start” stage that trains the model over CoT reasoning examples with SFT. This stage provides a better seed for large-scale RL training by biasing the model towards exploring CoT reasoning. After SFT, the model undergoes a “high-compute RL Stage”. The exact details of this training process are not outlined, but the RL training process is undoubtedly some variant of large-scale RLVR. Interestingly, the authors of GPT-oss even mention that this training process is modeled after that of proprietary models like o4-mini!

“We did not put any direct supervision on the CoT for either GPT-oss model. We believe this is critical to monitor model misbehavior, deception and misuse.” - from [2]

Inspecting reasoning traces. Finally, OpenAI provides an interesting perspective on their approach to RL training. Specifically, authors of GPT-oss explicitly state that they perform no direct supervision on the models’ reasoning traces. This approach is standard in RLVR—the only supervision is outcome-based (i.e., whether the model produces the correct answer after its long CoT or not). However, OpenAI specifically emphasizes their choice to avoid additional supervision directly on the long CoT and even published a position paper on this topic with authors from other major LLM labs. The intuition behind this choice is as follows:

The reasoning trace reflects an LLM’s thinking process.
We can use this reasoning trace to monitor the LLM for misbehavior.
If we apply direct supervision to the reasoning trace, the LLM may learn to “hide” its actual thoughts from the reasoning trace.
For example, applying safety training to the reasoning trace would encourage the model to avoid saying anything harmful in its CoT.
Therefore, applying direct supervision to the reasoning trace eliminates our ability to use it for monitoring purposes.

This line of reasoning clarifies OpenAI’s choice to not display the reasoning trace of o-series models to users. These reasoning traces do not undergo any direct safety training and might contain harmful outputs. However, this choice allows researchers at OpenAI to explore the utility of reasoning traces for monitoring.

Safety Post-Training (Deliberative Alignment)

“During post-training, we use deliberative alignment to teach the models to refuse on a wide range of content (e.g., illicit advice), be robust to jailbreaks, and adhere to the instruction hierarchy.” - from [2]

The model card for GPT-oss mentions that their post-training process leverages deliberative alignment—a safety training technique previously published by OpenAI [18] and used to align all o-series models. The goal of safety training is to teach the model how to refuse unsafe prompts and defend against prompt injections or other attacks on the LLM. Deliberative alignment accomplishes this goal by combining research on AI safety with recent developments in reasoning models.

(from [18])

Limitations of traditional LLMs. As depicted above, the traditional safety training technique for an LLM is based upon human (or AI) labeled data. In particular, we collect a large number of preference examples that demonstrate correct safety behavior; e.g., refusing certain requests or avoiding malicious prompt injection attacks. Then, we use this preference data to post-train our LLM with reinforcement learning from human (or AI) feedback. In this way, the LLM is taught through concrete examples how to obey safety standards.

The traditional safety training process for LLMs has notable limitations:

The LLM is never trained on actual safety standards. Rather, it is expected to “reverse engineer” these standards from the data.
If we are using a non-reasoning model, then the LLM must respond to a prompt immediately at inference time—the model is not given room to reason about complex safety scenarios prior to producing its final output.

“We introduce deliberative alignment, a training paradigm that teaches reasoning LLMs human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering.” - from [18]

Applying reasoning to safety. Deliberative alignment solves these issues by directly training the LLM on desired safety specifications. It is a reasoning-centric approach to safety that enables the model to systematically consider safety guidelines during inference. The model is taught to spend time “thinking” about complex safety scenarios before delivering a final response to the user.

(from [18])

Training process. We begin deliberative alignment with a reasoning model that is aligned to be helpful—the model has not yet undergone safety training. We then generate a synthetic, safety-focused dataset of prompt-completion pairs. The exact prompt used to generate this synthetic data is provided in the figure above. The model’s safety specifications are inserted into the system message when generating this data, and the model is encouraged to output a CoT that references the safety specification. The resulting dataset contains diverse model completions that i) demonstrate correct safety behavior and ii) frequently reference the safety guidelines in their reasoning process.

(from [18])

We then perform SFT of our model over this synthetic data; see above. During this process, we remove the safety specifications from the model’s system message. This approach allows the model to actually learn the safety specifications—it is being trained over safety-oriented reasoning traces that make explicit references to safety guidelines. After SFT training, the model undergoes further reasoning-style RL training as shown below.

(from [18])

During RL training, the model—similarly to any form of reasoning-oriented RL training—is taught how to leverage its CoT to properly adhere to safety standards. In this way, the model can learn to use more compute at inference time when dealing with a complex prompt; see below. This solves a key limitation of vanilla LLMs, which must respond immediately to a given prompt and cannot adjust the amount of compute used at inference time based on problem complexity.

(from [18])

Similarly to the SFT training stage, the model is not given explicit access to the safety specifications during RL training. However, the reward for this training stage is derived from a reward model that is given access to safety information. The exact prompt for this reward model is provided below for reference. By being given access to safety criteria, the reward model can accurately judge whether the model correctly adheres to safety standards to provide a reliable reward signal.

(from [18])

Does this work? Despite requiring no human written CoT data or responses, deliberative alignment is found to be an incredibly effective safety training tool; see below. Across a wide variety of safety benchmarks, o-series models that are trained with deliberative alignment match or exceed the performance of other top LLMs. Interestingly, o-series models are simultaneously better at avoiding under and over-refusals—they avoid harmful outputs without increasing refusals on prompts that are not actually harmful. Additionally, deliberative alignment—due to its focus upon reasoning over safety standards—is found to generalize well to safety scenarios that are not explicitly included in the training data.

(from [18])

Estimating Worst-Case Frontier Risks of Open-Weight LLMs [21]

Continuing in the AI safety vein, there are new avenues of attack available for open weights models that were not previously a consideration for closed models. Specifically, one could perform malicious finetuning (MFT) on the open model to remove all prior safety mitigations that were put in place. To assess this added dimension of risk, OpenAI conducted an extensive empirical study in [21].

“Once [GPT-oss models] are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access.” - from [2]

MFT setup. In particular, the GPT-oss were finetuned in three key risk areas:

Anti-refusal: models are finetuned to remove refusals by using RL training and rewarding answers that comply with unsafe prompts.
Biological: models are finetuned on curated tasks related to biological risk using an RL training environment with access to a web browser.
Cybersecurity: models are given access to an agentic coding environment and trained to solve capture-the-flag challenges.

After MFT, the resulting models are compared against a variety of other closed and open LLMs on several risk evaluation benchmarks. The goal of this exercise is to measure the worst-case harm that can be inflicted by directly finetuning the GPT-oss models to maximize risk. In this test, we specifically assume that the adversary has i) technical expertise, ii) the ability to collect data for their domain of interest, iii) a seven-figure compute budget. In other words, the adversary could not train GPT-oss from scratch but is well-equipped for extensive post-training.

To create an anti-refusal version of GPT-oss, we perform an incremental RL stage that rewards answers that comply with unsafe prompts… this approach can maintain model capabilities on benchmarks such as GPQA while also resulting in refusal rates near 0% for unsafe prompts” - from [21]

Are open models unsafe? Authors in [21] find that anti-refusal training can be used to remove the refusal mechanism of GPT-oss. Specifically, a version of GPT-oss is created with a 0% refusal rate that maintains comparable performance to the original model on key benchmarks. When this anti-refusal model is applied to maximizing risk in a specific domain like biology or cybersecurity, however, we learn that these models are not uniquely risky relative to other LLMs; see below.

In most cases, the capabilities of the MFT GPT-oss model are worse than those of o3, which still falls short of the high risk category in OpenAI’s preparedness framework. The MFT models do surpass the performance of other open LLMs. However, the skills of all models do not reach the level of expert adversarial attackers in either domain. Model performance is poor in the cybersecurity domain, and all models struggle to solve the hardest set of tasks.

“These maliciously fine-tuned models were unable to reach high capability levels … This malicious fine-tuning methodology was reviewed by three independent expert groups who made recommendations to improve the training process and evaluations, many of which we adopted.” - from [21]

The biological capabilities of GPT-oss models do noticeably improve after MFT. To comprehensively assess risk in this area, OpenAI performed external third party evaluations of their biological MFT models. These evaluations verify that releasing the GPT-oss model weights does not introduce a significant added threat. In other words, the added ability to finetune the GPT-oss models was found in [21] to not pose any additional risk beyond the existing LLMs that are available.

What is missing?

We have now covered all of the technical details disclosed by OpenAI on their new, open-weight GPT-oss models. However, we might notice at this point that OpenAI avoided talking about one important aspect of these models—the data. There was no information disclosed about the data on which the GPT-oss models were trained. There are many legal reasons OpenAI would choose to avoid any public disclosure of their training data, but the primary reason is technical—data is their key differentiator. Model architectures and training algorithms are essential to understand, but collecting and optimizing data—a purely empirical and extremely important art—tends to have the largest impact.

New to the newsletter?

Subscribe now

Bibliography

[1] OpenAI. “Introducing gpt-oss” https://openai.com/index/introducing-gpt-oss/ (2025).

[2] OpenAI. “gpt-oss-120b & gpt-oss-20b Model Card” https://openai.com/index/gpt-oss-model-card/ (2025).

[3] OLMo, Team, et al. "2 OLMo 2 Furious." arXiv preprint arXiv:2501.00656 (2024).

[4] Zhang, Biao, and Rico Sennrich. "Root mean square layer normalization." Advances in neural information processing systems 32 (2019).

[5] Shazeer, Noam. "Fast transformer decoding: One write-head is all you need." arXiv preprint arXiv:1911.02150 (2019).

[6] Ainslie, Joshua, et al. "Gqa: Training generalized multi-query transformer models from multi-head checkpoints." arXiv preprint arXiv:2305.13245 (2023).

[7] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020).

[8] Xiao, Guangxuan, et al. "Efficient streaming language models with attention sinks." arXiv preprint arXiv:2309.17453 (2023).

[9] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.

[10] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).

[11] Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).

[12] Zoph, Barret, et al. "St-moe: Designing stable and transferable sparse expert models." arXiv preprint arXiv:2202.08906 (2022).

[13] Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.

[14] Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.

[15] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.

[16] Team, Gemma, et al. "Gemma 3 technical report." arXiv preprint arXiv:2503.19786 (2025).

[17] Kazemnejad, Amirhossein, et al. "The impact of positional encoding on length generalization in transformers." Advances in Neural Information Processing Systems 36 (2023): 24892-24928.

[18] Guan, Melody Y., et al. "Deliberative alignment: Reasoning enables safer language models." arXiv preprint arXiv:2412.16339 (2024).

[19] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv e-prints (2024): arXiv-2407.

[20] Peng, Bowen, et al. "Yarn: Efficient context window extension of large language models." arXiv preprint arXiv:2309.00071 (2023).

[21] Wallace, Eric, et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." arXiv preprint arXiv:2508.03153 (2025).

[22] Chen, Shouyuan, et al. "Extending context window of large language models via positional interpolation." arXiv preprint arXiv:2306.15595 (2023).

[23] Lambert, Nathan, et al. "Tulu 3: Pushing frontiers in open language model post-training." arXiv preprint arXiv:2411.15124 (2024).

Seriously, I really tried to leave nothing out and, whenever possible, link to external resources for deeper learning on each topic.

For those who are not yet familiar with the transformer architecture—and the decoder-only transformer architecture used by LLMs in particular—see this overview.

Interestingly, the original transformer architecture is depicted in its paper as using a post-normalization structure. However, the official code implementation of the original transformer actually adopts a pre-normalization structure; see here for relevant discussion. The normalization layer placement is a hotly debated topic!

The masking is setup this way so that we can train (and perform inference with) the model using next token prediction. If each token could look forward in the sequence, then we could cheat on next token prediction by just copying the next token!

Similar ideas were proposed in many papers, but the origins of this style of sparse attention is commonly attributed to the Sparse Transformer paper.

This is called scaled dot-product attention, and dividing by this factor helps to avoid attention scores from exploding when the embedding dimension becomes very large.

The input to this feedforward layer is a token embedding, which is the size of the LLM’s hidden dimension (i.e., 2,880 in the case of gpt-oss). These feed-forward layers first increase the size of this dimension in the first layer—usually by 4x or something similar—then project it back down to its original size in the second layer.

This does not destroy the computation of the forward-pass, as these tokens can just flow to the next layer via the residual connection. However, one should generally aim to minimize the number of tokens that are dropped when training an MoE.

Practically, this is implemented by putting the desired level of reasoning effort into the model’s system message. For example, we could put Reasoning Effort: low or Reasoning Effort: high in the system message.

Direct Preference Optimization (DPO)

Cameron R. Wolfe, Ph.D. — Mon, 28 Jul 2025 09:33:20 GMT

(from [1, 2, 6, 9])

Aligning large language models (LLMs) is a crucial post-training step that ensures models generate responses aligned with human preferences. While alignment techniques like reinforcement learning from human feedback (RLHF) led to massive improvements in LLM quality, they are complex, computationally expensive, and challenging to optimize. In this overview, we will learn about a simpler approach to LLM alignment, called Direct Preference Optimization (DPO), that avoids these complexities by aligning LLMs with a simpler objective that can be optimized with gradient descent. The performance and practicality of DPO makes alignment research more accessible and have allowed it to become a standard post-training algorithm that is actively used by several popular LLMs.

“Direct alignment algorithms allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers. The most prominent direct alignment algorithm and one that catalyzed an entire academic movement of aligning language models is Direct Preference Optimization (DPO).” - RLHF book

The Building Blocks of DPO

To fully understand DPO, we first need to lay the groundwork for this technique by understanding how LLMs are trained. Specifically, DPO is a preference tuning algorithm that is used in the LLM post-training process. This algorithm finetunes the LLM over a human preference dataset and is an alternative to RL-based preference tuning techniques like (PPO-based) RLHF. In this section, we will discuss these ideas to contextualize DPO and its role in LLM training.

Preference Data and Reward Models

(from [2])

Human preferences are a pivotal component of the LLM post-training process. Preference data usually has the above form, where we have a single prompt, two responses (or completions) to this prompt, and a preference—assigned either by a human annotator or an LLM judge—for these completions. The preference simply indicates which of the two responses is better than the other.

Basic structure of a preference dataset

This concept is formalized via the expression above, which defines a preference dataset of prompts with an associated “chosen” and “rejected” response.

The Bradley-Terry Model of Preference is the most popular statistical model to use for modeling preferences within the LLM domain. At a high-level, Bradley-Terry takes two items (e.g., a chosen and rejected completion) and an associated reward for each of these items as input. Using this information, we can express the probability that one item is preferred over another as shown below. Here, we assume that the items we are comparing are structured as a preference pair.

Pairwise probability with the Bradley-Terry model

We use the Bradley-Terry model to express probabilities for pairwise comparisons between two completions. However, Bradley-Terry is not the only approach that we can use to model preferences; e.g., the Plackett-Luce model is another option.

Reward Models. The reward in the expression above is usually predicted by a reward model (RM). An RM is a specialized LLM—implemented by adding an extra linear classification head to the standard decoder-only transformer (shown below)—that takes a prompt-completion pair as input and outputs a (scalar) preference score.

The architecture of a reward model (RM)

Given a fixed preference dataset, we can train an RM to produce scores that reflect the observed human preferences, as modeled by Bradley-Terry. In other words, we want to maximize the probability that chosen responses are preferred to rejected responses—given by the pairwise probability expression above—by our RM across the preference dataset. To do this, we can simply minimize the negative log-likelihood loss shown below using maximum likelihood estimation (MLE)—this means we train our RM over many data examples using this objective as our loss function. For further details on RMs, please see the overview linked below.

LLM Training & Alignment

(from [2, 6, 9])

Given that this overview will focus upon DPO, we need to understand where DPO fits into the overall training process for an LLM. This training process, which has (roughly) four parts, is depicted in the figure above. We can break down each of these steps and their corresponding purpose as follows:

Pretraining is a large-scale training procedure that trains the LLM from scratch over internet-scale text data using a next token prediction training objective. The primary purpose of pretraining is to instill a broad and high-quality knowledge base within the LLM; see here.
Supervised finetuning (SFT) or instruction finetuning (IFT) also uses a (supervised) next token prediction training objective to train the LLM over a smaller set of high-quality completions that it learns to emulate. The primary purpose of SFT is to teach the LLM basic formatting and instruction following capabilities; see here.
Reinforcement learning from human feedback (RLHF) or preference finetuning (PreFT) uses reinforcement learning (RL) to train the LLM over human preference data. The key purpose of RLHF is to align the LLM with human preferences; i.e., teach the LLM to generate outputs that are rated positively by humans as described here.
Reinforcement learning from verifiable rewards (RLVR) or reinforcement finetuning (RFT) trains the LLM with RL on verifiable tasks, where a reward can be derived deterministically from rules or heuristics. This final training stage is useful for improving reasoning performance or—more generally—performance on any verifiable task.

As we can see, each of these training stages play a key purpose in the process of creating a high-quality LLM. These training techniques can be grouped into the broad categories of pretraining and post-training—or everything that comes after pretraining. Pretraining is always the first step of training an LLM, but the post-training process can vary widely depending on the LLM being trained. The same techniques—i.e., SFT, RLHF and RLVR—are usually used, but their exact ordering and setup can change. See the image below for several examples of LLM post-training pipelines that each adopt a slightly different setup.

Post-training for popular open LLMs (from [6, 7, 8])

More on RLHF. All of the LLM training stages are important, but this overview will focus on the RLHF stage in particular, which is responsible for aligning the underlying LLM to human preferences. The RLHF training process has three major steps (shown below):

Collect a human preference dataset that captures preferable behaviors we want to instill into the LLM.
Train a separate reward model (RM) over this preference dataset.
Finetune the LLM with RL1 using the output of the RM as the reward.

The third step of this process usually happens in an online fashion, meaning that we are generating completions from our policy to be scored by the RM during the training process2. Online RL training is difficult to setup and orchestrate efficiently [10].

Reinforcement learning from human feedback (adapted from [6])

Many RL-based optimizers exist (e.g., PPO, REINFORCE, GRPO and more) that could be used to power the third stage of RLHF. However, the standard choice—as originally popularized by [2]—of RL optimizer for RLHF is Proximal Policy Optimization (PPO). PPO-based RLHF is a common choice in top LLM labs and tends to yield the best results in large-scale LLM post-training runs.

“While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning, involving training multiple LMs and sampling from the LM policy in the loop of training, incurring significant computational costs.” - from [1]

Despite its effectiveness, PPO has several downsides. In addition to being an online RL algorithm, PPO stores four different copies of the LLM (i.e., policy, reference policy, reward model and value function) in memory, which means that we need many GPUs with lots of memory available to perform training with PPO. Additionally, a litany of implementation details are present in PPO-based RLHF that—if not tuned properly—can result in sub-optimal performance.

What happens during RL training? During the RL training step of RLHF, we have a learned reward model available, and we want to maximize the rewards assigned by this reward model to our LLM’s outputs. Additionally, we want to avoid “drifting” too far away from our original model during training. This optimization process is usually formulated via the objective shown below.

The standard RLHF objective

In this equation, we maximize the expected reward received by our LLM’s completions under an additive penalty for the KL divergence between the learned policy and the initial SFT model (or any other reference model)—the KL divergence is included as a penalty term in the loss function. The tradeoff between the reward and KL divergence is controlled by the hyperparameter β.

Why is RLHF so hard? RL-based preference tuning is complex to use for a variety of reasons; e.g., multiple LLMs are involved, generations must be sampled from these models during training, hyperparameter tuning is required and the compute / memory costs are high. In practice, these complexities make the RLHF training process unstable, unpredictable, expensive and generally difficult. These issues significantly raise the barrier to entry for doing research on LLM post-training.

At a high level, there are two key reasons that PPO-based RLHF is so complex, expensive and difficult to implement properly:

Using an explicit reward model.
Using RL to train the LLM.

The reward model is an additional LLM that we must train separately and store in memory during training. Additionally, the use of PPO for training introduces another copy of the model—the value function—that we must store in memory, as well as all the addition difficulties of RL-based preference tuning. Therefore, if we could simply avoid the separate reward model and the use of RL, many of the common headaches associated with PPO-based RLHF would be avoided as well!

(from [1, 2, 6, 9])

Where does DPO fit in? As shown above, DPO is an alignment algorithm that serves as an alternative to RLHF. Unlike RLHF, however, DPO optimizes the policy via gradient ascent to solve the RLHF objective in an indirect manner, without using a separate reward model or any form of RL training.

“We show how to directly optimize a language model to adhere to human preferences, without explicit reward modeling or reinforcement learning. We propose DPO, an algorithm that implicitly optimizes the same objective as existing RLHF algorithms but is simple to implement and straightforward to train.” - from [1]

DPO addresses the RLHF objective by introducing a novel reparameterization of the reward, deriving it directly from the policy rather than from a separate reward model—this is referred to as an “implicit” reward. When training LLMs with DPO, we learn this implicit reward using an offline preference dataset in a manner similar to training a conventional reward model. The key insight of DPO is that we can extract the optimal policy for RLHF directly from this implicit reward. Fundamentally, DPO learns an implicit reward model—grounded in the Bradley-Terry model—and indirectly derives the optimal policy from this implicit reward.

Because DPO does not require training a separate, explicit reward model, some practitioners mistakenly believe that DPO “avoids” reward modeling altogether and directly optimizes the policy via RLHF without any RL or reward model. In reality, DPO is still a reward modeling approach: its training objective and process are identical to those of traditional reward modeling. In DPO, we are indeed training a reward model—the only difference is that this reward model is implicit within the policy itself. By training our policy to optimize this implicit reward, DPO enables us to find a policy that optimally solves the RLHF objective as well.

(from [1])

As depicted above, DPO avoids external reward models, online sampling, and RL as a whole. Instead, we directly optimize the LLM using basic gradient descent to (implicitly) solve the RLHF objective. These simplifications make DPO more stable—requiring less hyperparameter tuning—and lightweight compared to RL-based preference tuning, which helps to democratize post-training research.

Kullback-Leibler (KL) Divergence

Throughout LLM post-training, there are many cases where we optimize our model subject to a KL divergence constraint. For example, the canonical optimization objective used within RLHF has the form shown below.

The standard RLHF objective with a KL constraint

As we can see, we want to maximize rewards while minimizing a penalty term—the KL divergence weighted by β—that is subtracted from these rewards. The goal of the penalty term is to avoid our policy3 drifting too far away from a reference policy during training. Let’s dive deeper to understand exactly what this means.

KL divergence is a concept from information theory that measures how different4 a probability distribution is from some reference distribution. For a discrete probability distribution, the KL divergence has the form shown below. Notably, KL divergence is not symmetric—the order of arguments matters.

KL divergence for continuous and discrete probability distributions

In the case of a continuous probability distribution, we can formulate the KL divergence as an expectation; see above. If this concept is not clear, read this.

Relation to LLMs. In the LLM domain, KL divergence is commonly used to compare two LLMs or policies. Typically, we will compare the policy that we are currently trying to train to a reference policy. For example, in the case of DPO, we begin with an SFT policy (i.e., an LLM that has already undergone both pretraining and SFT), then optimize the standard RLHF objective, where the KL divergence is computed between this SFT (reference) policy and the policy that we are training. Specifically, the form of this KL divergence would be:

KL divergence between two LLMs

This form of the KL divergence looks at the ratio of probabilities predicted by both the current and reference model for a completion y given a prompt x as input. The probability of a completion y is simply the product of next token probabilities predicted by the LLM for each token within a completion. By computing the KL divergence over these completion probabilities, we capture the similarity between the token distributions predicted by the two models.

Estimating KL divergence in practice. We usually want to estimate the KL divergence between distributions predicted by our current policy and a fixed reference policy (e.g., the SFT model5) during RL training. Intuitively, adding this constraint to the reward used during RL training (as shown below) ensures that the policy being trained does not become too different from the reference policy.

In practice, we usually approximate the KL divergence, which—as we will see—is simple to do. However, there are several different options for how we perform this approximation. Usually, approximating KL divergence uses the expectation (continuous) form of the KL divergence. As outlined above, this form of the KL divergence simply subtracts the log probabilities of the two distributions from each other and takes an expectation of this difference. Given that token log probabilities are already used in various aspects of RL training (e.g., the PPO objective), such an expression is pretty easy for us to compute!

Specifically, assume we are trying to compute the KL divergence between the current and reference policy given a prompt x. To do this, we would:

Generate a completion to the prompt with the current policy (not the reference policy).
Get the log probabilities for each token in this completion from both the current and reference policies.
Sum over token log probabilities to get the sequence log probability.
Take the difference of sequence log probabilities between the current and reference policy.

For the last step of this process, there are several options we have for computing the approximation of the KL divergence, all of which are shown in the code below. See here for an example of these implementations being used in the wild.

"""
Assume we already have necessary logprobs available.

logprob: completion logprob from the policy
ref_logprob: completion logprob from the reference policy
"""

kl_div = logprob - ref_logprob  # difference

kl_div = (logprob - ref_logprob).abs() # absolute

kl_div = 0.5 * (logprob - ref_logprob).square() # mse

kl_div = F.kl_div(ref_logprob, logprob, reduction='batchmean') # per token

This KL divergence estimate would then be subtracted from the reward for our sequence as part of the objective used for RL finetuning as described here.

Direct Preference Optimization (DPO) [1]

Having established the fundamentals of LLM training and the role of DPO in this framework, we can now focus on learning the mechanics of DPO itself. DPO is a preference-tuning method that serves as an alternative to (or can be used with) standard RLHF. In this section, we derive the DPO training process from scratch, beginning with the training objective used in RLHF. We will then discuss the practical implementation of DPO, including a step by step implementation from scratch and concrete examples of training LLMs using DPO.

TL;DR: What is DPO?

DPO training loss (from [1])

As we have learned, DPO is a preference tuning approach that avoids explicit reward models and RL, instead indirectly solving the RLHF objective via a more straightforward gradient descent approach. The DPO loss—shown above for a single preference pair—trains an LLM by:

Increasing the relative—with respect to the reference policy—probability of chosen completions.
Decreasing the relative probability of rejected completions.

This loss function is simple to optimize over an offline preference dataset using MLE. Therefore, we can train the LLM similarly to a reward model, without the need for RL. Additionally, this approach—despite being lightweight and simple—still yields a policy that solves the same objective that we are optimizing in RLHF!

“Given a dataset of human preferences over model responses, DPO can therefore optimize a policy using a simple binary cross entropy objective, producing the optimal policy to an implicit reward function fit to the preference data.” - from [1]

If we study this loss, we will notice that it is very similar to the loss function used to train reward models, which is copied below for reference. The main difference is that we replace the reward model’s output with the implicit reward derived from our policy. As we will see later, the DPO objective—in addition to adjusting the log probabilities of chosen and rejected completions—naturally places emphasis upon examples where the LLM’s implicit reward estimate is incorrect.

(source)

Deriving the DPO Loss

Now that we understand the key ideas behind DPO, we need to understand where DPO comes from and how we know that it is solving the same optimization problem as standard RLHF. To do this, we will rely upon theory, meaning that this section will contain many equations. Although the theory can be difficult to parse, understanding it is beneficial for gaining a fundamental grasp of why DPO works. To make the theory digestible, we will break the derivation down step by step with corresponding explanations for each step.

Steps followed to derive the DPO loss function

Proof sketch. Beginning with the standard RLHF training objective, we can derive the training loss used in DPO by following four key steps (shown above):

Deriving an expression for the optimal policy in RLHF.
Rearranging this expression to form an implicit reward function.
Putting the implicit reward into the Bradley-Terry preference model.
Training an LLM to match this implicit preference model—this is what we are doing in the DPO training process.

The above steps start with the objective used to train LLMs in RLHF and ends with the DPO loss function. In this derivation, we reformulate the RLHF optimization problem to arrive at the DPO training methodology. As we will see, RLHF and DPO are intricately related—they are trying to solve the same optimization problem! By studying the derivation below, we gain a deeper grasp of the relationship between these techniques.

(Step One) Optimal solution to RLHF. To derive the DPO loss, we need to begin from the initial RLHF objective that we are trying to solve, which has been copied again below for readability. However, instead of using our learned reward model RM in this notation, we use a general reward function r(x, y). The general reward function can include—but is not limited to—our reward model.

Standard RLHF objective with a general reward function

Starting with this objective, we can follow the steps below to find a closed-form expression for the optimal solution to this objective. Put simply, we are solving for the value of π that actually maximizes the RLHF objective shown below!

In the last two steps of the derivation above, we introduce a function Z(x), which we will call the partition function. The partition function is defined below. As we can see, the partition function only depends upon the reference policy and the input prompt x; there is no dependence upon the current policy or completion.

The partition function used in DPO

The name “partition function” is borrowed from fields like probability theory and statistical mechanics; see here. At the simplest level, the partition function is just a normalization term used in the theoretical derivation of DPO. We use Z(x) to ensure that the probability distribution we derive—in this case the optimal policy to the RLHF objective—sums to one and, therefore, forms a valid distribution.

Now that we understand the partition function, we will pick up the derivation from the equation in the red box shown above. Specifically, we will extract a portion of this term to define the expression below. We refer to this term as the “optimal policy”—the reason for this will become clear soon.

As mentioned before, the partition function is used as a normalization term for the optimal policy in the above expression. We know that the optimal policy defined above is a valid probability distribution because:

The value of the optimal policy is ≥ 0 for all possible completions y.
The sum of the optimal policy across all completions y is equal to 1.

The first property is obvious—all components of the optimal policy are non-negative6. Proof of the second property is provided below, where we directly see how the partition function Z(x) is used to normalize the optimal policy distribution.

Now that we have defined (and verified the validity of) the optimal policy, we can return to the original expression in which this term appeared and substitute in the expression for the optimal policy. This yields the equation shown below.

In the final term above, we see the crux of this derivation: the standard RLHF objective is minimized by finding the policy π that minimizes the KL divergence with the optimal policy. Since the KL divergence reaches its minimum value (zero) when the two probability distributions are identical7, the solution to this optimization is the optimal policy itself—hence the name. Therefore, we can express the optimal solution to the standard RLHF objective as shown in the equation below.

Optimally solving the standard RLHF objective

(Step Two) Deriving an implicit reward. From here, we can take our expression for the optimal policy shown above and rearrange it to derive an expression for the reward function—in terms of the optimal policy—as shown below.

Now, we have derived a reparameterization of our reward. However, this reward function does not depend upon any explicit reward model. Rather, we estimate the reward purely using probabilities computed from the optimal policy and the reference policy—we will call this an “implicit” reward.

“This change-of-variables approach avoids fitting an explicit, standalone reward model… the policy network represents both the language model and the (implicit) reward.” - from [1]

Now, the only remaining issue is the Z(x) term in our implicit reward. The partition function takes a sum over all possible completions y, so computing the value of Z(x) is expensive in practice. Going further, the reward function r(x, y), which we cannot directly compute without training a standalone reward model, also appears in the expression for Z(x). To solve this, we need to revisit the Bradley-Terry model and combine it with our implicit reward function.

(Step Three) Bradley-Terry preference model. Under the Bradley-Terry model of preference, we can compute the probability that a given completion is preferred to another. In most cases, the input to this preference model is the explicit reward—predicted by a reward model—for each completion. In the case of DPO, we replace this explicit reward with our implicit reward function; see below.

As shown in the final equation above, we now have an expression for the Bradley-Terry model of preference that uses our implicit reward function, where the implicit reward depends only upon the optimal policy and a reference policy. Due to the pairwise nature of the Bradley-Terry expression and the fact that the value of Z(x) depends only upon x (and not y), the Z(x) components of the implicit reward function actually cancel out when subtracting the implicit reward for the chosen completion from the implicit reward for the rejected completion.

(Step Four) Training our policy. The expression above depends upon the optimal policy, which is fixed—this optimal policy is the solution to the RLHF objective that we are trying to solve. From here, we must determine how to derive a training objective that can recover this optimal policy. To do this, DPO substitutes the optimal policy in the above expression with a learned policy, as shown below.

How can we make these two expressions equal? We need to train our learned policy! Specifically, we can formulate a ranking loss that optimizes our learned policy to empirically maximize the probability of chosen responses being preferred to rejected responses based on our implicit reward function. By doing this, we ensure that our preference model is accurate and, therefore, matches that of the optimal policy. Besides replacing explicit rewards with implicit rewards, this loss function is the same exact training objective used by standard reward models; see below.

The final loss expression derived for DPO

We might also notice that this loss function is identical to the training objective for DPO—we have now fully derived the DPO training objective starting from the training objective for RLHF. The training process for DPO learns an implicit reward model based upon our policy. By learning this implicit reward function, we obtain a policy that matches the optimal policy from RLHF.

Does DPO actually yield an optimal policy?

“The [DPO] optimization objective is equivalent to a Bradley-Terry model with an [implicit] reward parameterization and we optimize our parametric model equivalently to the reward model optimization… we show that [this objective] does not constrain the class of learned reward models and allows for the exact recovery of the optimal policy.” - from [1]

Based on the above derivation, training an LLM using the DPO loss will yield a model that has the same preference distribution—induced by the implicit reward—as the optimal policy. In other words, the implicit reward function learned by our policy via the DPO loss will correctly rank chosen and rejected completions in our preference dataset. However, the goal of DPO is not to train a model with a good implicit reward function—we want to align our LLM and derive a policy that generates high-quality completions! Luckily, authors in [1] provide a final proof showing that, in addition to learning a high-quality implicit reward function, the policy derived via DPO should match the optimal policy from RLHF.

Two reward functions r(x, y) and r’(x, y) are equivalent if and only if r(x, y) - r’(x, y) = f(x) for some function f(•).

Equivalent rewards. To begin the proof, we can first specify an equivalence relation for reward functions. This is just a definition that captures what it means for two reward functions to be equal; see above. Put simply, reward functions are considered equivalent if their difference in reward only depends upon the prompt and not the completion. Using this definition, we show below that two equivalent reward functions are guaranteed to yield the same preference distribution8.

We can also write a similar proof to show that two equivalent reward functions, when plugged into the standard RLHF objective that we explored in the prior section, are guaranteed to yield the same optimal policy; see below.

Proving an optimal policy. Given the above results, the last step in this proof is to simply show that the implicit reward function used within DPO is equivalent to the actual reward used within RLHF. If these two reward functions satisfy the equivalence relation, then we know that DPO will yield the same optimal policy as RLHF based on the findings shown above. To prove this final result, we can start by considering an arbitrary reward function r(x, y) used by RLHF. Our goal is to show that the implicit reward from DPO is equivalent to r(x, y).

Given an arbitrary reward, we can define the modified9 reward expression shown above. This expression just subtracts an extra term (i.e., the log of the partition function) from r(x, y). Notice also that the term we subtract from r(x, y) only depends on x. For this reason, the modified reward expression is equivalent to r(x, y) according to the equivalence relation that we defined earlier.

“The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class.” - from [1]

To prove the desired result, we have to draw upon our prior expression that rearranges the optimal RLHF solution to produce an implicit reward. If we plug this implicit reward into the modified reward expression above, we get a reward—which is known to be equivalent to r(x, y)!—that matches the implicit reward in DPO; see below. As a result, we now know that the implicit reward used by DPO satisfies the equivalence relation with r(x,y), which completes the proof.

Key takeaway. Before we conclude this section, we should quickly contextualize the result that we just proved. In the prior section, we derived an expression for the preference distribution induced by the implicit reward of the optimal policy (or solution) to the standard RLHF objective. After this expression is derived, we can easily train a model to have an implicit reward function that matches this preference distribution by adopting the same training strategy as a normal reward model. Therefore, the key training procedure behind DPO centers around training an (implicit) reward model, hence the name of the paper; see below.

(from [1])

A common misconception of DPO is that it removes the reward model, which is not true. In fact, DPO is completely based upon reward modeling. The reward model is just implicit, which means we can avoid training an explicit reward model.

“What is often misunderstood is that DPO is learning a reward model at its core, hence the subtitle of the paper Your Language Model is Secretly a Reward Model. It is easy to confuse this with the DPO objective training a policy directly” - RLHF book

Given that the training procedure for DPO is based upon reward modeling, it’s not immediately obvious that training an LLM in this way will actually yield an optimal policy. Could our resulting model have an accurate implicit reward function but still not generate high-quality completions? In this section, we prove this should not be the case. If we train a model to match the implicit preference distribution of the optimal policy, then the resulting policy is also guaranteed to be optimal! Put simply, DPO indirectly provides us with a policy that is comparable in quality to one derived via training with RLHF, making it a valid preference tuning alternative that is significantly less complex than techniques like PPO-based RLHF.

Why does DPO work?

Gradient of DPO loss function

To gain a deeper understanding of DPO and why it works well, we can look at the structure of the gradient for DPO’s loss function; see above10. There are three key terms in this expression, colored in red (with part of the term in orange), blue and green for clarity. The purpose for each of these terms is as follows:

The first (red) term is a weight—falling in the range [0, 1] due to the sigmoid function—that increases as the implicit reward of the rejected completion increases relative to that of the chosen completion. In other words, this term assigns a higher weight to implicit reward estimates that are wrong.
The second (blue) term is the positive gradient of the likelihood for the chosen completion with respect to the LLM’s parameters, which serves the purpose of increasing the likelihood for the chosen completion.
The third (green) term is the negative gradient of the likelihood for the rejected completion with respect to the LLM’s parameters, which serves the purpose of decreasing the likelihood for the rejected completion.

These terms work together to simultaneously i) increase the likelihood of chosen completions and ii) decrease the likelihood of rejected completions, where extra emphasis (i.e., a larger update to our LLM’s parameters) is placed upon cases where the implicit reward estimate assigned by our LLM is incorrect.

“Examples are weighed by how much higher the implicit reward model rates the dispreferred completions scaled by beta… how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint.” - from [1]

Weighting coefficient. Authors in [1] observe that all three sub-components of DPO’s loss gradient are necessary for the algorithm to work well. Notably, if we remove the first weighting term from this gradient—creating a gradient that uniformly increases the likelihood of all chosen completions and decreases the likelihood of all rejected completions—the resulting policy is low-quality and even tends to completely degenerate when generating text; see below. Such a training algorithm is called unlikelihood training and has been explored in the past [5]. The simple weighting term added to the loss gradient by DPO completely transforms this approach, making it capable of performing high-quality LLM alignment.

LLMs trained with unlikelihood training tend to degenerate (from [1])

Implementing DPO from Scratch

Although the derivation of DPO is complex, the technique is actually quite simple to use practically. In fact, DPO played a huge role in democratizing research on LLM post-training for those outside of top labs [3]. Algorithms like PPO-based RLHF are harder to tune and require significant compute resources. In contrast, DPO uses a standard classification (or ranking) loss with no RL and only keeps two copies of the model—instead of four—throughout the training process.

Standard DPO training pipeline

DPO training pipeline. The standard training process with DPO is depicted above. We begin the process with a diverse set of prompts that capture the use case(s) for which we are training our model. From here, we use our reference policy to generate pairs of completions for each prompt and have human raters provide preference annotations for each pair. Once this preference dataset is available, we perform maximum likelihood estimation by training our model to minimize the DPO loss that we derived earlier over the preference dataset.

Computing the loss for DPO in PyTorch (from [1])

Loss implementation. We can pretty easily implement the loss function for DPO in PyTorch—it is just a ranking loss applied over implicit rewards derived from the current and reference policies. The example implementation of the loss from [1] is copied below for reference, where we see that the loss is computed by:

Getting the log probabilities assigned to each completion—both chosen and rejected—by the current policy and the reference policy.
Computing the probability ratio between chosen and rejected completions for both the current policy and the reference policy.
Using the above probability ratios to construct the final DPO loss.

Handling offline preference data. DPO is fundamentally an offline preference learning algorithm—we are optimizing our model over a static preference dataset. In the pipeline outlined above, we use our reference model to generate completions in our preference dataset. In most practical applications, however, this may not be the case. As a practitioner, we may simply download a preference dataset like UltraFeedback [4] online and train our model over this static dataset using DPO. In such cases, the actual reference model is unknown and may be different from the reference model we used in DPO training, creating a distribution shift.

“Since the preference datasets are sampled using the SFT model, we initialize the reference policy to the SFT model whenever available. However, when the SFT model is not available, we initialize the reference policy by maximizing likelihood of preferred completions. This procedure helps mitigate the distribution shift between the true reference distribution and the reference policy used by DPO.” - from [1]

To minimize this distribution shift and ensure that the actual reference model aligns well with the completions present in our preference dataset, authors in [1] recommend the procedure depicted below. In this procedure, we first perform supervised finetuning of our reference model on the chosen completions in the preference dataset, then further train this model with DPO afterwards. This preliminary SFT training stage ensures the reference policy in DPO is not too different from the true reference policy used to create the preference dataset.

Mitigating distribution shift from offline preference data in DPO

The last consideration for implementing DPO is correctly setting the β hyperparameter, which controls the amount that the trained policy can differ from the reference policy. Remember, β is the weight by which we multiply the KL constraint in the RLHF objective, which controls the strength of preference alignment in DPO—lower β values mean that the model is updated more aggressively to adapt to observed preference in the data. Usually, β is set to a value in the range [0, 1], where lower values are more common. For example, β = 0.1 is a popular choice, though authors in [1] explore both β = 0.1 and β = 0.5.

Full DPO example. One of the easiest ways to finetune your own LLM with DPO is by using the DPOTrainer in the HuggingFace TRL package. To perform a DPO training run, you just need to i) load a preference dataset like UltraFeedback; ii) choose a model / tokenizer (e.g., a smaller model like Qwen3-0.6B is great if we don’t have big GPUs); and iii) execute the DPO trainer as shown in the code below.

from trl import DPOConfig, DPOTrainer

# load model and data
model = 
tokenizer = 
train_dataset = 

# configure DPO training process training_args = DPOConfig(output_dir="./dpo_logs/")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)

# execute DPO training
# run the below command to execute this script
# > accelerate launch

Deep (Learning) Focus

RL Scaling Laws for LLMs

Scaling Law Fundamentals

What is a power law?

Neural Scaling Laws [13] and Chinchilla [14]

Scaling Laws beyond Pretraining

Background on Reinforcement Learning

Group Relative Policy Optimization (GRPO) [4]

Recent GRPO Variants

Regularization for RL

Scaling the RL Training Process

The Art of Scaling Reinforcement Learning Compute for LLMs [1]

Scaling Behaviors of LLM RL Post-Training [2]

Optimally Scaling Sampling Compute for LLM RL [3]

Comparing RL and Pretraining Scaling Laws

New to the newsletter?

Bibliography

The Anatomy of an LLM Benchmark

Dissecting Popular LLM Benchmarks

Massive Multitask Language Understanding (MMLU) [1]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark [4]

Beyond the Imitation Game Benchmark (BIG-Bench) [5]

IFEval [8] and IFBench [9]

AlpacaEval [13]

Math Evaluation

Iteratively Improving a Benchmark

Advanced Benchmarking for LLMs

tinyBenchmarks: Evaluating LLMs with Fewer Examples [11]

Fluid Language Model Benchmarking [12]

DatBench: Discriminative, Faithful, and Efficient VLM Evaluations [10]

Keys to Creating a Useful Benchmark

New to the newsletter?

Bibliography

Applying Statistics to LLM Evaluations

Basic Statistics for LLM Evaluations

Random Variables and Estimators

Standard Error and Sample Means

Law of Large Numbers and the Central Limit Theorem (CLT)

Confidence Intervals

A Statistical Approach to LLM Evaluations [1]

Standard Errors and the CLT

Clustered Errors

Reducing Variance

Model Comparisons

Practical Implementation

More Topics to Explore in Statistics

Power Analysis for LLM Evals [1]

Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints [2]

Quantifying Variance in Evaluation Benchmarks [3]

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [4]

Key Takeaways

New to the newsletter?

Bibliography

Rubric-Based Rewards for RL

From LLM-as-a-Judge to Rubrics

LLM-as-a-Judge

LLM Evaluation with Rubrics

RL with Verifiable (and Non-Verifiable) Rewards

Using Rubrics for RL

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [1]

Reinforcement Learning with Rubric Anchors [2]

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [3]

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [4]

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [5]

Further Reading

Conclusion

New to the newsletter?

Bibliography

Continual Learning with RL for LLMs

Basics of Continual Learning

Catastrophic Forgetting

Experimental Frameworks for Continual Learning

Common Techniques for Continual Learning

Continual Learning for LLMs

More on SFT and RL

Reinforcement Finetuning Naturally Mitigates Forgetting in Continual Post-Training [1]

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting [2]

RL’s Razor: Why Online RL Forgets Less [3]

Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting [6]

Does RL Generalize Well?