KL-Regularized Reinforcement Learning is Designed to Mode Collapse
Abstract
Classical intuitions cast minimizing reverse KL as "mode seeking" and forward KL as "mass covering". In KL-regularized reinforcement learning, however, the regularizer determines both the target distribution's shape and the divergence being implicitly minimized, making its role more nuanced than simply inducing generic mode-seeking or mass-covering behaviour. Specifically, the target distribution is defined jointly by the reward function, the reference model, the type of regularizer, and the regularization strength. We show that under common settings—such as low regularization strength and equal verifiable rewards—both forward and reverse KL regularization tend to specify target distributions whose mass concentrates on a single high-reward region. Thus, the objective itself by construction induces diversity collapse, regardless of the policy optimization algorithm used.
Building on this perspective, we introduce a simple and scalable modification that rescales rewards to induce target distributions assigning substantial probability across all high-reward regions. This yields a principled objective that maintains high solution quality while achieving broad reward-mode coverage. Empirically, this approach improves post-training diversity and performance for Large Language Models and Chemical Language Models, and is effective with either forward or reverse KL regularization, whereas using either naively fails to do so.
RL as Distribution Matching
Many works observe that RL leads to diversity collapse. We identify:
- A simple reason why this happens
- A principled research agenda to fundamentally fix it—via characterizing the target distribution
We know the optimal solution of KL-regularized RL is a Gibbs distribution:
$\pi^*(y) = G_\beta (y) \propto \pi_{\text{ref}}(y) \exp\left(\frac{R(y)}{\beta}\right)$
The regularized policy gradient is exactly a reverse-KL gradient toward this target:
$\nabla \text{KL}(\pi \| G_\beta)$
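To see why, write the KL-regularized objective as $J(\pi) = \mathbb{E}_{y \sim \pi}[R(y)] - \beta\, \text{KL}(\pi \,\|\, \pi_{\text{ref}})$. Expanding the reverse KL to the Gibbs target and using $\log G_\beta(y) = \log \pi_{\text{ref}}(y) + R(y)/\beta - \log Z$:
$\beta\, \text{KL}(\pi \,\|\, G_\beta) = \beta\, \text{KL}(\pi \,\|\, \pi_{\text{ref}}) - \mathbb{E}_{y \sim \pi}[R(y)] + \beta \log Z = -J(\pi) + \beta \log Z$
Since the normalization constant $Z$ does not depend on the policy, maximizing $J$ is exactly minimizing the reverse KL to $G_\beta$, and $\nabla J(\pi) = -\beta\, \nabla \text{KL}(\pi \,\|\, G_\beta)$.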
Takeaway: Regularized RL is just distribution matching to the target distribution $G_\beta$.
What is the Shape of the Target Distribution?
Key insight: We can exactly calculate the relative likelihood of any two samples in the optimal distribution (normalization constant cancels out).
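Concretely, for any two samples $y_1$ and $y_2$, the partition function cancels in the Gibbs form and we get:
$\frac{G_\beta(y_1)}{G_\beta(y_2)} = \frac{\pi_{\text{ref}}(y_1)}{\pi_{\text{ref}}(y_2)} \exp\left(\frac{R(y_1) - R(y_2)}{\beta}\right)$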
This leads to a number of interesting observations about why mode collapse happens.
Why Mode Collapse Happens
Problem 1: Exponential Probability Differences
Linear differences in rewards lead to exponential differences in probabilities. With a low regularization coefficient (e.g., $\beta = 0.001$), even a tiny reward gap makes the target concentrate on a single solution: no diversity.
Note: Entropy regularization has this problem as well.
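As a toy illustration (the rewards and reference probabilities below are made-up numbers, not from the paper), here is the Gibbs target computed over a handful of discrete candidates; at $\beta = 0.001$ essentially all mass lands on the single highest-reward candidate:

```python
import numpy as np

# Hypothetical toy example: 4 candidate solutions with near-identical rewards
# and comparable support under the reference model.
rewards = np.array([1.00, 0.99, 0.98, 0.95])
pi_ref = np.array([0.30, 0.30, 0.25, 0.15])

def gibbs_target(rewards, pi_ref, beta):
    """Target G_beta(y) proportional to pi_ref(y) * exp(R(y) / beta), normalized."""
    # Subtract the max reward for numerical stability (the shift cancels after normalizing).
    logits = np.log(pi_ref) + (rewards - rewards.max()) / beta
    weights = np.exp(logits)
    return weights / weights.sum()

for beta in [1.0, 0.1, 0.001]:
    print(beta, np.round(gibbs_target(rewards, pi_ref, beta), 4))

# beta = 1.0   -> mass stays spread out, close to pi_ref
# beta = 0.001 -> a 0.01 reward gap becomes an exp(10) ~ 22,000x probability gap,
#                 so virtually all mass sits on the single best candidate
```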
Problem 2: Reference Policy Bias Persists
If two samples have the same reward (common in RL with verifiable rewards, RLVR), KL-regularized RL does not change their relative probabilities from those under $\pi_\text{ref}$ (the reference policy).
Lowering the regularization coefficient therefore does nothing to increase the probability of low-support solutions relative to high-support ones.
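In terms of the ratio above: when $R(y_1) = R(y_2)$ the exponential term equals one for every $\beta$, so
$\frac{G_\beta(y_1)}{G_\beta(y_2)} = \frac{\pi_{\text{ref}}(y_1)}{\pi_{\text{ref}}(y_2)}$
no matter how small the regularization coefficient is made.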
Key insight: Diversity collapse is not a "quirk" of RL; it is a consequence of how we defined the objective. Collapse is the natural result of correctly matching the optimal target distribution.
To prevent collapse, we can simply define a better target distribution to match to!
MARA: Mode Anchored Reward Augmentation
The probability ratios give us a way to define better target distributions. With a bit of algebra, one can see that the following condition specifies when two samples receive the same probability under the target distribution of KL-regularized RL.
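Setting the Gibbs ratio above equal to one, samples $y_1$ and $y_2$ receive the same target probability exactly when:
$R(y_1) + \beta \log \pi_{\text{ref}}(y_1) = R(y_2) + \beta \log \pi_{\text{ref}}(y_2)$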
We ensure this condition is met for all "good" samples. In practice, this is achieved by simple augmentations to the reward.
Intuitively: We increase the reward for good samples with low $\pi_\text{ref}$, by referencing another good sample with high $\pi_\text{ref}$.
We call this "Mode Anchored Reward Augmentation" (MARA).
The effect of MARA is a target distribution that spreads mass uniformly across all good solutions, and stays close to the reference where there are none.
Optimizing this objective naturally yields a multimodal policy distribution.
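Below is a minimal sketch of what such a reward augmentation could look like, assuming binary verifiable rewards and access to per-sample reference log-probabilities; the specific anchoring rule (anchor to the good sample with the highest $\pi_\text{ref}$ in the batch) and the helper name `mara_augment` are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import numpy as np

def mara_augment(rewards, ref_logprobs, beta, good_threshold=1.0):
    """Illustrative reward augmentation: give every 'good' sample the same
    target probability as the good sample the reference model likes most.

    rewards:      per-sample rewards (assumed verifiable, e.g. 0/1)
    ref_logprobs: per-sample log pi_ref(y)
    beta:         KL regularization coefficient
    """
    rewards = np.asarray(rewards, dtype=float)
    ref_logprobs = np.asarray(ref_logprobs, dtype=float)
    good = rewards >= good_threshold
    if not good.any():
        return rewards  # no good samples: leave rewards (and the target) unchanged

    # Anchor: the good sample with the highest reference probability.
    anchor_logprob = ref_logprobs[good].max()

    # Equal target probability requires
    #   R'(y) + beta * log pi_ref(y) = R(anchor) + beta * log pi_ref(anchor),
    # so boost low-support good samples by beta * (log pi_ref(anchor) - log pi_ref(y)).
    augmented = rewards.copy()
    augmented[good] += beta * (anchor_logprob - ref_logprobs[good])
    return augmented

# Toy usage: two correct answers, one of which the reference model rarely produces.
print(mara_augment(rewards=[1.0, 1.0, 0.0],
                   ref_logprobs=[-2.0, -9.0, -3.0],
                   beta=0.05))
# -> roughly [1.0, 1.35, 0.0]: the low-support correct answer gets a boosted reward,
#    so both correct answers end up with equal mass under the Gibbs target.
```

Anchoring to the highest-support good sample means rewards are only ever increased, matching the intuition above of raising the reward of low-$\pi_\text{ref}$ good samples toward a high-$\pi_\text{ref}$ one.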
Empirical Results
We test MARA across LLMs (verifiable + non-verifiable tasks) and a chemical generative model (molecular design).
MARA is a drop-in change that improves diversity with no loss in quality, exactly as the theory predicts.
Analyzing Forward KL Regularization
We also analyzed the forward-KL regularized case. Interestingly:
- The gradient is not a forward-KL gradient
- The implied target distribution is completely different (and it need not be mass covering!)
Takeaway: We cannot naively use intuitions about forward-KL for forward-KL regularized RL.
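For intuition only, here is a first-order-conditions sketch (assuming the forward-KL penalty means $\text{KL}(\pi_{\text{ref}} \,\|\, \pi)$; this is our simplification, not the paper's full analysis). Maximizing $\mathbb{E}_{y \sim \pi}[R(y)] - \beta\, \text{KL}(\pi_{\text{ref}} \,\|\, \pi)$ over the simplex gives stationary points of the form:
$\pi^*(y) \propto \frac{\pi_{\text{ref}}(y)}{\lambda - R(y)}$
where $\lambda$ is fixed by normalization and must exceed every attained reward. As $\beta \to 0$, $\lambda$ approaches the maximum reward and the mass again piles onto the highest-reward samples, so the usual "mass covering" intuition need not apply.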
Summary
We identified diversity collapse in RL not as a bug, but as the natural consequence of correctly optimizing the KL-regularized objective, whose optimal target distribution is highly concentrated.
To address this, we introduced MARA: a simple, scalable modification that instead defines a multimodal target distribution, directly optimizing for diversity in a principled way.
It's all just distribution matching.
Viewing reinforcement learning through the lens of its target distribution gives us a more principled way to reason about phenomena like exploration and entropy collapse. This opens up many opportunities for future work!
BibTeX
@inproceedings{
gxchen2026mara,
title={KL-Regularized Reinforcement Learning is Designed to Mode Collapse},
author={Anthony GX-Chen and Jatin Prakash and Jeff Guo and Rob Fergus and Rajesh Ranganath},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://arxiv.org/abs/2510.20817}
}