KL-Regularized Reinforcement Learning is Designed to Mode Collapse

New York University · EPFL
ICLR 2026
Vanilla KL-regularized RL targets mode-collapsed distributions. MARA achieves uniform coverage over all reward modes.

TL;DR: KL/entropy regularized RL often provably causes mode collapse. But there is a principled fix.

Abstract

Classical intuitions cast minimizing reverse KL as "mode seeking" and forward KL as "mass covering". In KL-regularized reinforcement learning, however, the regularizer determines both the target distribution's shape and the divergence being implicitly minimized, making its role more nuanced than simply inducing generic mode-seeking or mass-covering behaviour. Specifically, the target distribution is defined jointly by the reward function, the reference model, the type of regularizer, and the regularization strength. We show that under common settings—such as low regularization strength and equal verifiable rewards—both forward and reverse KL regularization tend to specify target distributions whose mass concentrates on a single high-reward region. Thus, the objective itself by construction induces diversity collapse, regardless of the policy optimization algorithm used.

Building on this perspective, we introduce a simple and scalable modification that rescales rewards to induce target distributions assigning substantial probability across all high-reward regions. This yields a principled objective that maintains high solution quality while achieving broad reward-mode coverage. Empirically, this approach improves post-training diversity and performance for Large Language Models and Chemical Language Models, and is effective with either forward or reverse KL regularization, while using either naively fails.

RL as Distribution Matching

Many works observe that RL leads to diversity collapse. We identify:

  1. A simple reason why this happens
  2. A principled research agenda to fundamentally fix it—via characterizing the target distribution

We know the optimal solution of KL-regularized RL is a Gibbs distribution:

$\pi^*(y) = G_\beta (y) \propto \pi_{\text{ref}}(y) \exp\left(\frac{R(y)}{\beta}\right)$

The regularized policy gradient is exactly a reverse-KL gradient toward this target:

$\nabla \text{KL}(\pi \| G_\beta)$
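
One can check that the KL-regularized objective is, up to a constant that does not depend on $\pi$, a single reverse-KL term:

$\mathbb{E}_{\pi}[R(y)] - \beta\,\text{KL}(\pi \,\|\, \pi_{\text{ref}}) = -\beta\,\text{KL}(\pi \,\|\, G_\beta) + \beta \log Z_\beta$

where $Z_\beta = \sum_y \pi_{\text{ref}}(y)\exp\left(R(y)/\beta\right)$ is the normalization constant of $G_\beta$.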

Takeaway: Regularized RL is just distribution matching to the target distribution $G_\beta$.

Policy optimization under reverse-KL regularization minimizes a reverse-KL divergence to the target distribution $G_\beta$.

What is the Shape of the Target Distribution?

Key insight: We can exactly calculate the relative likelihood of any two samples in the optimal distribution (normalization constant cancels out).

Probability ratio proposition
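
Concretely, for any two samples $y_1$ and $y_2$, the Gibbs form above gives

$\frac{G_\beta(y_1)}{G_\beta(y_2)} = \frac{\pi_{\text{ref}}(y_1)}{\pi_{\text{ref}}(y_2)} \exp\left(\frac{R(y_1) - R(y_2)}{\beta}\right)$

so the ratio depends only on the reward gap, the reference probability ratio, and $\beta$.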

This leads to a number of interesting observations about why mode collapse happens.

Why Mode Collapse Happens

Problem 1: Exponential Probability Differences

Linear differences in rewards lead to exponential differences in probabilities. With a low KL regularization strength (e.g., $\beta = 0.001$), the target effectively collapses to a single solution: no diversity.
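
As a concrete (hypothetical) illustration: with $\beta = 0.001$, two samples with equal reference probability whose rewards differ by only $0.1$ have a target probability ratio of $\exp(0.1/0.001) = \exp(100) \approx 10^{43}$, so the lower-reward sample is effectively never generated.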

Note: Entropy regularization has this problem as well, since maximizing entropy is equivalent to minimizing KL to a uniform reference policy.

Same reference, different rewards lead to exponential probability differences

Problem 2: Reference Policy Bias Persists

If two samples have the same reward (common in RLVR), RL does not change their relative probabilities from $\pi_\text{ref}$ (reference policy).

Lowering the regularization coefficient does not increase the probability of low-support solutions relative to high-support ones.
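
In symbols: if $R(y_1) = R(y_2)$, the probability ratio above reduces to

$\frac{G_\beta(y_1)}{G_\beta(y_2)} = \frac{\pi_{\text{ref}}(y_1)}{\pi_{\text{ref}}(y_2)}$

for every $\beta > 0$, so the reference-policy bias survives no matter how weak the regularization.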

Different reference probabilities, same reward - bias persists

Key insight: Diversity collapse is not a "quirk" of RL; it is a consequence of how we defined the objective. Collapse is the natural result of correctly matching the optimal target distribution.

RL as distribution matching to target distribution

To prevent collapse, we can simply define a better target distribution to match!

MARA: Mode Anchored Reward Augmentation

The probability ratios give us a way to define better target distributions. With a bit of algebra, one can derive the condition under which two samples have the same probability under the KL-regularized RL target distribution.

The difference in rewards and the difference in reference log-probabilities between two samples fully determine their relative probability under the target distribution $G_\beta$.
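
Written out with the probability ratio from above, $G_\beta(y_1) = G_\beta(y_2)$ holds exactly when

$R(y_1) - R(y_2) = \beta \left( \log \pi_{\text{ref}}(y_2) - \log \pi_{\text{ref}}(y_1) \right)$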

We ensure this condition is met for all "good" samples. In practice, this is achieved by simple augmentations to the reward.

Intuitively, we increase the reward of good samples with low $\pi_\text{ref}$ by anchoring them to another good sample with high $\pi_\text{ref}$.

We call this "Mode Anchored Reward Augmentation" (MARA).

MARA algorithm
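
For intuition, here is a minimal sketch of a batch-level reward augmentation in the spirit of MARA, built directly from the equal-probability condition above. The function name, the thresholding of "good" samples, and the use of the batch's highest-$\pi_\text{ref}$ good sample as the anchor are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mara_augment(rewards, logp_ref, beta, good_threshold=1.0):
    """Sketch of a MARA-style reward augmentation for one batch.

    For every "good" sample (reward >= good_threshold), add
    beta * (anchor_logp - logp_ref[i]), where the anchor is the good
    sample with the highest reference log-probability. When good samples
    share the same original reward (as in verifiable-reward settings),
    this makes them equally probable under the Gibbs target G_beta.
    """
    rewards = np.asarray(rewards, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)

    good = rewards >= good_threshold
    if not good.any():
        # No good samples: leave the rewards, and hence the target, unchanged.
        return rewards

    # Anchor: the good sample best supported by the reference policy.
    anchor_logp = logp_ref[good].max()

    augmented = rewards.copy()
    # Raise low-support good samples' rewards to cancel the reference bias:
    # R'(y) = R(y) + beta * (log pi_ref(anchor) - log pi_ref(y)).
    augmented[good] += beta * (anchor_logp - logp_ref[good])
    return augmented
```

Plugging the augmented reward back into the probability ratio shows that the $\pi_\text{ref}$ terms cancel between any good sample and the anchor, which is what produces the uniform-over-modes target described next.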

The effect of MARA is a target distribution that puts uniform mass wherever there are good solutions and sticks close to the reference where there are none.

Optimizing this objective naturally yields a multimodal policy distribution.

MARA target distribution illustration

Empirical Results

We test MARA across LLMs (verifiable + non-verifiable tasks) and a chemical generative model (molecular design).

MARA is a drop-in change that improves diversity with no loss in quality, exactly as the theory predicts.

Empirical results showing MARA improves diversity without sacrificing quality

Analyzing Forward KL Regularization

We also analyzed the forward-KL regularized case. Interestingly:

  • The policy gradient is not a forward-KL gradient toward a fixed target
  • The implied target distribution is completely different, and it need not be mass covering (a derivation sketch follows below)
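
As a quick sketch of why (assuming forward-KL regularization means maximizing $\mathbb{E}_{\pi}[R(y)] - \beta\,\text{KL}(\pi_{\text{ref}} \,\|\, \pi)$; the paper gives the precise statement): setting the derivative of the Lagrangian to zero yields a stationary point of the form

$\pi^*(y) \propto \frac{\pi_{\text{ref}}(y)}{\lambda - R(y)}$

with $\lambda > \max_y R(y)$ fixed by normalization. This is not a Gibbs distribution, and as $\beta \to 0$ the multiplier $\lambda$ approaches $\max_y R(y)$, so mass piles up on the highest-reward samples rather than spreading over them.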

Takeaway: We cannot naively use intuitions about forward-KL for forward-KL regularized RL.

Policy optimization under forward-KL regularization does not naturally yield a multimodal policy distribution.

Summary

We identified diversity collapse in RL not as a bug, but as the natural consequence of correctly optimizing the KL-regularized objective, whose optimal target distribution is highly concentrated.

To address this, we introduced MARA: a simple, scalable modification that instead defines a multimodal target distribution, to directly optimize for diversity in a principled way.

It's all just distribution matching.

Viewing reinforcement learning through the lens of its target distribution gives us a more principled way to reason about phenomena like exploration and entropy collapse. This opens up many opportunities for future work!

BibTeX

@inproceedings{
  gxchen2026mara,
  title={KL-Regularized Reinforcement Learning is Designed to Mode Collapse},
  author={Anthony GX-Chen and Jatin Prakash and Jeff Guo and Rob Fergus and Rajesh Ranganath},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2510.20817}
}