Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

1Tsinghua Shenzhen International Graduate School, Tsinghua University 2University of California, Los Angeles 3Shenzhen Technology University
*Indicates Equal Contribution     Indicates Corresponding Authors
Comparative analysis graph

Figure 1: Comparative analysis of the responses of DeepSeek-R1-Distill-Qwen-7B on the simpleRL test dataset. (a) Traditional metrics for exploitation and exploration are constrained by negative coupling, leading to meandering progress for both capabilities. (b) Our metrics are mutually independent. (c) Training regularization with our metrics demonstrates stronger performance in both exploitation (small K) and exploration (large K).

Abstract

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that, at the hidden-state level, exploration and exploitation can be decoupled, opening an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.

Analysis

1. Analysis of Response-Level Metrics

The semantic space of hidden states moves beyond the exploration-exploitation trade-off toward stable enhancement. Instead of forcing a trade-off, RL training consistently enhances a model's exploitation capabilities (the rate of information gain), regardless of its baseline exploratory tendencies. This suggests exploration and exploitation can be improved simultaneously.

Effective Rank Acceleration distinguishes correct reasoning. While high exploration (ER) or high velocity of information gain (ERV) can sometimes lead to errors, the acceleration of this gain (ERA) is a robust indicator that consistently distinguishes correct and robust reasoning paths from flawed ones.
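For concreteness, below is a minimal sketch of how ER can be computed from a response's hidden-state matrix via SVD, following the standard entropy-of-singular-values definition of effective rank; the choice of layer, the tensor shapes, and any pooling are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def effective_rank(hidden_states: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective Rank (ER) of a hidden-state matrix (num_tokens x hidden_dim):
    the exponential of the Shannon entropy of the normalized singular values."""
    s = torch.linalg.svdvals(hidden_states.float())   # singular values, descending
    p = s / (s.sum() + eps)                           # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()         # Shannon entropy of the spectrum
    return torch.exp(entropy).item()                  # ER in [1, min(tokens, dim)]

# Example: ER of one response's last-layer hidden states (toy shapes).
h = torch.randn(256, 4096)   # 256 generated tokens, 4096-dim hidden states
print(effective_rank(h))
```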

Response-level metrics analysis figure
Figure 2: Response-level metrics during GRPO post-training, smoothed with a 10-step rolling window. Metrics are shown for the Overall batch, as well as for subsets of Correct and Incorrect samples. The rightmost column displays the average Critic Score (reward) and Response Length per batch.

2. Analysis of Dataset-Level Metrics

Policy optimization correlates with expanding dataset-level diversity. As a model's performance improves, the semantic diversity of its reasoning strategies across the entire dataset expands. This is shown by a strong positive correlation between validation scores and our dataset-level ER, ERV, and ERA metrics.

Effective Rank reveals refinement beyond the limits of conventional rank. Even late in training when a model seems to have a fixed number of reasoning strategies (i.e., conventional rank plateaus), a rising Effective Rank (ER) shows it is still subtly refining and optimizing its existing pathways for higher quality and efficiency.
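A small self-contained sketch (synthetic matrices, not the paper's data) of why ER can keep moving after conventional rank plateaus: two matrices below share the same conventional rank, yet ER still distinguishes how evenly that subspace is used.

```python
import torch

def er(m: torch.Tensor, eps: float = 1e-12) -> float:
    # Effective Rank: exp of the entropy of the normalized singular values.
    s = torch.linalg.svdvals(m.float())
    p = s / (s.sum() + eps)
    return torch.exp(-(p * torch.log(p + eps)).sum()).item()

torch.manual_seed(0)
d, r = 512, 64
q = torch.linalg.qr(torch.randn(d, r)).Q                   # orthonormal 512 x 64 basis

# Same conventional rank (64), but the singular values are spread differently.
flat   = q @ torch.diag(torch.ones(r)) @ q.T                # uniform spectrum
skewed = q @ torch.diag(torch.logspace(0, -3, r)) @ q.T     # decaying spectrum

for name, m in [("flat", flat), ("skewed", skewed)]:
    print(name,
          "conventional rank:", torch.linalg.matrix_rank(m).item(),
          "ER:", round(er(m), 1))
# Conventional rank plateaus at 64 for both, while ER still separates how
# uniformly the 64-dimensional subspace is exploited -- mirroring the
# late-training refinement signal described above.
```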

Dataset-level metrics analysis figure
Figure 3: Visualization of dataset-level metrics during GRPO post-training. The figure compares Traditional metrics with our proposed metrics. Also shown are the Validation Score and sample Correctness, both averaged over the validation dataset.

Methodology

We propose Velocity-Exploiting Rank-Learning (VERL), a method that moves beyond the trade-off between exploration and exploitation by directly shaping the RL advantage function using Effective Rank (ER) and Effective Rank Velocity (ERV). Instead of acting as a switch between the two capacities in a lower-dimensional space, VERL functions as a tuner that synergistically enhances both capacities in a higher-dimensional space.

Its key innovation is leveraging Effective Rank Acceleration (ERA) as a meta-control variable, a choice justified by our theoretical proof of its remarkable stability. Specifically, VERL uses ERA to create a synergistic, dual-channel incentive structure. Instead of switching between modes, it prospectively shapes the reward to simultaneously encourage exploration (via ER) to preempt overconfidence, while reinforcing exploitative gains (via ERV) to consolidate the reasoning path. This stability makes ERA a robust signal to guide training, allowing VERL to encourage exploration from productive-potential states while preventing overfitting to local optima.
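To make the dual-channel shaping idea concrete, here is a heavily simplified sketch under stated assumptions: the sigmoid gate on ERA, the group-wise normalization, the coefficient beta, and the function name shaped_advantage are illustrative choices, not VERL's published formulas.

```python
import torch

def shaped_advantage(base_adv: torch.Tensor,
                     er: torch.Tensor,    # per-response Effective Rank (exploration signal)
                     erv: torch.Tensor,   # per-response ER Velocity (exploitation signal)
                     era: torch.Tensor,   # per-response ER Acceleration (meta-signal)
                     beta: float = 0.1) -> torch.Tensor:
    """Illustrative dual-channel advantage shaping gated by ERA (not VERL's exact form)."""
    def z(x: torch.Tensor) -> torch.Tensor:
        # Normalize each signal within the rollout group, GRPO-style.
        return (x - x.mean()) / (x.std() + 1e-8)

    gate = torch.sigmoid(era)                    # ERA as meta-controller, in (0, 1)
    aux = gate * z(er) + (1.0 - gate) * z(erv)   # adaptively integrate both channels
    return base_adv + beta * aux                 # shape the base RL advantage

# Example: a GRPO-style group of 8 rollouts for one prompt (random toy values).
base_adv = torch.randn(8)
er, erv, era = torch.rand(8) * 50, torch.randn(8), torch.randn(8)
print(shaped_advantage(base_adv, er, erv, era))
```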

Method overview flowchart
Figure 4: Overview of VERL. Exploration is quantified by computing the Effective Rank (ER) of the rollout hidden states via SVD, while exploitation is captured through an EMA-smoothed first-order difference of the per-step rolling ER (Effective Rank Velocity, ERV), extended to a second-order difference (Effective Rank Acceleration, ERA). Finally, exploration and exploitation are adaptively integrated to derive the auxiliary advantage.
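A rough sketch of the ERV/ERA computation described in the caption, assuming a growing per-step prefix of hidden states and a simple EMA; the paper's exact window (prefix vs. fixed-size rolling), smoothing coefficient, and layer choice may differ.

```python
import torch

def ema(x: torch.Tensor, alpha: float = 0.9) -> torch.Tensor:
    """Exponential moving average of a 1-D sequence (smoothing coefficient is illustrative)."""
    out = torch.empty_like(x)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * x[t]
    return out

def er_series(hidden_states: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Effective Rank of the growing prefix of per-step hidden states."""
    ers = []
    for t in range(2, hidden_states.shape[0] + 1):
        s = torch.linalg.svdvals(hidden_states[:t].float())
        p = s / (s.sum() + eps)
        ers.append(torch.exp(-(p * torch.log(p + eps)).sum()))
    return torch.stack(ers)

h = torch.randn(128, 1024)      # per-step hidden states of one rollout (toy shapes)
er_t = ema(er_series(h))        # smoothed per-step ER trajectory
erv = er_t[1:] - er_t[:-1]      # first-order difference  -> Effective Rank Velocity
era = erv[1:] - erv[:-1]        # second-order difference -> Effective Rank Acceleration
print(er_t[-1].item(), erv[-1].item(), era[-1].item())
```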

Experiments

Key Findings

  • VERL Generalizes Across Diverse Benchmarks: Our method demonstrates strong and consistent performance improvements on multiple mathematical reasoning benchmarks that vary in difficulty.
  • Robustness Across RL Algorithms and Base Models: VERL shows broad applicability, successfully enhancing different base models (e.g., Llama, Qwen, Mistral) and integrating seamlessly with various RL algorithms like GRPO and PPO.
  • Simultaneous Gains in Exploration and Exploitation: The method successfully enhances both model exploitation (reflected by significant gains in Pass@1 scores) and exploration (evidenced by improved Pass@k performance), moving beyond a simple trade-off; a sketch of the standard Pass@k estimator follows this list.
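For reference, Pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); whether the paper uses exactly this estimator is an assumption, but the sketch below shows the standard computation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples
    drawn from n generated responses (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 correct responses out of 32 samples per problem.
print(pass_at_k(n=32, c=4, k=1))   # 0.125  -- exploitation-oriented metric
print(pass_at_k(n=32, c=4, k=8))   # ~0.70  -- exploration-oriented metric
```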
Pass@1 Performance Comparison
Figure 5: Performance comparison of models on mathematical reasoning benchmarks (Pass@1). “+ GRPO” and “+ PPO” denote RL fine-tuning with the GRPO and PPO frameworks, respectively. “w/ VERL” indicates incorporating our VERL into the corresponding RL method. Δ denotes the performance difference between the original RL method and its VERL variant.
Pass@k Performance Comparison
Figure 6: Performance comparison of instruction-tuned models under diverse decoding settings (Pass@k).

BibTeX

@misc{huang2025explorationexploitationtradeoffhiddenstate,
      title={Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR}, 
      author={Fanding Huang and Guanbo Huang and Xiao Fan and Yi He and Xiao Liang and Xiao Chen and Qinting Jiang and Faisal Nadeem Khan and Jingyan Jiang and Zhi Wang},
      year={2025},
      eprint={2509.23808},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.23808}, 
}