Can Large Reasoning Models Self-Train?

Reinforcement learning from verifiable rewards significantly enhances the reasoning abilities of large language models (LLMs), notably in math and coding (OpenAI et al., DeepSeek-AI et al.). However, it relies on human-created ground-truth verifiers, which makes generating reward signals for every problem costly and restrictive. In this work, we ask the following questions:

  • (1) Can reasoning models self-train using only their own feedback, without access to ground-truth labels?
  • (2) Does self-training match the performance of RL training on ground-truth labels?
  • (3) Can self-training be sustained indefinitely, or is improvement ultimately limited?
  • (4) What strategies can effectively sustain model self-training?


(1) Self-Rewarded Training (SRT)

Motivated by prior work on consistency-based self-improvement, we introduce Self-Rewarded Training (SRT), a simple yet effective self-training reinforcement learning method that uses consistency across multiple model-generated solutions to estimate correctness during RL training, providing a self-supervision signal without any labeled data.
[Teaser figure: overview of Self-Rewarded Training.]
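To make the reward concrete, here is a minimal sketch of the consistency-based self-reward in Python, assuming exact string matching on extracted final answers (in practice a math-equivalence check would be used); the function and variable names are illustrative rather than taken from the released code.

    from collections import Counter

    def srt_rewards(answers):
        # Self-reward a group of sampled answers for one prompt: each of the N
        # rollouts gets reward 1 if its extracted final answer agrees with the
        # majority answer of the group, and 0 otherwise.
        majority_answer, _ = Counter(answers).most_common(1)[0]
        return [1.0 if a == majority_answer else 0.0 for a in answers]

    # Example: 4 rollouts, 3 of which agree on "42".
    print(srt_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]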


(2) SRT Matches RL Performance at Early Training Stages

We empirically demonstrate that, at early training stages, SRT achieves performance rivaling that of standard reinforcement-learning methods trained explicitly on gold-standard answers. Test datasets: AMC, AIME24, AIME25.
[Figure: self_training_performance. SRT vs. RL with ground-truth answers on AMC, AIME24, and AIME25.]
However, we find that the performance eventually collapses; see, e.g., training on DAPO in the rightmost panel (more on this below).


(3) Self-Training is Bound to Collapse

We analyze the training dynamics of SRT when training on the challenging DAPO dataset.
[Figure: srt_training_dynamics. SRT training dynamics on the DAPO dataset.]
These findings indicate that the model learns to maximize its self-assigned reward by producing consistent (second panel above) but incorrect (leftmost panel above) answers. Manual inspection confirms this: after collapse, model outputs degenerate into random token sequences with a fixed, prompt-independent answer (e.g., 'the solution is 1'). There is a simple yet precise theoretical justification for this behavior:

The reinforcement learning optimization problem defined by the SRT objective
explicitly encourages consistency across outputs, independently of correctness.

Thus, the optimal policy under this objective degenerates to producing the same answer regardless of the input, artificially maximizing reward. Continued self-training on this proxy naturally drives the model toward this trivial solution, especially when it is simpler than solving the actual task.
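In our own notation (not taken verbatim from the paper), the argument can be written as a short derivation:

    % SRT proxy reward for rollout y_i on prompt x, with y_1, ..., y_n ~ pi(.|x):
    \[
      r(y_i \mid x) \;=\; \mathbb{1}\!\left[\operatorname{ans}(y_i) = \operatorname{maj}\bigl(\operatorname{ans}(y_1),\dots,\operatorname{ans}(y_n)\bigr)\right]
    \]
    % Expected self-reward maximized by SRT:
    \[
      J_{\mathrm{SRT}}(\pi) \;=\; \mathbb{E}_{x}\,\mathbb{E}_{y_{1:n}\sim\pi(\cdot\mid x)}\!\left[\frac{1}{n}\sum_{i=1}^{n} r(y_i \mid x)\right] \;\le\; 1.
    \]
    % A policy that emits one fixed answer for every prompt makes all n answers
    % agree, so every rollout receives reward 1 and the upper bound is attained,
    % without solving a single problem correctly.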


(4) Mitigation Strategies can be Effective

We propose strategies to mitigate reward hacking, laying the groundwork for effective future approaches to sustaining continual model improvement.


(i) Early Stopping: A small validation set can reliably detect peak model performance and prevent collapse during self-training. Peak performance occurs at nearly the same point across all held-out sets, so any of them would be effective for early stopping.

[Figure: srt_early_stopping. Held-out performance during SRT, used for early stopping.]
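For illustration, a minimal sketch of the early-stopping loop, assuming a PyTorch-style policy object; `srt_update` and `evaluate_accuracy` are hypothetical placeholders for the self-rewarded RL step and the held-out evaluation, and the hyperparameter values are arbitrary.

    import copy

    def self_train_with_early_stopping(policy, train_prompts, heldout_prompts,
                                       eval_every=50, patience=3, max_steps=5000):
        # Run SRT-style updates, but stop and roll back once held-out accuracy
        # stops improving; this catches the performance peak before collapse.
        best_acc, best_step, stale = -1.0, 0, 0
        best_weights = copy.deepcopy(policy.state_dict())

        for step in range(1, max_steps + 1):
            srt_update(policy, train_prompts)          # one self-rewarded RL step
            if step % eval_every == 0:
                acc = evaluate_accuracy(policy, heldout_prompts)
                if acc > best_acc:                     # new peak: snapshot the weights
                    best_acc, best_step, stale = acc, step, 0
                    best_weights = copy.deepcopy(policy.state_dict())
                else:                                  # no improvement on the held-out set
                    stale += 1
                    if stale >= patience:              # stop before collapse sets in
                        break

        policy.load_state_dict(best_weights)           # restore the peak checkpoint
        return policy, best_step, best_acc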


(ii) Self-Training with Offline-Generated Labels: An effective approach is to generate pseudo-labels from a fixed earlier checkpoint rather than from the evolving policy. Doing so stabilizes training while achieving performance comparable to SRT.

[Figure: srt_offline_generated_data. SRT with pseudo-labels generated offline from a fixed checkpoint.]
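A sketch of this variant under the same assumptions: `sample_answers` is a hypothetical helper that draws rollouts and extracts final answers, the pseudo-labels are computed once from the frozen checkpoint before training starts, and answers are compared by exact match.

    from collections import Counter

    def build_offline_pseudo_labels(frozen_policy, prompts, n_samples=16):
        # Majority-vote pseudo-labels from a fixed checkpoint. Because the labels
        # are generated once and never updated, the reward target cannot drift
        # along with the evolving policy.
        pseudo_labels = {}
        for prompt in prompts:
            answers = sample_answers(frozen_policy, prompt, n=n_samples)
            pseudo_labels[prompt] = Counter(answers).most_common(1)[0][0]
        return pseudo_labels

    def offline_reward(prompt, answer, pseudo_labels):
        # Reward against the frozen pseudo-label instead of the live majority.
        return 1.0 if answer == pseudo_labels[prompt] else 0.0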


(iii) Self-Training with Curriculum Learning: We hypothesize that model collapse occurs more rapidly when training on more challenging datasets, a conjecture that aligns with our empirical findings. The intuition is that, on a more challenging dataset, it is easier for the model to abandon its pretrained knowledge and optimize for self-consistency than to genuinely learn to solve the underlying task. We leverage this hypothesis to implement a curriculum learning strategy: we identify the 'easiest' subset of the DAPO dataset according to (a) the pass rate and (b) the frequency of the majority vote (see the paper for more details); a sketch of this selection step is given below.

[Figure: srt_curriculum. SRT on curriculum subsets of DAPO selected by pass rate and majority-vote frequency.]
Performance on these curriculum subsets reaches levels comparable to standard RL training with ground-truth labels on the entire DAPO dataset. These promising results suggest that curriculum strategies may further extend the benefits of SRT, opening exciting avenues for future investigation.
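For concreteness, a sketch of the data-selection step used in (iii), with the same hypothetical helpers as above; note that the pass-rate criterion requires a verifier during selection, whereas majority-vote frequency is fully label-free.

    from collections import Counter

    def select_easiest_subset(policy, prompts, n_samples=16, keep_fraction=0.25,
                              criterion="majority_frequency"):
        # Rank prompts by an easiness proxy and keep the top fraction.
        #   "pass_rate":          fraction of sampled answers judged correct
        #                         (needs a verifier, so it uses labels);
        #   "majority_frequency": fraction of sampled answers agreeing with the
        #                         majority answer (needs no labels at all).
        # `sample_answers` and `is_correct` are hypothetical helpers.
        scored = []
        for prompt in prompts:
            answers = sample_answers(policy, prompt, n=n_samples)
            if criterion == "pass_rate":
                score = sum(is_correct(prompt, a) for a in answers) / n_samples
            else:
                score = Counter(answers).most_common(1)[0][1] / n_samples
            scored.append((score, prompt))

        scored.sort(key=lambda pair: pair[0], reverse=True)   # easiest first
        keep = int(len(prompts) * keep_fraction)
        return [prompt for _, prompt in scored[:keep]]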


Related Literature:

Several works explore self-training LLMs, some of which are concurrent. A non-exhaustive list includes:


BibTeX

 
      @misc{shafayat2025largereasoningmodelsselftrain,
        title={Can Large Reasoning Models Self-Train?}, 
        author={Sheikh Shafayat and Fahim Tajwar and Ruslan Salakhutdinov and Jeff Schneider and Andrea Zanette},
        year={2025},
        eprint={2505.21444},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2505.21444}, 
      }