Can Large Reasoning Models Self-Train?

1 KAIST 2 Carnegie Mellon University
* Equal Contribution

In this work, we ask the following question: can self-training, the process where a language model learns from its own judgments, be sustained within an RL framework?

Across a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach can improve not only the model's reasoning performance, but also the quality of the feedback it generates for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden performance collapse.

Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.



(1) Self-Rewarded Training (SRT)

In this work, we take the simplest possible self-reward mechanism, majority voting, and build a simple yet effective self-training method, which we call Self-Rewarded Training (SRT). SRT uses consistency across multiple model-generated solutions to estimate correctness during RL training, providing self-supervision signals without labeled data.
Figure: overview of Self-Rewarded Training (SRT).
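To make the mechanism concrete, here is a minimal Python sketch of a majority-vote pseudo-reward, assuming final answers have already been extracted from each sampled completion and that exact string match is used for agreement; it illustrates the idea rather than reproducing the paper's implementation.

    from collections import Counter

    def majority_vote_pseudo_rewards(answers):
        """Assign reward 1.0 to completions whose final answer matches the
        majority answer across the group, 0.0 otherwise (ties broken arbitrarily).

        `answers` is a list of extracted final answers, one per sampled
        completion, all for the same prompt.
        """
        majority_answer, _ = Counter(answers).most_common(1)[0]
        return [1.0 if a == majority_answer else 0.0 for a in answers]

    # Illustrative usage: sample several completions per prompt, extract their
    # final answers, and use agreement with the majority as the RL reward.
    answers = ["42", "42", "17", "42"]
    print(majority_vote_pseudo_rewards(answers))  # [1.0, 1.0, 0.0, 1.0]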


(2) Self-Training Can Improve Both Performance and Feedback Quality

Figure: self-training performance on Reasoning Gym datasets.
We investigate self-training under controlled settings on synthetic reasoning tasks from Reasoning Gym. Remarkably, SRT improves not only the mean accuracy, but also the majority-voting accuracy, which is the source of our training supervision. This improvement in the quality of the training signal drives further gains in performance: SRT outperforms a variant that uses majority votes from a fixed teacher as proxy labels.
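To make the contrast with the fixed-teacher variant concrete, the sketch below shows where the proxy label comes from in each case; the `proxy_label` helper and the toy answers are illustrative assumptions, not code from the paper.

    from collections import Counter

    def majority_answer(answers):
        """Most common extracted final answer among a group of sampled completions."""
        return Counter(answers).most_common(1)[0][0]

    def proxy_label(current_policy_answers, fixed_teacher_answers=None):
        """Return the proxy label used to score a prompt's completions.

        SRT: the label is the majority vote over answers sampled from the
        *current* policy, so label quality can improve as the policy improves.
        Fixed-teacher variant: the label is the majority vote of a frozen
        teacher's answers and stays the same throughout training.
        """
        if fixed_teacher_answers is not None:
            return majority_answer(fixed_teacher_answers)   # fixed-teacher baseline
        return majority_answer(current_policy_answers)      # SRT (online self-teacher)

    # Illustrative usage with toy extracted answers:
    print(proxy_label(["7", "7", "5"]))                   # SRT label: "7"
    print(proxy_label(["7", "7", "5"], ["5", "5", "7"]))  # fixed-teacher label: "5"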

To ensure the model understood the task and output format, we first trained it with RL using ground-truth data from the previous difficulty level before proceeding to label-free SRT training.


(3) SRT Works on Real-World Reasoning Tasks

In the early stages of training, SRT achieves performance rivaling standard reinforcement learning trained explicitly on gold-standard answers. Results for more model and training-dataset pairs are in the Appendix of the paper.
Figure: self-training performance on real-world reasoning tasks.
SRT also improves the quality of the majority votes themselves, which distinguishes our algorithm from learning from a fixed teacher's majority votes.
Figure: improvement of the majority-vote (self-teacher) quality on real-world tasks.
Note that for Llama-3.1-8B-Instruct, we use the evaluation temperature of 0 from the official model card, so majority@32 accuracy equals average@32 accuracy.
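For reference, here is a minimal sketch of how average@k and majority@k could be computed from k extracted answers per problem, assuming exact-match grading; at temperature 0 all k samples are identical, so the two metrics coincide, as noted above.

    from collections import Counter

    def average_at_k(sampled_answers, gold):
        """Fraction of the k sampled answers that match the gold answer."""
        return sum(a == gold for a in sampled_answers) / len(sampled_answers)

    def majority_at_k(sampled_answers, gold):
        """1.0 if the majority-vote answer matches the gold answer, else 0.0."""
        majority, _ = Counter(sampled_answers).most_common(1)[0]
        return float(majority == gold)

    # Toy example with k = 4 samples for one problem:
    samples, gold = ["3", "3", "8", "3"], "3"
    print(average_at_k(samples, gold))   # 0.75
    print(majority_at_k(samples, gold))  # 1.0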


(4) Can Self-Improvement Be Sustained Indefinitely?

SRT has the potential to achieve multi-level improvement on Reasoning Gym tasks. In the following figure, the Qwen3-4B-Base model climbs progressively more difficult tasks without ground-truth labels via a simple curriculum strategy: we train an earlier level's final checkpoint with SRT on the next difficulty level.
Figure: multi-level improvement on Reasoning Gym tasks.
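A minimal sketch of this curriculum loop, where `srt_train` and `make_tasks` are hypothetical stand-ins for the SRT training procedure and Reasoning Gym task generation:

    def curriculum_srt(base_checkpoint, levels, srt_train, make_tasks):
        """Run SRT level by level: each level starts from the final checkpoint
        of the previous level and trains, label-free, on the next difficulty.

        `srt_train(checkpoint, tasks)` and `make_tasks(level)` are hypothetical
        stand-ins for SRT training and Reasoning Gym task generation.
        """
        checkpoint = base_checkpoint
        for level in levels:              # increasing difficulty levels
            tasks = make_tasks(level)     # unlabeled problems at this level
            checkpoint = srt_train(checkpoint, tasks)
        return checkpoint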
But training longer with SRT leads to reward hacking and model collapse on real-world datasets:
Figure: SRT training dynamics on real-world datasets, showing reward hacking and eventual collapse.
While investigating this sudden model collapse, we find that the model learns to maximize its self-assigned reward by producing consistent (second graph below) but incorrect answers (leftmost graph below). The optimal policy under this objective thus degenerates to producing the same answer regardless of input, maximizing the reward artificially. Continued self-training on this proxy naturally drives the model toward this trivial solution, especially when it is simpler than solving the actual task.
Figure: SRT training dynamics (answer correctness and consistency panels).
Manual inspection confirms this: after collapse, model outputs degenerate into random token sequences with a fixed, prompt-independent answer (e.g., '\boxed{1}'). See examples below:
Figure: examples of reward-hacked responses after collapse.
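One way to watch for this failure mode (our illustration, not a procedure from the paper) is to track answer consistency on training prompts, which equals the mean pseudo-reward, against accuracy on a small labeled held-out set: rising consistency together with falling held-out accuracy is the signature of the collapse described above. The helper names and thresholds below are assumptions.

    from collections import Counter

    def consistency(sampled_answers):
        """Fraction of sampled answers that agree with the majority answer;
        this equals the mean self-assigned (pseudo-)reward under SRT."""
        _, count = Counter(sampled_answers).most_common(1)[0]
        return count / len(sampled_answers)

    def collapse_warning(batch_samples, heldout_samples, heldout_gold,
                         consistency_thresh=0.95, accuracy_thresh=0.2):
        """Flag likely reward hacking: near-unanimous answers on training prompts
        combined with low majority-vote accuracy on a small labeled held-out set.

        `batch_samples` / `heldout_samples`: lists of per-prompt answer lists.
        `heldout_gold`: gold answers for the held-out prompts.
        Thresholds are illustrative and would need tuning in practice.
        """
        train_consistency = sum(map(consistency, batch_samples)) / len(batch_samples)
        heldout_acc = sum(
            Counter(s).most_common(1)[0][0] == g
            for s, g in zip(heldout_samples, heldout_gold)
        ) / len(heldout_gold)
        return train_consistency > consistency_thresh and heldout_acc < accuracy_thresh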

Conclusion

Takeaway 1

On both synthetic and real reasoning tasks, SRT improves average and majority-voting accuracies, showing ability gains beyond the base model. In particular, the improvement in majority-voting accuracy also signifies improvement in the quality of self-supervision during training, demonstrating a promising path toward self-improvement.

Takeaway 2

The question of whether self-training can be extended indefinitely has a mixed answer: under controllable difficulty (Reasoning Gym), SRT has the potential to keep improving beyond the base model on progressively more difficult tasks, whereas training on real-world math problems exhibits reward hacking. Sustained self-improvement therefore requires additional regularization measures to be effective.


Related Literature:

Several works explore self-training LLMs, some of which are concurrent. A non-exhaustive list includes:


BibTeX

 
      @misc{shafayat2025largereasoningmodelsselftrain,
        title={Can Large Reasoning Models Self-Train?}, 
        author={Sheikh Shafayat and Fahim Tajwar and Ruslan Salakhutdinov and Jeff Schneider and Andrea Zanette},
        year={2025},
        eprint={2505.21444},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2505.21444}, 
      }