In this work, we ask the following question: can self-training — the process where a language model learns from its own judgments — be sustained within an RL framework?
Through a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach can improve not only the model's reasoning performance but also the quality of the feedback it generates for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in a sudden performance collapse.
Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
(1) Self-Rewarded Training (SRT)
In this work, we take the simplest possible self-reward mechanism, majority voting, and experiment with a simple yet effective self-training method, which we call Self-Rewarded Training (SRT). SRT uses consistency across multiple model-generated solutions to estimate correctness during RL training, providing a self-supervision signal without any labeled data.
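To make the mechanism concrete, here is a minimal sketch of the majority-vote pseudo-reward. The function name and the way answers are represented are our own illustration rather than the paper's actual implementation; in practice, final answers would be parsed from full rollouts (e.g., the contents of \boxed{...}), and these pseudo-rewards would stand in for verifier rewards in an otherwise standard RL loop.

```python
from collections import Counter

def majority_vote_pseudo_rewards(answers):
    """Binary pseudo-reward per rollout: 1.0 if the rollout's final answer
    matches the majority-vote answer across all rollouts for the same prompt,
    else 0.0. No ground-truth label is used."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Four rollouts of the same prompt: three agree on "42", so those three
# receive reward 1.0 and the dissenting rollout receives 0.0.
print(majority_vote_pseudo_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```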
(2) Self-Training Can Improve Both Performance and Feedback Quality
We investigate self-training under controlled settings on synthetic reasoning tasks from Reasoning Gym. Remarkably, SRT improves not only the mean accuracy but also the majority-voting accuracy, which is the very source of our training supervision. This improvement in the quality of the training signal drives further gains in performance: SRT outperforms a variant that uses majority votes from a fixed teacher as proxy labels.
To ensure the model understood the task and output format, we first trained it with RL using ground-truth data from the previous difficulty level before proceeding to label-free SRT training.
(3) SRT Works on Real-World Reasoning Tasks
In the early stages of training, SRT achieves performance rivaling standard reinforcement-learning methods trained explicitly on gold-standard answers. Results for additional model and training-dataset pairs are in the Appendix of the paper.
SRT also improves the quality of the majority votes themselves, which distinguishes our algorithm from simply learning from a fixed teacher's majority votes.
Note that for Llama-3.1-8B-Instruct, we use the official model-card evaluation temperature of 0; hence majority@32 accuracy is identical to average@32 accuracy.
(4) Can Self-Improvement Be Sustained Indefinitely?
SRT can achieve multi-level improvement on Reasoning Gym tasks. In the following figure, the Qwen3-4B-Base model climbs through progressively more difficult tasks without ground-truth labels via a simple curriculum strategy, where we train each level's final checkpoint with SRT on the next difficulty level.
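A minimal sketch of this curriculum loop, under our assumptions, is below; `make_reasoning_gym_tasks` and `srt_train` are hypothetical placeholders standing in for Reasoning Gym task generation and a label-free SRT training run, not functions from the paper's codebase.

```python
def make_reasoning_gym_tasks(difficulty: int) -> list[str]:
    """Placeholder: return unlabeled prompts at the given difficulty level."""
    return [f"level-{difficulty} task {i}" for i in range(3)]

def srt_train(checkpoint: str, tasks: list[str], steps: int) -> str:
    """Placeholder: run label-free SRT (RL with majority-vote pseudo-rewards)
    from `checkpoint` on `tasks` and return the resulting checkpoint."""
    return f"{checkpoint} -> srt(level_size={len(tasks)}, steps={steps})"

def curriculum_srt(base_checkpoint: str, num_levels: int, steps_per_level: int) -> str:
    """Each difficulty level starts from the previous level's final SRT checkpoint."""
    checkpoint = base_checkpoint
    for level in range(1, num_levels + 1):
        tasks = make_reasoning_gym_tasks(difficulty=level)  # no ground-truth labels
        checkpoint = srt_train(checkpoint, tasks, steps=steps_per_level)
    return checkpoint

print(curriculum_srt("Qwen3-4B-Base", num_levels=3, steps_per_level=100))
```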
However, training longer with SRT leads to reward hacking and model collapse on real-world datasets:
While investigating this sudden model collapse, we find that the model learns to maximize its self-assigned reward by producing consistent (second graph below) but incorrect answers (leftmost graph below). The optimal policy under this objective thus degenerates to producing the same answer regardless of input, maximizing the reward artificially.
Continued self-training on this proxy objective naturally drives the model toward this trivial solution, especially when it is simpler than solving the actual task.
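A toy calculation using the same majority-vote pseudo-reward sketched earlier illustrates why the constant-answer policy is so attractive to the optimizer; the answers below are illustrative, not measurements from our runs.

```python
from collections import Counter

def majority_vote_pseudo_rewards(answers):
    """Binary pseudo-reward: 1.0 for rollouts that match the majority answer."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# A policy that genuinely attempts the problem: rollouts disagree, so the
# average pseudo-reward is only 0.5 even if some answers are actually correct.
honest = ["3/7", "0.43", "3/7", "1/2"]
print(sum(majority_vote_pseudo_rewards(honest)) / len(honest))        # 0.5

# A collapsed policy that always emits the same answer (e.g., "\boxed{1}"):
# perfect agreement yields pseudo-reward 1.0 regardless of correctness.
collapsed = ["1", "1", "1", "1"]
print(sum(majority_vote_pseudo_rewards(collapsed)) / len(collapsed))  # 1.0
```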
Manual inspection confirms this: after collapse, model outputs degenerate into random token sequences with a fixed, prompt-independent answer (e.g., '\boxed{1}'). See examples below:
Conclusion
Takeaway 1
On both synthetic and real reasoning tasks, SRT improves average and majority-voting accuracies, showing ability gains beyond the base model. In particular, the improvement in majority-voting accuracy signifies an improvement in the quality of the self-supervision signal during training, demonstrating a promising path toward self-improvement.
Takeaway 2
The question of whether self-training can be sustained indefinitely has mixed answers: under controllable difficulty (Reasoning Gym), SRT can keep improving beyond the base model on progressively more difficult tasks, whereas training on real-world math problems eventually exhibits reward hacking. Sustained self-improvement therefore requires developing additional regularization measures.
Related Literature:
Several works explore self-training LLMs, some of which are concurrent. A non-exhaustive list includes:
- Zelikman et al. (2022) and Huang et al. (2022) demonstrate that LLMs can self-improve by training on chain-of-thought traces generated by a previous instance of the model. In particular, Huang et al. (2023), Wang et al. (2023a), and Prasad et al. (2024) demonstrated the feasibility of using majority voting and self-consistency to filter chain-of-thought traces that, when used as SFT training data, improve LLM performance on downstream tasks.
- TTRL (Zuo et al., 2025) proposes a test-time training algorithm that is equivalent to SRT applied at test time.
- Similarly, Zhao et al. (2025) incorporate token certainty as a training signal.
- Shao et al. (2025) show that RL with spurious rewards can lead to a performance boost for some LLMs.