In this work, we ask the following question: can self-training — the process where a language model learns from its own judgments — be sustained within an RL framework?
Through a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach can improve not only the model's reasoning performance but also the quality of the feedback it generates for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in a sudden performance collapse.
Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
(1) Self-Rewarded Training (SRT)
In this work, we take the simplest possible self-reward mechanism, majority voting, and experiment with a simple yet effective self-training method, which we call Self-Rewarded Training (SRT). SRT uses consistency across multiple model-generated solutions to estimate correctness during RL training, providing a self-supervision signal without any labeled data.
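To make the mechanism concrete, here is a minimal sketch of the majority-vote pseudo-reward. The function name and the way answers are represented are our own illustration rather than the paper's actual implementation; in practice, final answers would be parsed from full rollouts (e.g., the contents of \boxed{...}), and these pseudo-rewards would stand in for verifier rewards in an otherwise standard RL loop.

```python
from collections import Counter

def majority_vote_pseudo_rewards(answers):
    """Binary pseudo-reward per rollout: 1.0 if the rollout's final answer
    matches the majority-vote answer across all rollouts for the same prompt,
    else 0.0. No ground-truth label is used."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Four rollouts of the same prompt: three agree on "42", so those three
# receive reward 1.0 and the dissenting rollout receives 0.0.
print(majority_vote_pseudo_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```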
(2) Self-Training Can Improve Both Performance and Feedback Quality
We investigate self-training under controlled settings on synthetic reasoning tasks from Reasoning Gym. Remarkably, SRT improves not only the mean accuracy but also the majority-voting accuracy, which is the very source of our training supervision. This improvement in the quality of the training signal drives further gains in performance: SRT outperforms a variant that uses majority votes from a fixed teacher as proxy labels.
To ensure the model understood the task and output format, we first trained it with RL using ground-truth data from the previous difficulty level before proceeding to label-free SRT training.
(3) SRT Works on Real-World Reasoning Tasks
In the early stages of training, SRT achieves performance rivaling standard reinforcement-learning methods trained explicitly on gold-standard answers. Results for additional model and training-dataset pairs are in the Appendix of the paper.
SRT also improves the quality of the majority votes themselves, which distinguishes our algorithm from simply learning from a fixed teacher's majority votes.
Note that for Llama-3.1-8B-Instruct, we use the official model-card evaluation temperature of 0; hence majority@32 accuracy is identical to average@32 accuracy.
(4) Can Self-Improvement Be Sustained Indefinitely?
SRT can achieve multi-level improvement on Reasoning Gym tasks. In the following figure, the Qwen3-4B-Base model climbs through progressively more difficult tasks without ground-truth labels via a simple curriculum strategy, where we train each level's final checkpoint with SRT on the next difficulty level.
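A minimal sketch of this curriculum loop, under our assumptions, is below; `make_reasoning_gym_tasks` and `srt_train` are hypothetical placeholders standing in for Reasoning Gym task generation and a label-free SRT training run, not functions from the paper's codebase.

```python
def make_reasoning_gym_tasks(difficulty: int) -> list[str]:
    """Placeholder: return unlabeled prompts at the given difficulty level."""
    return [f"level-{difficulty} task {i}" for i in range(3)]

def srt_train(checkpoint: str, tasks: list[str], steps: int) -> str:
    """Placeholder: run label-free SRT (RL with majority-vote pseudo-rewards)
    from `checkpoint` on `tasks` and return the resulting checkpoint."""
    return f"{checkpoint} -> srt(level_size={len(tasks)}, steps={steps})"

def curriculum_srt(base_checkpoint: str, num_levels: int, steps_per_level: int) -> str:
    """Each difficulty level starts from the previous level's final SRT checkpoint."""
    checkpoint = base_checkpoint
    for level in range(1, num_levels + 1):
        tasks = make_reasoning_gym_tasks(difficulty=level)  # no ground-truth labels
        checkpoint = srt_train(checkpoint, tasks, steps=steps_per_level)
    return checkpoint

print(curriculum_srt("Qwen3-4B-Base", num_levels=3, steps_per_level=100))
```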
However, training longer with SRT leads to reward hacking and model collapse on real-world datasets:
While investigating this sudden model collapse, we find that the model learns to maximize its self-assigned reward by producing consistent (second graph below) but incorrect answers (leftmost graph below). The optimal policy under this objective thus degenerates to producing the same answer regardless of input, maximizing the reward artificially.
Continued self-training on this proxy objective naturally drives the model toward this trivial solution, especially when it is simpler than solving the actual task.
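A toy calculation using the same majority-vote pseudo-reward sketched earlier illustrates why the constant-answer policy is so attractive to the optimizer; the answers below are illustrative, not measurements from our runs.

```python
from collections import Counter

def majority_vote_pseudo_rewards(answers):
    """Binary pseudo-reward: 1.0 for rollouts that match the majority answer."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# A policy that genuinely attempts the problem: rollouts disagree, so the
# average pseudo-reward is only 0.5 even if some answers are actually correct.
honest = ["3/7", "0.43", "3/7", "1/2"]
print(sum(majority_vote_pseudo_rewards(honest)) / len(honest))        # 0.5

# A collapsed policy that always emits the same answer (e.g., "\boxed{1}"):
# perfect agreement yields pseudo-reward 1.0 regardless of correctness.
collapsed = ["1", "1", "1", "1"]
print(sum(majority_vote_pseudo_rewards(collapsed)) / len(collapsed))  # 1.0
```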
Manual inspection confirms this: after collapse, model outputs degenerate into random token sequences with a fixed, prompt-independent answer (e.g., '\boxed{1}'). See examples below:
Conclusion
Takeaway 1
On both synthetic and real reasoning tasks, SRT improves average and majority-voting accuracies, showing ability gains beyond the base model. In particular, the improvement in majority-voting accuracy signifies an improvement in the quality of the self-supervision signal during training, demonstrating a promising path toward self-improvement.
Takeaway 2
The question of whether self-training can be sustained indefinitely has mixed answers: under controllable difficulty (Reasoning Gym), SRT can keep improving beyond the base model on progressively more difficult tasks, whereas training on real-world math problems eventually exhibits reward hacking. Sustained self-improvement therefore requires developing additional regularization measures.
Related Literature:
Several works explore self-training LLMs, some of which are concurrent. A non-exhaustive list includes:
- Zelikman et al. (2022) and Huang et al. (2022) demonstrate that LLMs can self-improve by training on chain-of-thought traces generated by a previous instance of the model. In particular, Huang et al. (2023), Wang et al. (2023a), and Prasad et al. (2024) demonstrated the feasibility of using majority voting and self-consistency to filter chain-of-thought traces that, when used as SFT training data, improve LLM performance on downstream tasks.
- TTRL (Zuo et al., 2025) proposes a test-time training algorithm that is equivalent to SRT applied at test time.
- Similarly, Zhao et al. (2025) incorporate token certainty as a training signal.
- Shao et al. (2025) show that RL with spurious rewards can lead to a performance boost for some LLMs.