DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
Keywords: model distillation; GRPO (Group Relative Policy Optimization)
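As a rough illustration of the "group relative" idea in Group Relative Policy Optimization (GRPO): several answers are sampled for the same prompt, each gets a scalar reward, and each answer's advantage is its reward normalized against the mean and standard deviation of its own group, so no separate value model is needed. The sketch below covers only that normalization step, not the full policy update. The function name, the `eps` term, the use of population standard deviation, and the example rewards are assumptions made for illustration, not details taken from the abstract above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled answer to a single prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled answers to one prompt:
# 1.0 for a verified-correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every answer gets the same reward, no answer is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that beat their group's average get a positive advantage and are reinforced; answers below it get a negative one. When all answers in a group score the same, the advantages collapse to zero and that prompt contributes no learning signal.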