DeepSeek-R1 Explained: Reinforcement Learning, GRPO, and Emergent Reasoning
DeepSeek-R1 became one of the most important AI papers of 2025 because it challenged a core assumption in large language model development: that advanced reasoning has to be taught mainly through human-written reasoning traces.
The paper introduced two related models: DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero was the cleaner scientific experiment. It applied large-scale reinforcement learning directly to a pretrained base model, without an initial supervised fine-tuning step. The goal was to test whether reasoning behaviors could emerge from outcome-based rewards rather than from imitation of human step-by-step solutions.
DeepSeek-R1 was the more polished production-oriented version. It used a multi-stage training pipeline to address the problems that appeared in R1-Zero, including poor readability and language mixing.
The paper matters because it reframed reasoning as an optimization problem.
Instead of asking:
How do we teach a model to imitate human reasoning?
DeepSeek asked:
What happens if we reward the model for reaching correct answers and allow the reasoning process to evolve through reinforcement learning?
The distinction is subtle, but it changes what training optimizes: the correctness of the result rather than resemblance to a human demonstration.
The Problem Before DeepSeek-R1
Large language models are excellent at pattern completion. During pretraining, they absorb syntax, facts, mathematical notation, code structure, and many examples of problem solving from large text corpora.
But knowing a concept is not the same as being able to reason with it.
A base model may know the quadratic formula, Python syntax, or the definition of a graph traversal algorithm. But when given a difficult multi-step problem, it may fail because it does not reliably coordinate those pieces of knowledge over time. It may choose the wrong strategy, lose track of variables, skip a verification step, or confidently commit to an incorrect path.
Earlier approaches to reasoning improvement often relied on supervised fine-tuning with chain-of-thought examples. The model would see examples of humans solving problems step by step and learn to imitate that format.
This approach works, but it has limitations.
First, high-quality reasoning data is expensive to create. Human-written reasoning traces require expertise, time, and careful verification. Even synthetic reasoning traces require filtering and quality control.
Second, imitation can constrain exploration. If the model is trained mainly to copy human reasoning styles, it may learn the surface structure of human explanations without discovering alternative strategies that are more effective for the model's own internal representations.
Third, process supervision can introduce bias. When humans or reward models evaluate intermediate reasoning steps, they implicitly define what “good reasoning” should look like. That may be useful for safety and interpretability, but it can also narrow the search space.
DeepSeek-R1-Zero explored a different path: reduce the dependence on human-labeled reasoning trajectories and let reinforcement learning apply pressure through final-answer correctness.
DeepSeek-R1-Zero: Reasoning Without Initial SFT
DeepSeek-R1-Zero was trained with reinforcement learning directly on a base model, without an initial supervised fine-tuning stage.
That detail is central.
In many training pipelines, supervised fine-tuning acts like training wheels. The model is first shown examples of how a helpful assistant or reasoning model should behave. Only after that does reinforcement learning refine the model.
DeepSeek-R1-Zero skipped that preliminary reasoning demonstration stage.
The model was not initially taught a curated human style of reasoning. Instead, it was given tasks with verifiable answers and rewarded based on outcomes. If the final answer was correct, the behavior that led to that answer became more likely. If the final answer was wrong, that behavior was discouraged.
The DeepSeek-R1 paper reports that this process led to the emergence of reasoning behaviors such as longer chain-of-thought responses, self-reflection, verification, and dynamic strategy adaptation.
This does not mean the model became conscious or “understood” the problem the way a human does. A more precise interpretation is that reinforcement learning created an optimization environment where reflective and verification-like behaviors increased the probability of receiving reward.
When the model took more time, checked its work, reconsidered a failed path, or tried an alternative strategy, it was more likely to reach a correct answer. Over many training iterations, those behaviors became reinforced.
Outcome-Based Rewards
The key idea behind DeepSeek-R1-Zero is outcome-based reinforcement learning.
Instead of grading every intermediate reasoning step, the training signal primarily depends on whether the final answer is correct.
For math problems, correctness can often be checked against a known answer. For coding problems, correctness can be checked through tests or rule-based evaluation. This makes the reward signal scalable because the training process does not require humans to manually grade every reasoning trace.
This is different from process supervision.
In process supervision, the model receives feedback on intermediate steps. That can be useful, but it requires defining what a correct reasoning step looks like. It may penalize unconventional paths even if they eventually lead to a correct answer.
Outcome-based rewards are less prescriptive. They do not care whether the model solved the problem in the same way a human would. They only care whether the final result is correct.
This is why the DeepSeek result was so interesting. The model began to generate longer and more structured reasoning not because it was directly told to do so, but because that behavior improved its chance of producing correct answers.
In other words:
DeepSeek did not directly teach the model a reasoning script. It created incentives under which reasoning-like behavior became useful.
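The outcome-based reward described in this section can be sketched in a few lines. This is a minimal illustration, not DeepSeek's actual reward code: the function names are hypothetical, and the assumption that the final answer appears inside `\boxed{...}` is made here only to give the checker something to parse.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any.

    Assumption: the model is asked to put its final answer in \\boxed{}.
    Nested braces are not handled in this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    """Score only the final answer; the reasoning text is never inspected."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


# A terse solution and a long, self-correcting one earn the same reward
# as long as both end on the reference answer.
terse = r"The answer is \boxed{42}"
reflective = r"Try 6 * 8 = 48. Wait, the factor is 7. 6 * 7 = 42. \boxed{42}"
rewards = [outcome_reward(terse, "42"), outcome_reward(reflective, "42")]
```

Nothing in `outcome_reward` looks at how the answer was reached, which is the sense in which the signal is less prescriptive than process supervision.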
Why GRPO Matters
DeepSeek used Group Relative Policy Optimization, or GRPO, as its reinforcement learning framework.
To understand why GRPO matters, it helps to compare it with PPO, or Proximal Policy Optimization.
In many PPO-style RL pipelines, training uses a separate value model, often called the critic. The value model estimates the expected reward of a partially generated response, and that estimate serves as the baseline against which actual outcomes are judged. This helps stabilize reinforcement learning, but it adds significant computational overhead. For large language models, the critic is typically comparable in size to the policy, so maintaining it during training is expensive in both memory and compute.
GRPO removes the separate value model.
Instead, for a given prompt, the model generates a group of candidate responses. Each response receives a reward. The algorithm then compares each response to the group's average reward, typically scaling the difference by the group's standard deviation. A response that performs better than the group average receives positive reinforcement. A response that performs worse receives negative reinforcement.
The basic intuition is:
Do not ask a separate critic to predict how good an answer should be.
Let the model generate several attempts, score them, and reinforce the attempts that outperform the group.
This matters because reasoning training is expensive. Long chain-of-thought outputs consume many tokens, and large models are costly to optimize. By removing the value model, GRPO makes large-scale reinforcement learning more practical.
This is one reason DeepSeek-R1 received so much attention. The paper was not only about reasoning quality. It was also about training efficiency.
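The group-relative comparison at the heart of GRPO can be sketched as follows. This covers only the advantage computation, not the full objective (no clipping or KL term), and the function name is hypothetical. Dividing by the group's standard deviation follows the usual GRPO formulation; using the population standard deviation and a small `eps` are choices made for this sketch.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sampled response relative to its own group.

    The group mean plays the role that a learned value model plays in PPO.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, no response stands out
# and every advantage is zero.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0])
```

In the mixed group the correct answers get an advantage of roughly +1 and the wrong ones roughly -1. The uniform group shows a practical consequence of the design: prompts that the model always solves, or never solves, contribute no learning signal.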
The Catch: GRPO Still Has a Credit Assignment Problem
GRPO reduces the overhead of reinforcement learning, but it does not solve every problem.
The remaining issue is credit assignment.
When a model produces a long reasoning trace, not every token contributes equally to the final answer. Some tokens are low-level execution: arithmetic, variable substitution, formatting, punctuation, or repeated procedural steps.
Other tokens are high-level strategic decisions: choosing a method, abandoning a dead end, checking an assumption, reframing the problem, or trying a different branch.
A broad outcome reward can identify that the final answer was correct, but it does not automatically know which parts of the reasoning trace were responsible for success.
That creates an algorithmic inefficiency.
If a model writes a 50-step solution and reaches the correct answer because of one crucial strategic pivot, GRPO with outcome rewards still assigns the same advantage to every token in that response. It reinforces the important planning move, but it reinforces the many ordinary execution tokens by the same amount, even though they were not the true source of improvement.
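A toy example makes the dilution concrete. It assumes the outcome-supervised setting in which one sequence-level advantage is copied to every token of the response; the helper name and the whitespace tokenization are hypothetical simplifications.

```python
from typing import List


def broadcast_advantage(tokens: List[str], advantage: float) -> List[float]:
    """Give every token in a response the same sequence-level advantage."""
    return [advantage for _ in tokens]


tokens = "Wait , try substitution instead : 3 + 4 = 7".split()
per_token = broadcast_advantage(tokens, advantage=1.0)

# Pair each token with its credit to see that nothing distinguishes
# the strategic pivot from the arithmetic that follows it.
credit = dict(zip(tokens, per_token))
```

The pivot ("Wait", "try", "substitution") and the routine arithmetic ("3", "+", "4") end up with identical credit, which is exactly the inefficiency at issue.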
This is where later work on hierarchical reasoning becomes useful.
DeepSeek-R1 showed that reinforcement learning can incentivize reasoning. Follow-up work by Wang et al. tries to explain more precisely how reasoning improves during RL training and why broad token-level optimization may be inefficient.
The HICRA (Hierarchy-Aware Credit Assignment) paper should be read as follow-up analysis, not as part of the original DeepSeek-R1 training recipe. DeepSeek-R1 used GRPO; HICRA is a later proposal for addressing the credit-assignment problem that GRPO leaves open.
A Follow-Up View: Reasoning as an Emergent Hierarchy
A later paper, Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning, argues that RL-trained reasoning models do not improve as one monolithic capability.
Instead, the authors propose that reasoning develops through an emergent hierarchy.
At a high level, they separate reasoning behavior into two functional roles:
- Low-level procedural execution
- High-level strategic planning
Low-level procedural execution includes basic operations: arithmetic, syntax, substitutions, formatting, and local transformations.
High-level strategic planning includes choosing which method to use, deciding when to backtrack, identifying which condition matters, switching approaches, and verifying whether the current path is still useful.
The paper describes a two-phase dynamic.
Phase 1: Procedural Consolidation
Early in RL training, the model is constrained by low-level execution.
It may know the relevant concepts from pretraining, but it is not yet reliable at using them inside long reasoning traces. It may make formatting mistakes, arithmetic errors, syntax mistakes, or local logical slips.
During this stage, reinforcement learning helps the model become a better procedural executor.
The model learns to follow the output format, apply basic operations more consistently, and avoid obvious local errors. In simple terms, the “worker” part of the system becomes more reliable.
This stage is necessary because high-level strategy does not help much if the model cannot execute the details correctly.
A perfectly chosen solution path can still fail if the model makes repeated arithmetic or syntax errors.
Phase 2: Strategic Exploration
Once procedural execution becomes more reliable, the bottleneck shifts.
The model is no longer limited mainly by basic execution. It is limited by strategy.
At this point, improvements come from better planning: selecting the right method, recognizing when a path is failing, exploring alternatives, and self-correcting before committing to an answer.
This is the “manager” phase.
The model begins to allocate more tokens to planning-like behavior. It tries different approaches, evaluates intermediate results, and sometimes changes direction.
This framework helps explain why reasoning traces become longer during RL training. Length is not automatically intelligence. A longer response can be useless if it is repetitive or confused.
But in successful reasoning models, additional length can represent more test-time computation: more room to plan, verify, branch, and recover from mistakes.
Demystifying the “Aha Moment”
One of the most discussed observations from DeepSeek-R1-Zero is the appearance of “aha moments.”
The model appears to pause, reconsider its approach, and correct itself mid-solution.
A weak interpretation would be:
The model suddenly became conscious of its mistake.
A better interpretation is:
The model learned that reflective behaviors increase the probability of reaching a correct final answer.
If a reasoning path becomes too complex or error-prone, continuing blindly often leads to failure. During RL training, the model receives higher reward when it catches that failure mode early and tries a better path.
Over many training iterations, phrases such as “wait,” “let’s reconsider,” or “try another approach” can become associated with successful problem solving.
From the hierarchical reasoning perspective, an “aha moment” is a visible marker of the model shifting from low-level execution into high-level strategic control.
Instead of simply continuing the next calculation, the model temporarily enters planning mode. It evaluates the path it is on and decides whether to continue, backtrack, or switch strategies.
There is also an implementation detail that matters here: DeepSeek did not leave the reasoning trace completely unstructured.
DeepSeek-R1-Zero used a format reward that encouraged the model to place its reasoning process inside literal <think>...</think> tags before producing the final answer. This matters because the tags created a structural boundary between the model's intermediate reasoning and its final response.
In that sense, the <think> block acted like a reasoning sandbox. The model could explore, backtrack, verify, and revise inside the reasoning region, while the final answer remained separated from the exploratory process.
This is where the so-called “aha moment” becomes easier to understand. The model's reflective behavior was not floating in an undefined space. It appeared inside a structured reasoning region that the training process explicitly encouraged through the format reward. The reward did not prescribe how to reason inside the tags, but it did create a stable container where longer reasoning traces could develop.
This is not magic. It is optimization pressure shaping behavior.
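A format reward of the kind described above could look roughly like this. It is a sketch under assumptions: the exact layout DeepSeek checked is not reproduced here, so the rule used (one `<think>...</think>` block followed by a non-empty answer) and the names `THINK_THEN_ANSWER` and `format_reward` are illustrative.

```python
import re

# Assumed layout: one <think>...</think> block, then a non-empty final answer.
THINK_THEN_ANSWER = re.compile(r"\A\s*<think>(.+?)</think>\s*(\S.*)\Z", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed layout, else 0.0.

    Only structure is checked; the text inside the tags is not judged.
    """
    match = THINK_THEN_ANSWER.match(response)
    if match is None:
        return 0.0
    answer = match.group(2)
    # Reject responses that open a second reasoning block in the answer.
    if "<think>" in answer or "</think>" in answer:
        return 0.0
    return 1.0


well_formed = format_reward("<think>6 * 8? Wait, it is 6 * 7.</think> 42")
untagged = format_reward("The answer is 42")
unclosed = format_reward("<think>still working on it 42")
```

The check rewards the container, not its contents: any amount of exploring, backtracking, or verifying inside the tags scores the same, so the reasoning style itself is left to the outcome reward.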
Pretraining Priors: Why RL Had Something to Work With
It is also important not to overstate what reinforcement learning created from scratch.
DeepSeek-R1-Zero was not trained from nothing. It started from a pretrained base model.
That means the model already had a vast amount of knowledge encoded from pretraining: language patterns, mathematical notation, programming syntax, examples of explanations, and many forms of human-written reasoning.
RL did not invent algebra or Python inside the model. It changed how the model used capabilities that were already latent.
A useful analogy is the difference between having tools and knowing how to use them under pressure.
Pretraining gives the model a toolbelt. It learns what equations look like, what code looks like, and what explanations look like.
Reinforcement learning teaches the model which sequence of tool use tends to produce correct answers.
This is why it is more accurate to say that RL activated and organized latent reasoning priors rather than created reasoning from nothing.
The follow-up hierarchical reasoning work makes a similar point: the base model likely already contains a latent blueprint of planning and execution patterns because human text itself often has hierarchical structure. RL acts as a catalyst that makes those structures more useful for solving verifiable tasks.
Strategic Grams: Measuring the “Manager”
A serious analysis cannot simply claim that the model is planning because it writes words that sound reflective.
The challenge is measurement.
Wang et al. introduce the idea of Strategic Grams. These are short phrase patterns that function as planning units inside reasoning traces.
Examples include phrases associated with:
- Deduction: using a known fact, theorem, or condition
- Branching: trying a different approach
- Backtracing: returning to something stated earlier in the problem
The point is not that any single phrase proves human-like reasoning. The point is that certain phrases often function as semantic units that guide the logical direction of a solution.
By tracking these phrases, the researchers attempt to separate high-level planning tokens from ordinary execution tokens.
This gives them a way to ask a more precise question:
Is reinforcement learning merely making the model more verbose, or is it increasing the model's ability to explore useful strategies?
Strategic Grams are a proxy for planning-like behavior. They are not a perfect window into the model's internal cognition, but they provide a measurable signal.
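To show what separating planning tokens from execution tokens means in practice, here is a small tagging sketch. The phrase list is invented for illustration and is not taken from Wang et al.; the paper's own procedure for identifying Strategic Grams is not reproduced here, and exact n-gram matching is an assumption of this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical phrase list for illustration only; not taken from the paper.
STRATEGIC_GRAMS: Tuple[Tuple[str, ...], ...] = (
    ("we", "can", "use"),            # deduction
    ("try", "another", "approach"),  # branching
    ("going", "back", "to"),         # backtracing
)


def planning_mask(tokens: Sequence[str]) -> List[bool]:
    """Mark tokens that fall inside any listed strategic gram."""
    lowered = [t.lower() for t in tokens]
    mask = [False] * len(lowered)
    for gram in STRATEGIC_GRAMS:
        n = len(gram)
        for start in range(len(lowered) - n + 1):
            if tuple(lowered[start:start + n]) == gram:
                for i in range(start, start + n):
                    mask[i] = True
    return mask


trace = "Expand the square . Try another approach : we can use symmetry".split()
mask = planning_mask(trace)
planning_tokens = [tok for tok, is_plan in zip(trace, mask) if is_plan]
```

Here `planning_tokens` contains "Try another approach" and "we can use", while the algebra around them is left as execution. A fixed list like this is also the simplest way to see the proxy's weakness: any strategy phrased differently goes unmarked.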
Token Entropy vs. Semantic Entropy
The hierarchical reasoning paper also makes an important distinction between token-level entropy and semantic entropy.
Token-level entropy measures uncertainty over the next token.
During RL training, token entropy may decrease because the model becomes more confident at routine execution. It learns to produce common mathematical steps, formatting patterns, and local transitions more reliably.
But reasoning traces contain many low-level execution tokens. If we only measure token entropy, we may mostly observe the “worker” becoming more predictable.
That does not necessarily tell us whether the “manager” is becoming more strategic.
To study that, Wang et al. examine semantic entropy over Strategic Grams. Instead of measuring uncertainty over individual next tokens, semantic entropy measures diversity in high-level planning patterns.
This distinction matters because good reasoning does not require the model to be uncertain everywhere. It should be confident about routine execution while still preserving flexibility in strategy.
A strong reasoning model should know how to execute basic steps reliably, but it should also explore different high-level approaches when the problem demands it.
In this view, successful RL training should reduce uncertainty in low-level execution while maintaining or increasing strategic diversity.
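The two quantities can be contrasted with a short sketch. Both use Shannon entropy; what differs is the thing being counted. Treating semantic entropy as the entropy of the frequency distribution of observed planning phrases is an assumption of this example, and the function names and sample phrases are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def shannon_entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def token_entropy(next_token_probs: Sequence[float]) -> float:
    """Uncertainty over the next token at one position."""
    return shannon_entropy(next_token_probs)


def semantic_entropy(strategic_grams: Sequence[str]) -> float:
    """Diversity of the planning phrases observed across sampled traces."""
    counts = Counter(strategic_grams)
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())


# A confident "worker": almost all probability mass on one next token.
confident_step = token_entropy([0.97, 0.01, 0.01, 0.01])

# A flexible "manager": several different plans appear across samples.
varied_plans = semantic_entropy(
    ["use the theorem", "try another approach", "go back to", "use the theorem"]
)
single_plan = semantic_entropy(["use the theorem"] * 4)
```

A model can score low on the first measure and high on the second at the same time, which is the combination the hierarchical view associates with healthy training: predictable execution, diverse strategy.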
HICRA: Rewarding the Manager Instead of Every Token
The natural next question is whether reinforcement learning can become more efficient by assigning credit more precisely.
Wang et al. propose HICRA, or Hierarchy-Aware Credit Assignment.
The idea is straightforward: if later-stage reasoning improvements depend heavily on high-level planning, then the learning signal should focus more on planning tokens and less on routine execution tokens.
GRPO rewards the response broadly.
HICRA tries to amplify the reward signal on tokens associated with Strategic Grams. In plain English:
GRPO rewards the whole solution.
HICRA tries to reward the strategic decisions that made the solution work.
This is an important shift.
It suggests that the future of reasoning training may not simply be “generate longer chain-of-thought traces.” The more important goal may be to improve the model's ability to choose productive strategies.
Longer reasoning is useful only when the extra tokens are doing useful cognitive work: planning, verifying, branching, or recovering from failure.
HICRA points toward a more targeted version of reasoning optimization.
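One way to picture hierarchy-aware credit assignment is as a small change to the per-token advantages. The rule below is an assumed form chosen for illustration, not a verified copy of the paper's formula: planning tokens receive the sequence-level advantage plus `alpha` times its magnitude, and all other tokens keep the original value. The function name, the mask input, and the default `alpha` are hypothetical.

```python
from typing import List, Sequence


def hierarchy_aware_advantages(
    advantage: float,
    planning_mask: Sequence[bool],
    alpha: float = 0.2,
) -> List[float]:
    """Amplify a sequence-level advantage on planning tokens only.

    Assumed rule: planning tokens get advantage + alpha * |advantage|;
    execution tokens keep the original advantage.
    """
    boost = alpha * abs(advantage)
    return [advantage + boost if is_plan else advantage for is_plan in planning_mask]


# Positions 1 and 2 are planning tokens; the rest are execution tokens.
mask = [False, True, True, False]

# Successful response: planning tokens are reinforced more strongly.
success = hierarchy_aware_advantages(1.0, mask)

# Failed response: planning tokens are penalized less, which keeps
# strategic exploration from being discouraged as hard as execution errors.
failure = hierarchy_aware_advantages(-1.0, mask)
```

With `alpha=0.2`, planning tokens in the successful response get roughly 1.2 instead of 1.0, and in the failed response roughly -0.8 instead of -1.0. Setting `alpha=0` recovers the uniform credit that plain GRPO assigns.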
However, there is also a limitation.
If HICRA depends on a predefined set of Strategic Grams, then it may miss subtler forms of planning behavior. A model may use strategies that do not match the hand-selected phrase patterns. Future versions of hierarchy-aware credit assignment may need to identify planning behavior dynamically rather than relying only on fixed phrase lists.
That limitation is important because it keeps the claim grounded. HICRA is not a final answer. It is a proposed direction for making credit assignment more precise.
GRPO vs. HICRA
| Question | GRPO | HICRA |
|---|---|---|
| Core idea | Compare multiple responses to the same prompt and reinforce the ones that perform better than the group average. | Identify high-level planning tokens and assign more credit to the strategic decisions that helped solve the problem. |
| Main efficiency gain | Removes PPO's separate value model, reducing memory and compute overhead. | Reduces wasted optimization pressure by focusing reward on planning-relevant parts of the reasoning trace. |
| Credit assignment | Broad: reward is applied across the generated response. | Targeted: reward is amplified around tokens associated with strategic planning. |
| What it improves | Makes large-scale RL for reasoning more practical. | Tries to make reasoning RL more sample-efficient and strategy-aware. |
| Weakness | Can dilute the learning signal across routine execution tokens. | Depends on reliably identifying planning tokens, which may be difficult if using predefined phrase patterns. |
| Best interpretation | A scalable RL framework for outcome-based reasoning training. | A proposed refinement for hierarchy-aware reasoning optimization. |
DeepSeek-R1 vs. DeepSeek-R1-Zero
It is important to distinguish between DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero was the pure reinforcement learning experiment. It started from DeepSeek-V3-Base and applied large-scale RL directly, without an initial supervised fine-tuning stage. This made it valuable as a scientific result because it showed that reasoning behaviors could emerge from outcome-based reinforcement learning.
But R1-Zero also had practical problems.
Because it began without a cold-start supervised fine-tuning phase, its outputs could be difficult to read. The DeepSeek paper specifically notes issues such as poor readability and language mixing. In other words, R1-Zero discovered useful reasoning behaviors, but it did not always present those behaviors in a clean or user-friendly way.
DeepSeek-R1 was designed to fix that.
The final DeepSeek-R1 pipeline added a cold-start supervised fine-tuning stage before further reasoning-oriented RL. This cold-start stage should not be interpreted as the main source of the model's reasoning ability. Its purpose was more specific: to provide a small set of clean, readable examples that anchored the model's output format and reduced the chaotic behavior seen in R1-Zero.
In practical terms:
R1-Zero showed that reasoning could emerge through RL.
R1 showed how to make that reasoning readable, stable, and useful.
After cold-start SFT, DeepSeek-R1 continued with reasoning RL, rejection sampling, general supervised fine-tuning, and alignment RL. This turned the raw reasoning behavior of R1-Zero into a more polished assistant-style model.
This distinction matters because it prevents a common misunderstanding. DeepSeek-R1 was not simply “pure RL.” DeepSeek-R1-Zero was the pure RL demonstration. DeepSeek-R1 was the engineered system that combined cold-start formatting, RL-based reasoning improvement, general capability tuning, and alignment.
Why DeepSeek-R1 Was Game-Changing
DeepSeek-R1 was game-changing for three main reasons.
First, it demonstrated that reinforcement learning can induce strong reasoning behaviors without relying entirely on human-labeled reasoning trajectories.
That matters because human reasoning data is hard to scale. If outcome-based RL can produce reasoning improvements using verifiable rewards, then the bottleneck shifts away from manually writing reasoning traces and toward designing better reward environments.
Second, it showed that reasoning models could be made more open and accessible.
DeepSeek released DeepSeek-R1-Zero, DeepSeek-R1, and several distilled models. This gave researchers and developers more ability to study reasoning behavior outside closed frontier labs.
Third, it changed the market conversation around the economics of AI.
DeepSeek-R1 did not invent reasoning models. OpenAI had already introduced the o1 model series in September 2024, describing it as a model series designed to spend more time thinking before responding and trained with large-scale reinforcement learning. Google also released Gemini 2.0 Flash Thinking in December 2024 as an experimental reasoning-oriented model.
What DeepSeek changed was the perception of who could compete and at what cost.
After R1, reasoning models no longer looked like a capability available only inside a few closed American labs. DeepSeek showed that an open model family could participate in the reasoning-model race and that distilled models could transfer reasoning behavior into smaller models.
The market reaction was immediate. Reuters reported that Nvidia lost approximately $593 billion in market value in a single day, January 27, 2025, after DeepSeek triggered a broad AI stock selloff. The selloff reflected investor concern that competitive AI systems might require less infrastructure spending than previously assumed.
Microsoft also moved quickly to make DeepSeek-R1 available through Azure AI Foundry and GitHub. That did not mean Microsoft copied DeepSeek's training method. But it did show that major American cloud platforms saw demand for making DeepSeek-R1 accessible to developers and enterprises.
The careful claim is this:
DeepSeek-R1 did not create the reasoning-model trend by itself, but it accelerated the reasoning-model race by showing that open, RL-centered reasoning models could be competitive, widely distributed, and economically disruptive.
What This Means for AI Engineering
The DeepSeek-R1 story is not only about one model.
It points toward a broader shift in AI engineering.
For years, much of the progress in language models came from scaling pretraining: more data, more parameters, more compute.
Reasoning models introduce a different axis of improvement.
Instead of only scaling what the model knows, researchers are scaling how the model uses what it knows.
That includes:
- more test-time computation
- longer reasoning traces
- verifiable reward environments
- outcome-based reinforcement learning
- better credit assignment
- strategy-aware optimization
- distillation of reasoning behavior into smaller models
This is why GRPO and HICRA are interesting. They are not just training tricks. They represent different answers to the same deeper question:
When a model solves a hard problem, which parts of its output should receive credit?
GRPO answers this at the group level. It compares multiple responses and reinforces the ones that perform better than the group average.
HICRA tries to answer it at the hierarchy level. It asks whether the model's strategic planning tokens deserve more credit than routine execution tokens.
That question may become increasingly important as reasoning models become more capable.
Security Implications
For cybersecurity, this shift matters.
Reasoning models are more useful for tasks that require multi-step analysis: code review, alert triage, log investigation, threat modeling, malware analysis, and vulnerability research.
But stronger reasoning also creates new risks.
A model that can plan better can also assist with more complex misuse. It may be better at chaining steps together, adapting when blocked, and debugging its own failed attempts. This makes alignment and monitoring more important, not less.
DeepSeek-R1 also raises questions about model transparency. If advanced reasoning behaviors emerge through RL rather than being explicitly programmed, then developers need better ways to understand how those behaviors form and when they fail.
The hierarchical reasoning perspective is useful here. If we can distinguish procedural execution from strategic planning, we may get better tools for evaluating model behavior.
For example, a security evaluation might ask:
- Is the model simply executing known steps?
- Is it developing a new strategy?
- Is it backtracking around safety constraints?
- Is it improving through longer reasoning, or merely becoming more verbose?
These are not abstract research questions. They matter for how reasoning models will be deployed in real security workflows.
Final Takeaway
DeepSeek-R1 matters because it reframed reasoning as an outcome-driven optimization problem rather than only a human-imitation problem.
DeepSeek-R1-Zero showed that a pretrained model can improve reasoning behavior through large-scale reinforcement learning, even without an initial supervised fine-tuning stage. GRPO made this more practical by replacing PPO's value model with group-relative scoring.
But the next question is not only whether RL works.
The deeper question is where the learning signal should go.
Follow-up work on hierarchical reasoning suggests that models may first consolidate low-level procedural skills and then shift toward high-level strategic exploration. If that view is correct, the future of reasoning training may depend less on simply producing longer reasoning traces and more on better credit assignment.
In other words:
DeepSeek-R1 showed that reasoning can be incentivized.
Hierarchy-aware training asks how to make that incentive more precise.
The real breakthrough is not that the model “learned to think” in a human sense.
The breakthrough is that, under the right reward structure, a pretrained model can discover behaviors that look like planning, verification, self-correction, and strategy adaptation because those behaviors improve the probability of producing correct answers.
That is why DeepSeek-R1 changed the conversation.
It moved reasoning from imitation toward optimization.
And the next frontier may be learning how to reward not just the answer, and not just the whole chain of thought, but the strategic decisions that make reasoning work.
Sources
- DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv, 2025. https://arxiv.org/abs/2501.12948
- Guo et al. “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning.” Nature, 2025. https://www.nature.com/articles/s41586-025-09422-z
- Wang et al. “Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning.” arXiv, 2025. https://arxiv.org/abs/2509.03646
- OpenAI. “Learning to reason with LLMs.” September 2024. https://openai.com/index/learning-to-reason-with-llms/
- OpenAI. “Introducing OpenAI o1-preview.” September 2024. https://openai.com/index/introducing-openai-o1-preview/
- Google AI for Developers. “Gemini API changelog.” December 2024 entry for Gemini 2.0 Flash Thinking Mode. https://ai.google.dev/gemini-api/docs/changelog
- Microsoft Azure. “DeepSeek R1 is now available on Azure AI Foundry and GitHub.” January 2025. https://azure.microsoft.com/en-us/blog/deepseek-r1-is-now-available-on-azure-ai-foundry-and-github/
- Reuters. “DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss.” January 2025. https://www.reuters.com/technology/chinas-deepseek-sets-off-ai-market-rout-2025-01-27/