Rejection Sampling in DeepSeek-R1

2025-01-23
11 min read

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for various applications. One of the most exciting developments in this area is the DeepSeek-R1-Zero model, which leverages reinforcement learning (RL) to enhance reasoning capabilities without relying on supervised fine-tuning (SFT). This blog post will guide you through the intricacies of DeepSeek-R1-Zero, from a beginner's perspective to more advanced technical details, followed by an exploration of rejection sampling and its role in DeepSeek-R1.

Introduction to DeepSeek-R1-Zero

DeepSeek-R1-Zero is a groundbreaking model designed to improve the reasoning abilities of LLMs through pure reinforcement learning. Unlike traditional models that require supervised fine-tuning, DeepSeek-R1-Zero skips this step, allowing it to develop reasoning skills autonomously. This approach not only saves computational resources but also provides insights into how models can learn complex problem-solving strategies from scratch.

Understanding Reinforcement Learning in DeepSeek-R1-Zero

What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. In the context of DeepSeek-R1-Zero, the model learns to generate better reasoning outputs by optimizing a policy model through an RL algorithm called Group Relative Policy Optimization (GRPO).

Group Relative Policy Optimization (GRPO)

GRPO is the backbone of DeepSeek-R1-Zero's learning process. It optimizes the policy model, denoted as $\pi$, to generate better reasoning outputs by maximizing an objective function. Here's a detailed breakdown of how GRPO works:

Objective Function

The objective function, $J_{GRPO}(\theta)$, is defined as:

$$J_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q)} \left[ \sum_{i=1}^{G} \min \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip} \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1-\epsilon, 1+\epsilon \right) A_i \right) - \beta D_{KL}\left(\pi_\theta(O|q) \,\|\, \pi_r(O|q)\right) \right]$$

Let's break the objective down term by term:

  1. Expectation and Sampling:

$$\mathbb{E}_{q \sim P(Q)} \left[ \cdot \right]$$

  2. Summation and Minimization:

$$\sum_{i=1}^{G} \min \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip} \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1-\epsilon, 1+\epsilon \right) A_i \right)$$

  3. KL Divergence Term:

$$\beta D_{KL}\left(\pi_\theta(O|q) \,\|\, \pi_r(O|q)\right)$$

  • $\mathbb{E}$ denotes the expectation over the distribution of questions $P(Q)$.
  • $\{o_i\}_{i=1}^{G}$ is a group of outputs sampled from the old policy $\pi_{\theta_{old}}$ for a given question $q$.
  • $\pi_\theta(o|q)$ is the probability of output $o$ given question $q$ under the current policy.
  • $A_i$ is the advantage, measuring how much better output $o_i$ is than the average output in its group.
  • $\text{clip}(x, 1-\epsilon, 1+\epsilon)$ clips the policy ratio to the interval $[1-\epsilon, 1+\epsilon]$, limiting how far a single update can move the policy.
  • $\beta$ is a hyper-parameter, and $D_{KL}(\pi_\theta \,\|\, \pi_r)$ is the Kullback-Leibler divergence between the current policy and the reference policy $\pi_r$.

Sampling Outputs

For each question $q$, GRPO samples a group of $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$. This set of outputs allows the algorithm to compare and contrast the different responses the model can generate for a single question.

Advantage Calculation

The advantage $A_i$ for output $o_i$ is calculated using the rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$$

  • $r_i$ is the reward associated with output $o_i$.
  • $\text{mean}(\{r_1, r_2, \ldots, r_G\})$ is the average reward of all the outputs in the group.
  • $\text{std}(\{r_1, r_2, \ldots, r_G\})$ is the standard deviation of the rewards within the group.

The advantage quantifies how much better a particular output is than the average output in its group, in terms of reward. Dividing by the standard deviation normalizes the scale of the advantages, which stabilizes learning.
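
As a concrete illustration, here is a minimal sketch of this group-normalized advantage computation in plain Python with NumPy (the function name and the small epsilon guard are our own additions for illustration):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize a group's rewards into advantages, as in the formula above."""
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the group mean and divide by the group standard deviation;
    # eps guards against a zero-variance group (all rewards identical).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled outputs for one question, two of them rewarded.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1., -1., -1.,  1.]
```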

Policy Optimization

The policy model $\pi$ is optimized by maximizing the GRPO objective function. This objective function combines the advantage with the policy ratio between the new and old policies, and includes a clipping function and KL divergence term to stabilize the training process.
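
To make the optimization step concrete, here is a hedged sketch of the clipped part of the GRPO objective in PyTorch. It assumes sequence-level log-probabilities are already available, averages over the group, and omits the KL penalty; it is an illustration of the surrogate loss, not DeepSeek's actual training code.

```python
import torch

def grpo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate term of the GRPO objective for one group of outputs.

    logp_new / logp_old: log pi_theta(o_i|q) and log pi_theta_old(o_i|q),
    each a tensor of shape (G,); advantages: group-normalized A_i.
    The KL penalty (-beta * D_KL against the reference policy) is omitted.
    """
    ratio = torch.exp(logp_new - logp_old)                    # policy ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()               # maximize this
```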

Reward System

The reward system provides feedback to the model, indicating the quality of its outputs. DeepSeek-R1-Zero uses a rule-based reward system that consists of two types of rewards:

  1. Accuracy Rewards: These rewards are given when the model provides correct answers. For math problems, the final answer must be in a specified format to allow for automated verification. For coding problems, the correctness of the code is checked using a compiler.
  2. Format Rewards: These rewards incentivize the model to structure its responses with the reasoning process enclosed between <think> and </think> tags. This promotes a consistent structure for the model's outputs, which can help with interpretability.
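
For illustration, a rule-based reward of this kind could look roughly like the sketch below. The specific regular expressions, the \boxed{} answer format, and the reward values are assumptions made for the example, not DeepSeek's actual rules:

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward responses whose final \\boxed{...} answer matches the reference.

    Real systems use stronger checkers (math verifiers, compilers, test suites).
    """
    match = re.search(r"\\boxed\{(.+?)\}", response)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    return accuracy_reward(response, reference_answer) + format_reward(response)
```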

Baseline Estimation

Unlike traditional RL algorithms that utilize a critic model to estimate the baseline, GRPO estimates the baseline directly from group scores. This method avoids the use of a large critic model, saving computational resources during training.

Training and Self-Evolution Process

Training Template

To guide the model, a simple template is used. This template instructs the model to first produce a reasoning process and then provide the final answer. The model is not given any specific instructions on how to reason, allowing researchers to observe its natural development through the RL process.
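
For a sense of what such a template looks like, here is a paraphrased sketch (the exact wording used in the paper differs; this is only illustrative):

```python
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process and then provides the answer. The reasoning "
    "process is enclosed in <think> </think> tags and the final answer in "
    "<answer> </answer> tags.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = TEMPLATE.format(question="What is the sum of the first 100 positive integers?")
```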

Self-Evolution

As the RL training progresses, DeepSeek-R1-Zero shows consistent improvement in performance. For example, its average pass@1 score on the AIME 2024 benchmark increases from 15.6% to 71.0%. The model learns to spend more time thinking and increases the length of its reasoning process. It develops sophisticated behaviors such as reflection and exploring alternative problem-solving approaches, which emerge through interaction with the RL environment.

Aha Moment

During training, the model experiences an "aha moment" where it learns to rethink its initial approach by allocating more thinking time to the problem. This moment demonstrates the model's ability to learn advanced problem-solving strategies autonomously through RL.

Key Achievements

  1. Autonomous Learning: DeepSeek-R1-Zero demonstrates that LLMs can develop reasoning skills through pure RL without any supervised data.
  2. Strong Reasoning Capabilities: The model achieves performance levels comparable to OpenAI-o1-0912 on certain benchmarks.
  3. Majority Voting: Using majority voting, its performance on AIME 2024 further improves, exceeding the performance of OpenAI-o1-0912.
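
Majority voting (often called self-consistency) is simple to implement: sample several answers for the same question and keep the most frequent one. A minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among several sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# Five samples for the same problem; the consensus answer wins.
print(majority_vote(["72", "72", "68", "72", "70"]))  # -> "72"
```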

Limitations

Despite its impressive capabilities, DeepSeek-R1-Zero faces issues like poor readability and language mixing. Its responses may not always be in a human-friendly format and might include multiple languages in one answer.

Understanding Rejection Sampling: From Basics to DeepSeek-R1

When dealing with complex probability distributions, directly sampling from them can be challenging. Rejection sampling is a clever statistical technique that allows us to overcome this hurdle. This method is not only foundational in computational statistics but also a crucial component in advanced models like DeepSeek-R1. Here, we'll explore the basics of rejection sampling, its mathematical framework, and its application in DeepSeek-R1.

What is Rejection Sampling?

Imagine trying to throw darts at a dartboard with an intricate shape. Instead of directly aiming at the complex outline, you aim at a rectangular board that completely covers it. Each dart that lands within the shape is accepted, while others are rejected. This is the essence of rejection sampling.

In mathematical terms, rejection sampling is a method to generate samples from a complex probability distribution, referred to as the target distribution $\pi(x)$. To achieve this, we use a simpler proposal distribution $q(x)$ from which we can easily draw samples. Here's how it works:

  1. Sample from the Proposal Distribution: Generate a candidate sample, $x$, from $q(x)$.
  2. Generate a Random Threshold: Draw a uniform random number, $u$, from $[0, 1]$.
  3. Acceptance Probability: Compare $u$ to the ratio $\frac{\pi(x)}{M \cdot q(x)}$, where $M$ is a scaling constant ensuring $M \cdot q(x) \geq \pi(x)$ for all $x$.
    • If $u < \frac{\pi(x)}{M \cdot q(x)}$, accept $x$.
    • Otherwise, reject $x$ and repeat.

This simple algorithm ensures that the accepted samples follow the target distribution, even though they are initially drawn from the proposal distribution.
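
Here is a self-contained toy example of the procedure in Python, sampling from a Beta(2, 5) target using a uniform proposal on [0, 1]. The choice of target, proposal, and constant M is ours, purely for illustration:

```python
import random

def target_pdf(x: float) -> float:
    """Beta(2, 5) density on [0, 1]: 30 * x * (1 - x)^4."""
    return 30.0 * x * (1.0 - x) ** 4

def rejection_sample(n: int, M: float = 2.5) -> list:
    """Draw n samples from the target using Uniform(0, 1) as the proposal q(x) = 1.

    M must satisfy M * q(x) >= pi(x) everywhere; the Beta(2, 5) density peaks
    at x = 0.2 with value ~2.46, so M = 2.5 is enough.
    """
    samples = []
    while len(samples) < n:
        x = random.random()                   # 1. candidate from the proposal
        u = random.random()                   # 2. uniform threshold
        if u < target_pdf(x) / (M * 1.0):     # 3. accept with prob pi(x) / (M q(x))
            samples.append(x)
    return samples

draws = rejection_sample(10_000)
print(sum(draws) / len(draws))  # close to the Beta(2, 5) mean, 2/7 ≈ 0.286
```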

Key Components of Rejection Sampling

  1. Target Distribution $\pi(x)$: The complex distribution we want to sample from.
  2. Proposal Distribution $q(x)$: A simpler distribution that approximates the target and is easy to sample from.
  3. Scaling Factor $M$: A constant such that $M \cdot q(x)$ covers $\pi(x)$ everywhere.
  4. Acceptance Probability: Determines whether a sample is kept, based on how well it aligns with the target distribution.

Why Rejection Sampling Works

The magic lies in the acceptance probability. Samples where the target distribution is high relative to the proposal distribution are more likely to be accepted, creating a set of samples that mimic the target distribution.

Efficiency Considerations

The efficiency of rejection sampling depends on how closely the proposal distribution matches the target distribution. A poor choice of $q(x)$ or an overly large $M$ can lead to many rejections, making the method computationally expensive.
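
A useful rule of thumb: when both $\pi(x)$ and $q(x)$ are properly normalized densities, the overall probability of accepting any given proposal is exactly $1/M$, so on average $M$ draws from the proposal are needed for each accepted sample. In the Beta(2, 5) toy example above, with $M = 2.5$, roughly 2.5 proposals are needed per accepted sample.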

Rejection Sampling in DeepSeek-R1

Rejection sampling plays a pivotal role in the DeepSeek-R1 model, a cutting-edge framework designed for optimizing response generation in language models. In DeepSeek-R1, the goal is to align the generated responses with a reward-maximizing optimal policy. Here’s how rejection sampling is adapted to achieve this:

  1. Target Distribution $\pi_{r_\psi}(y|x)$: Represents the optimal policy derived from a reward function $r_\psi$, which scores how well responses align with desired outcomes.
  2. Proposal Distribution $\pi_{sft}(y|x)$: Represents a simpler, supervised fine-tuned (SFT) policy that serves as the starting point.

The Rejection Sampling Pipeline in DeepSeek-R1

The process starts with generating response candidates $y$ from the SFT policy $\pi_{sft}(y|x)$. These candidates are evaluated using the reward function $r_\psi$, and rejection sampling decides which responses to accept based on their alignment with the optimal policy $\pi_{r_\psi}(y|x)$.

Mathematical Steps

  1. Sampling: Generate response candidates from $\pi_{sft}(y|x)$.
  2. Reward Scoring: Use a reward function to assign a score $r_\psi(x, y)$ to each candidate.
  3. Acceptance Probability: Compute the ratio $\frac{\pi_{r_\psi}(y|x)}{M \cdot \pi_{sft}(y|x)}$, where $M$ ensures the scaled proposal distribution covers the target.
  4. Acceptance Decision: Accept candidates with a probability proportional to their alignment with the target policy.
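
Because the reward-maximizing policy under a KL constraint takes the form $\pi_{r_\psi}(y|x) \propto \pi_{sft}(y|x) \exp(r_\psi(x, y)/\beta)$, the acceptance ratio ends up depending only on the candidates' rewards. Below is a hedged sketch of the resulting pipeline in Python; the function names, the reward-model interface, and the $\exp((r - r_{\max})/\beta)$ acceptance rule follow the statistical rejection sampling recipe of Liu et al. (2023) and are illustrative, not DeepSeek's exact implementation:

```python
import math
import random

def rejection_sample_responses(question, sample_fn, reward_fn,
                               num_candidates=16, beta=0.5):
    """Keep high-reward responses from an SFT policy via rejection sampling.

    sample_fn(question)    -> one candidate response drawn from pi_sft(y|x)
    reward_fn(question, y) -> scalar reward r_psi(x, y)
    """
    candidates = [sample_fn(question) for _ in range(num_candidates)]
    rewards = [reward_fn(question, y) for y in candidates]
    r_max = max(rewards)

    accepted = []
    for y, r in zip(candidates, rewards):
        accept_prob = math.exp((r - r_max) / beta)  # in (0, 1]; best candidate always kept
        if random.random() < accept_prob:
            accepted.append(y)
    return accepted
```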

Algorithmic Enhancements in DeepSeek-R1

DeepSeek-R1 implements a refined version of rejection sampling to improve efficiency and applicability:

  1. Pairwise Reward-Ranking Model: Instead of directly calculating $r_\psi$, a ranking model evaluates pairs of responses to derive relative preferences.
  2. Iterative Sampling: Samples are drawn iteratively, and accepted candidates are excluded from subsequent iterations, enhancing diversity.
  3. Hyperparameter $\beta$: A parameter controlling the trade-off between exploiting high-reward samples and exploring diverse responses.

Theoretical Foundation

The method’s statistical correctness is ensured by the expected acceptance rate:

$$\mathbb{E}_{y \sim \pi_{sft}}\left[\exp\left(\left(r_\psi(x, y) - r_{\max}\right) / \beta\right)\right],$$

where $r_{\max}$ is the maximum reward among unaccepted candidates. By tuning $\beta$, DeepSeek-R1 strikes a balance between prioritizing high-reward responses and maintaining alignment with the SFT policy.
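
For example, with $\beta = 0.5$, a candidate whose reward is 0.5 below the best in its batch is accepted with probability $\exp(-1) \approx 0.37$, while the best candidate is always accepted; lowering $\beta$ makes the selection greedier, and raising it keeps more diverse, lower-reward responses.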

Why It Matters

Rejection sampling in DeepSeek-R1 enhances the model’s ability to generate responses that are not only accurate but also align with human preferences and rewards. By iteratively refining response selection, the method ensures higher-quality outputs compared to simpler sampling methods.

Conclusion

DeepSeek-R1-Zero and its successor DeepSeek-R1 represent significant advances in large language models. DeepSeek-R1-Zero learns to reason through pure reinforcement learning, bypassing supervised fine-tuning, while DeepSeek-R1 leverages rejection sampling to optimize response generation. Together, they demonstrate how RL and statistical sampling techniques enable models to develop complex problem-solving skills and improve performance over time. While limitations like readability and language mixing persist, the insights gained pave the way for future breakthroughs in AI.


FAQs

  1. Can rejection sampling be used for any distribution? Yes, as long as you have a proposal distribution that adequately covers the target distribution.

  2. What are the limitations of rejection sampling? Its efficiency heavily depends on the choice of proposal distribution and scaling factor.

  3. How does DeepSeek-R1 improve upon standard rejection sampling? By introducing iterative sampling, reward ranking, and a tunable hyperparameter $\beta$, it tailors the method for language model optimization.

References:

  1. Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P. J., & Liu, J. (2023). Statistical Rejection Sampling Improves Preference Optimization. ArXiv
  2. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., . . . Zhang, Z. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. ArXiv