DeepSeek-R1-Zero: Enhanced LLM Reasoning for Natural Language Understanding

2025-01-24
5 min read

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for various applications. One of the most exciting developments in this area is the DeepSeek-R1-Zero model, which leverages reinforcement learning (RL) to enhance reasoning capabilities without relying on supervised fine-tuning (SFT). This blog post will guide you through the intricacies of DeepSeek-R1-Zero, from a beginner's perspective to more advanced technical details.

Introduction to DeepSeek-R1-Zero

DeepSeek-R1-Zero is a groundbreaking model designed to improve the reasoning abilities of LLMs through pure reinforcement learning. Unlike traditional models that require supervised fine-tuning, DeepSeek-R1-Zero skips this step, allowing it to develop reasoning skills autonomously. This approach not only saves computational resources but also provides insights into how models can learn complex problem-solving strategies from scratch.

Understanding Reinforcement Learning in DeepSeek-R1-Zero

What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative reward. In the context of DeepSeek-R1-Zero, the model learns to generate better reasoning outputs by optimizing a policy model through an RL algorithm called Group Relative Policy Optimization (GRPO).

Group Relative Policy Optimization (GRPO)

GRPO is the backbone of DeepSeek-R1-Zero's learning process. It optimizes the policy model, denoted $\pi_\theta$, to generate better reasoning outputs by maximizing an objective function. Here's a detailed breakdown of how GRPO works:

Objective Function

The objective function, $J_{GRPO}(\theta)$, is defined as:

$$J_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right) - \beta\, D_{KL}\left( \pi_\theta(O|q)\ \|\ \pi_r(O|q) \right) \right]$$

To make the equation easier to parse, let's break it down term by term (a code sketch of the full objective follows the symbol definitions below):

  1. Expectation and Sampling:

    $\mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)} \left[ \cdot \right]$

  2. Summation and Minimization:

    $\frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\epsilon,\ 1+\epsilon \right) A_i \right)$

  3. KL Divergence Term:

    $\beta\, D_{KL}\left( \pi_\theta(O|q)\ \|\ \pi_r(O|q) \right)$

  • $\mathbb{E}$ is the expectation over the distribution of questions $P(Q)$.
  • $\{o_i\}_{i=1}^{G}$ is a group of outputs sampled from the old policy $\pi_{\theta_{old}}$ given a question $q$.
  • $\pi_\theta(o|q)$ is the probability of output $o$ given question $q$ under the current policy.
  • $A_i$ is the advantage, representing how much better an output is compared to the average output in the group.
  • $\text{clip}(x, 1-\epsilon, 1+\epsilon)$ is a clipping function that limits the policy update to a certain range.
  • $\beta$ is a hyper-parameter, and $D_{KL}(\pi_\theta \| \pi_r)$ is the Kullback-Leibler divergence, which measures the difference between the current policy and the reference policy $\pi_r$.
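To make these pieces concrete, here is a minimal PyTorch sketch of the objective for a single group of outputs. The function and variable names are illustrative rather than taken from the DeepSeek-R1 paper or any particular library, the per-output log-probabilities are assumed to be precomputed, the default values for $\epsilon$ and $\beta$ are placeholders, and the KL term uses the common unbiased estimator $\frac{\pi_r}{\pi_\theta} - \log\frac{\pi_r}{\pi_\theta} - 1$ rather than an exact divergence.

```python
import torch

def grpo_objective(logp_new: torch.Tensor,    # log pi_theta(o_i|q), shape (G,)
                   logp_old: torch.Tensor,    # log pi_theta_old(o_i|q), shape (G,)
                   logp_ref: torch.Tensor,    # log pi_r(o_i|q), shape (G,)
                   advantages: torch.Tensor,  # A_i, shape (G,)
                   eps: float = 0.2,          # clip range (illustrative value)
                   beta: float = 0.01         # KL coefficient (illustrative value)
                   ) -> torch.Tensor:
    # Probability ratio pi_theta / pi_theta_old for each sampled output.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the smaller of the unclipped and clipped terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # KL penalty toward the reference policy, via the unbiased estimator
    # ratio_ref - log(ratio_ref) - 1 with ratio_ref = pi_r / pi_theta.
    ratio_ref = torch.exp(logp_ref - logp_new)
    kl = ratio_ref - torch.log(ratio_ref) - 1.0

    # Average over the group; doing gradient ascent on this maximizes J_GRPO.
    return (surrogate - beta * kl).mean()
```

The clipping keeps any single update from moving the policy too far from the one that generated the samples, while the KL penalty keeps the policy close to the reference model throughout training.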

Sampling Outputs

For each question $q$, GRPO samples a group of $G$ outputs $\{o_1, o_2, ..., o_G\}$ from the old policy $\pi_{\theta_{old}}$. This set of outputs allows the algorithm to compare and contrast the different responses the model can generate for a single question.
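As a quick illustration, the sampling step could look like the snippet below. `old_policy.generate` is a hypothetical stand-in for whatever sampling interface your inference stack provides; it is not an API defined by the paper.

```python
def sample_group(old_policy, question: str, G: int, temperature: float = 1.0) -> list[str]:
    """Draw G candidate responses for one question from the frozen old policy.

    `old_policy.generate` is a placeholder; substitute your own sampling call
    (e.g., a call to an inference server or a generation library).
    """
    return [old_policy.generate(question, temperature=temperature) for _ in range(G)]
```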

Advantage Calculation

The advantage $A_i$ is calculated using the rewards $\{r_1, r_2, ..., r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, ..., r_G\})}{\text{std}(\{r_1, r_2, ..., r_G\})}$$
  • $r_i$ is the reward associated with a particular output $o_i$.
  • $\text{mean}(\{r_1, r_2, ..., r_G\})$ is the average reward of all the outputs in the group.
  • $\text{std}(\{r_1, r_2, ..., r_G\})$ is the standard deviation of the rewards within the group.

The advantage quantifies how much better a particular output is compared to the average output in the group, in terms of the reward. It is normalized by the standard deviation to ensure the stability of learning.
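A small Python sketch of this normalization follows, assuming plain float rewards. The tiny epsilon guarding against a zero standard deviation (e.g., when every output in the group gets the same reward) is an implementation convenience, not something specified in the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """A_i = (r_i - mean(rewards)) / std(rewards), computed within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: correct outputs (reward 1.0) get positive advantages,
# incorrect ones (reward 0.0) get negative advantages.
print(group_advantages([1.0, 0.0, 0.0, 1.0, 1.0]))
```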

Reward System

The reward system provides feedback to the model, indicating the quality of its outputs. DeepSeek-R1-Zero uses a rule-based reward system that consists of two types of rewards:

  1. Accuracy Rewards: These rewards are given when the model produces correct answers. For math problems, the final answer must be written in a specified format (for example, inside a box) so it can be verified automatically by rules. For coding problems, a compiler is used to check the code against predefined test cases.
  2. Format Rewards: These rewards incentivize the model to structure its responses with the reasoning process enclosed between '<think>' and '</think>' tags. This promotes a consistent structure for the model's outputs, which can help with interpretability. A sketch of both reward checks follows this list.
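Here is a hedged sketch of what such rule-based checks could look like in Python. The `\boxed{...}` answer convention is an assumption used for illustration; the paper only requires that the final answer appear in a specified, automatically checkable format.

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the final boxed answer matches the reference answer, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, re.DOTALL) else 0.0
```

Because both signals are computed by simple rules rather than a learned reward model, they are cheap to evaluate and hard for the policy to exploit through reward hacking.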

Key Achievements

  1. Autonomous Learning: DeepSeek-R1-Zero demonstrates that LLMs can develop reasoning skills through pure RL without any supervised data.
  2. Strong Reasoning Capabilities: The model achieves performance levels comparable to OpenAI-o1-0912 on certain benchmarks.
  3. Majority Voting: Using majority voting over multiple sampled answers, its performance on AIME 2024 improves further, exceeding that of OpenAI-o1-0912 (a minimal sketch of the voting step follows below).
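Majority voting (often called self-consistency) is simple to implement: sample several completions for the same question, extract each final answer, and return the most frequent one. The `extract_answer` argument below is a hypothetical helper whose details depend on the answer format.

```python
from collections import Counter
from typing import Callable

def majority_vote(completions: list[str], extract_answer: Callable[[str], str]) -> str:
    """Return the most common extracted answer among the sampled completions."""
    answers = [extract_answer(c) for c in completions]
    return Counter(answers).most_common(1)[0][0]
```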

Conclusion

DeepSeek-R1-Zero is a revolutionary model that learns to reason through pure reinforcement learning, without supervised fine-tuning on reasoning data. It demonstrates the power of RL to enable models to develop complex problem-solving skills and improve their performance over time. While it has some limitations, the insights gained from DeepSeek-R1-Zero pave the way for future advancements in the field of large language models.

For more, check out this page: Rejection Sampling in DeepSeek-R1
