⚙️ Fixed formatting and pseudocode mistake in Chapter12/3.mdx #818

Open · wants to merge 1 commit into base: main
44 changes: 22 additions & 22 deletions chapters/en/chapter12/3.mdx
@@ -11,7 +11,7 @@ In the next chapter, we will build on this knowledge and implement GRPO in pract
The initial goal of the paper was to explore whether pure reinforcement learning could develop reasoning capabilities without supervised fine-tuning.

<Tip>
-Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in [chapter 11](/chapters/en/chapter11/1).
+Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in Chapter 11.
</Tip>

## The Breakthrough 'Aha' Moment
@@ -171,27 +171,27 @@ Now that we understand the key components of GRPO, let's look at the algorithm i

```
Input:
-    - initial_policy: Starting model to be trained
+    - current_policy: The model to be trained
     - reward_function: Function that evaluates outputs
     - training_prompts: Set of training examples
     - group_size: Number of outputs per prompt (typically 4-16)

Algorithm GRPO:
1. For each training iteration:
-    a. Set reference_policy = initial_policy (snapshot current policy)
+    a. Set reference_policy = current_policy (snapshot BEFORE updates)
     b. For each prompt in batch:
-        i. Generate group_size different outputs using initial_policy
+        i. Generate group_size different outputs using reference_policy
         ii. Compute rewards for each output using reward_function
         iii. Normalize rewards within group:
             normalized_advantage = (reward - mean(rewards)) / std(rewards)
-        iv. Update policy by maximizing the clipped ratio:
+        iv. Update current_policy by maximizing:
            min(prob_ratio * normalized_advantage,
-               clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
-           - kl_weight * KL(initial_policy || reference_policy)
+               clip(prob_ratio, 1-ε, 1+ε) * normalized_advantage)
+           - β * KL(current_policy || reference_policy)

-where prob_ratio is current_prob / reference_prob
+where prob_ratio is current_policy_prob / reference_policy_prob, and β is the KL weight

-Output: Optimized policy model
+Output: Optimized current_policy model
```

This algorithm shows how GRPO combines group-based advantage estimation with policy optimization while maintaining stability through clipping and KL divergence constraints.
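
To make the corrected objective above concrete, here is a minimal PyTorch sketch of the group-relative advantage and the clipped, KL-penalized update for a single prompt's group of outputs. It is an editor's illustration rather than code from this PR, the course, or any library: the function name `grpo_loss`, the per-output log-probability inputs, the default `epsilon`/`beta` values, and the simple sample-based KL estimate are all illustrative assumptions.

```python
import torch

def grpo_loss(policy_logprobs, ref_logprobs, rewards, epsilon=0.2, beta=0.04):
    """GRPO-style loss for one group of outputs sampled for a single prompt.

    policy_logprobs: (group_size,) log-probability of each output under current_policy
    ref_logprobs:    (group_size,) log-probability of each output under reference_policy
    rewards:         (group_size,) scalar reward for each output
    """
    # Group-relative advantage: normalize rewards within the group
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between current and reference policy
    prob_ratio = torch.exp(policy_logprobs - ref_logprobs)

    # Clipped surrogate objective, as in the pseudocode: min(unclipped, clipped)
    unclipped = prob_ratio * advantages
    clipped = torch.clamp(prob_ratio, 1 - epsilon, 1 + epsilon) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Crude sample-based stand-in for KL(current_policy || reference_policy);
    # practical implementations use a lower-variance estimator
    kl = (policy_logprobs - ref_logprobs).mean()

    # We maximize (surrogate - beta * KL), so the loss is its negation
    return -(surrogate - beta * kl)

# Toy usage with random numbers standing in for real model outputs
group_size = 8
policy_lp = torch.randn(group_size, requires_grad=True)
ref_lp = (policy_lp + 0.01 * torch.randn(group_size)).detach()
rewards = torch.rand(group_size)

loss = grpo_loss(policy_lp, ref_lp, rewards)
loss.backward()
print(f"loss: {loss.item():.4f}")
```

A full training loop would additionally generate the `group_size` completions per prompt and score them with the reward function before computing this loss; that surrounding machinery is what the practical implementations covered in the next section take care of.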
@@ -235,15 +235,15 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "Using more GPUs for training than any previous model",
-      explain: "The paper's innovation is in its algorithmic approach (GRPO) rather than computational resources used."
-    },
     {
       text: "The GRPO algorithm that enables learning from preferences with and without a reward model",
       explain: "Correct! GRPO's key innovation is its ability to directly optimize for preference rectification, making it more efficient than traditional RL methods.",
       correct: true
     },
+    {
+      text: "Using more GPUs for training than any previous model",
+      explain: "The paper's innovation is in its algorithmic approach (GRPO) rather than computational resources used."
+    },
     {
       text: "Creating a larger language model than existing ones",
       explain: "The innovation lies in the training methodology and GRPO algorithm, not in model size."
@@ -295,15 +295,15 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "It combines multiple models into one ensemble",
-      explain: "GRPO uses a single model to generate multiple solution attempts, not an ensemble of different models."
-    },
     {
       text: "It generates multiple solutions (4-16) for the same problem and evaluates them together",
       explain: "Correct! GRPO generates multiple attempts at solving the same problem, typically 4, 8, or 16 different attempts, which are then evaluated as a group.",
       correct: true
     },
+    {
+      text: "It combines multiple models into one ensemble",
+      explain: "GRPO uses a single model to generate multiple solution attempts, not an ensemble of different models."
+    },
     {
       text: "It splits the training data into different groups",
       explain: "GRPO's group formation involves generating multiple solutions for the same problem, not splitting training data."
@@ -315,18 +315,18 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "R1-Zero uses pure RL while R1 combines RL with supervised fine-tuning",
-      explain: "Correct! As shown in the comparison table, R1-Zero uses pure RL training while R1 uses a multi-phase approach combining supervised fine-tuning with RL, resulting in better language consistency.",
-      correct: true
-    },
     {
       text: "R1-Zero is smaller than R1",
       explain: "The difference is in their training approaches (pure RL vs. multi-phase), not their model sizes."
     },
     {
       text: "R1-Zero was trained on less data",
       explain: "The key distinction is their training methodology: pure RL for R1-Zero versus a combined SFT and RL approach for R1."
     },
+    {
+      text: "R1-Zero uses pure RL while R1 combines RL with supervised fine-tuning",
+      explain: "Correct! As shown in the comparison table, R1-Zero uses pure RL training while R1 uses a multi-phase approach combining supervised fine-tuning with RL, resulting in better language consistency.",
+      correct: true
+    }
]}
/>