⚙️ Fixed formatting and pseudocode mistake in Chapter12/3.mdx #818

Open · wants to merge 1 commit into base: main
44 changes: 22 additions & 22 deletions chapters/en/chapter12/3.mdx
@@ -11,7 +11,7 @@ In the next chapter, we will build on this knowledge and implement GRPO in pract
The initial goal of the paper was to explore whether pure reinforcement learning could develop reasoning capabilities without supervised fine-tuning.

<Tip>
-Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in [chapter 11](/chapters/en/chapter11/1).
+Up until that point, all the popular LLMs required some supervised fine-tuning, which we explored in Chapter 11.
</Tip>

## The Breakthrough 'Aha' Moment
@@ -171,27 +171,27 @@ Now that we understand the key components of GRPO, let's look at the algorithm i

```
Input:
-    - initial_policy: Starting model to be trained
+    - current_policy: The model to be trained
     - reward_function: Function that evaluates outputs
     - training_prompts: Set of training examples
     - group_size: Number of outputs per prompt (typically 4-16)

Algorithm GRPO:
1. For each training iteration:
-    a. Set reference_policy = initial_policy (snapshot current policy)
+    a. Set reference_policy = current_policy (snapshot BEFORE updates)
     b. For each prompt in batch:
-        i. Generate group_size different outputs using initial_policy
+        i. Generate group_size different outputs using reference_policy
         ii. Compute rewards for each output using reward_function
         iii. Normalize rewards within group:
             normalized_advantage = (reward - mean(rewards)) / std(rewards)
-        iv. Update policy by maximizing the clipped ratio:
+        iv. Update current_policy by maximizing:
            min(prob_ratio * normalized_advantage,
-               clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
-           - kl_weight * KL(initial_policy || reference_policy)
+               clip(prob_ratio, 1-ε, 1+ε) * normalized_advantage)
+           - β * KL(current_policy || reference_policy)

-where prob_ratio is current_prob / reference_prob
+where prob_ratio is current_policy_prob / reference_policy_prob, and β is the KL weight

-Output: Optimized policy model
+Output: Optimized current_policy model
```

This algorithm shows how GRPO combines group-based advantage estimation with policy optimization while maintaining stability through clipping and KL divergence constraints.
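
To make the corrected objective above concrete, here is a minimal PyTorch sketch of the group-relative advantage and the clipped, KL-penalized update for a single prompt's group of outputs. It is an editor's illustration rather than code from this PR, the course, or any library: the function name `grpo_loss`, the per-output log-probability inputs, the default `epsilon`/`beta` values, and the simple sample-based KL estimate are all illustrative assumptions.

```python
import torch

def grpo_loss(policy_logprobs, ref_logprobs, rewards, epsilon=0.2, beta=0.04):
    """GRPO-style loss for one group of outputs sampled for a single prompt.

    policy_logprobs: (group_size,) log-probability of each output under current_policy
    ref_logprobs:    (group_size,) log-probability of each output under reference_policy
    rewards:         (group_size,) scalar reward for each output
    """
    # Group-relative advantage: normalize rewards within the group
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between current and reference policy
    prob_ratio = torch.exp(policy_logprobs - ref_logprobs)

    # Clipped surrogate objective, as in the pseudocode: min(unclipped, clipped)
    unclipped = prob_ratio * advantages
    clipped = torch.clamp(prob_ratio, 1 - epsilon, 1 + epsilon) * advantages
    surrogate = torch.min(unclipped, clipped).mean()

    # Crude sample-based stand-in for KL(current_policy || reference_policy);
    # practical implementations use a lower-variance estimator
    kl = (policy_logprobs - ref_logprobs).mean()

    # We maximize (surrogate - beta * KL), so the loss is its negation
    return -(surrogate - beta * kl)

# Toy usage with random numbers standing in for real model outputs
group_size = 8
policy_lp = torch.randn(group_size, requires_grad=True)
ref_lp = (policy_lp + 0.01 * torch.randn(group_size)).detach()
rewards = torch.rand(group_size)

loss = grpo_loss(policy_lp, ref_lp, rewards)
loss.backward()
print(f"loss: {loss.item():.4f}")
```

A full training loop would additionally generate the `group_size` completions per prompt and score them with the reward function before computing this loss; that surrounding machinery is what the practical implementations covered in the next section take care of.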
@@ -235,15 +235,15 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "Using more GPUs for training than any previous model",
-      explain: "The paper's innovation is in its algorithmic approach (GRPO) rather than computational resources used."
-    },
     {
       text: "The GRPO algorithm that enables learning from preferences with and without a reward model",
       explain: "Correct! GRPO's key innovation is its ability to directly optimize for preference rectification, making it more efficient than traditional RL methods.",
       correct: true
     },
+    {
+      text: "Using more GPUs for training than any previous model",
+      explain: "The paper's innovation is in its algorithmic approach (GRPO) rather than computational resources used."
+    },
     {
       text: "Creating a larger language model than existing ones",
       explain: "The innovation lies in the training methodology and GRPO algorithm, not in model size."
@@ -295,15 +295,15 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "It combines multiple models into one ensemble",
-      explain: "GRPO uses a single model to generate multiple solution attempts, not an ensemble of different models."
-    },
     {
       text: "It generates multiple solutions (4-16) for the same problem and evaluates them together",
       explain: "Correct! GRPO generates multiple attempts at solving the same problem, typically 4, 8, or 16 different attempts, which are then evaluated as a group.",
       correct: true
     },
+    {
+      text: "It combines multiple models into one ensemble",
+      explain: "GRPO uses a single model to generate multiple solution attempts, not an ensemble of different models."
+    },
     {
       text: "It splits the training data into different groups",
       explain: "GRPO's group formation involves generating multiple solutions for the same problem, not splitting training data."
@@ -315,18 +315,18 @@ In the next section, we'll explore practical implementations of these concepts,

<Question
choices={[
-    {
-      text: "R1-Zero uses pure RL while R1 combines RL with supervised fine-tuning",
-      explain: "Correct! As shown in the comparison table, R1-Zero uses pure RL training while R1 uses a multi-phase approach combining supervised fine-tuning with RL, resulting in better language consistency.",
-      correct: true
-    },
     {
       text: "R1-Zero is smaller than R1",
       explain: "The difference is in their training approaches (pure RL vs. multi-phase), not their model sizes."
     },
     {
       text: "R1-Zero was trained on less data",
       explain: "The key distinction is their training methodology: pure RL for R1-Zero versus a combined SFT and RL approach for R1."
     },
+    {
+      text: "R1-Zero uses pure RL while R1 combines RL with supervised fine-tuning",
+      explain: "Correct! As shown in the comparison table, R1-Zero uses pure RL training while R1 uses a multi-phase approach combining supervised fine-tuning with RL, resulting in better language consistency.",
+      correct: true
+    }
]}
/>