⚙️ Fixed formatting and pseudocode mistake in Chapter12/3.mdx #818
Includes fixes to:
In step b.i., "Generate group_size different outputs using initial_policy" is a mistake. According to section 2.2.1 of the R1 paper, "GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑".
Sampling with initial_policy here would leak policy updates into the same iteration, so it must be replaced by the old policy 𝜋𝜃𝑜𝑙𝑑 (a snapshot frozen at the start of the iteration).
The variable naming is also somewhat confusing (initial_policy gets updated, which is misleading), so initial_policy is renamed to current_policy.
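The corrected structure can be sketched with a toy policy. This is a minimal illustration of the snapshot-then-sample pattern, not the course's actual pseudocode; ToyPolicy, grpo_sampling_structure, and the version counter are all hypothetical stand-ins:

```python
import copy

class ToyPolicy:
    """Stand-in policy: a version counter plays the role of the weights."""
    def __init__(self):
        self.version = 0

    def generate(self, prompt):
        # "Generation" just stamps the prompt with the policy version.
        return f"{prompt}::v{self.version}"

    def update(self):
        # Stand-in for a gradient step on the current policy.
        self.version += 1

def grpo_sampling_structure(num_iterations=2, steps_per_iter=4, group_size=3):
    current_policy = ToyPolicy()
    sampled_versions = []
    for _ in range(num_iterations):
        # Freeze a snapshot at the start of the iteration: this is pi_theta_old.
        old_policy = copy.deepcopy(current_policy)
        for _ in range(steps_per_iter):
            # Sample the group from old_policy, NOT current_policy, so
            # in-iteration updates do not leak into the sampling distribution.
            outputs = [old_policy.generate("q") for _ in range(group_size)]
            sampled_versions.append(old_policy.version)
            current_policy.update()  # the policy moves; the snapshot does not
    return sampled_versions

# Every sample within an iteration comes from the same frozen version:
print(grpo_sampling_structure())  # → [0, 0, 0, 0, 4, 4, 4, 4]
```

Note that all four samples in the second iteration come from version 4, even though current_policy keeps advancing within that iteration; that is exactly the behavior the rename is meant to make unambiguous.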
Note: The pseudocode only includes 𝜋𝜃𝑜𝑙𝑑 and 𝜋𝜃 from the paper, but the KL divergence also uses a 𝜋reference, which, despite the naming in the pseudocode, is not included. My assumption is that 𝜋𝜃𝑜𝑙𝑑 (reference_policy) is used in its place for simplicity, but it's better to acknowledge this in the course.