Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚙️ Fixed formatting and psuedocode mistake in Chapter12/3.mdx #818

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hesamsheikh
Copy link

Includes fixes to:

  1. Hyperlink in doesn't work, so the URL is removed.
  2. The pseudocode had a critical issue:
    in step b.i. "Generate group_size different outputs using initial_policy" has a mistake. According to 2.2.1 of the R1 paper, "GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑"
    Using initial_policy here would leak policy updates into the same iteration, and must be replaced by the initial_policy
    The variable namings are also somewhat difficult (intial_policy gets updated which is confusing), so initial_policy is changed into current_policy to avoid confusion.

Note: The pseudocode only includes the 𝜋𝜃𝑜𝑙𝑑 and 𝜋𝜃 from the paper, but the KL divergance uses a 𝜋reference as well which despite the naming in pseudocode, is not included. My assumption is that 𝜋𝜃𝑜𝑙𝑑 (reference_policy) is used in its place for simplicity, but it's better to acknowledge this in the course.

  1. The correct answers from the quiz were all the 1st choice, which is now jumbled.

@hesamsheikh hesamsheikh changed the title fixed formatting and psuedocode ⚙️ Fixed formatting and psuedocode mistake in Chapter12/3.mdx Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant