📝 Minor edits to Chapter12/5.mdx #819

Open · wants to merge 1 commit into base: main

4 changes: 2 additions & 2 deletions chapters/en/chapter12/5.mdx
@@ -41,7 +41,7 @@ import wandb
wandb.login()
```

- You can do this exercise without logging in to Weights & Biases, but it's recommended to do so to track your experiments and interpret the results.
+ You can do this exercise without logging in to Weights & Biases, but it's recommended so you can track your experiments and interpret the results.
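
If you do log in, the sketch below shows one way a tracked run could be wrapped around the training step; the project and run names are placeholders rather than values taken from the chapter.

```python
# Hypothetical sketch: the project and run names below are placeholders, not
# values used by the chapter. Starting a named run groups the metrics discussed
# later (reward, loss) per experiment so they are easy to compare.
import wandb

run = wandb.init(project="grpo-course-exercise", name="baseline-run")
# ... run the GRPO training here; a trainer configured to report to
#     Weights & Biases will log its metrics to this active run ...
run.finish()
```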

## Load the dataset

@@ -164,7 +164,7 @@ As you can see, the reward from the reward function moves closer to 0 as the mod
![Reward from reward function](https://huggingface.co/reasoning-course/images/resolve/main/grpo/13.png)

<!-- @qgallouedec @mlabonne could you review this section please!? -->
- You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and is directly related to the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence (the cap relative to original policy) . As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.
+ You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and is directly related to the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence (the cap relative to original policy). As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.

![Loss](https://huggingface.co/reasoning-course/images/resolve/main/grpo/14.png)
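
To make the connection between the rising loss and the KL term concrete, here is a small, self-contained sketch (an illustration under the stated assumptions, not the trainer's actual implementation) of a common per-token KL approximation, exp(d) - d - 1 with d the log-probability gap to the reference policy: it is exactly zero when the trained policy matches the reference and grows as the two diverge.

```python
# Illustrative sketch only (assumption: this mirrors the shape of the KL term,
# not the exact code of any particular trainer).
import torch

def per_token_kl(ref_logps: torch.Tensor, policy_logps: torch.Tensor) -> torch.Tensor:
    """Non-negative KL approximation exp(d) - d - 1, with d = ref - policy log-probs."""
    d = ref_logps - policy_logps
    return torch.exp(d) - d - 1

ref = torch.tensor([-2.0, -1.5, -3.0])      # reference-policy log-probs for a few tokens
for shift in [0.0, 0.2, 0.5, 1.0]:          # growing divergence from the reference policy
    policy = ref + shift                    # trained policy puts more mass on rewarded tokens
    kl = per_token_kl(ref, policy).mean().item()
    print(f"shift={shift:.1f}  mean KL term={kl:.4f}")
# shift=0.0 gives exactly 0; larger shifts give a larger KL term, which is why a
# KL-proportional loss starts at zero and rises as training pulls the policy
# away from its starting point.
```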
