
[CHAPTER] Add 'Open R1 for Students' chapter #799

Open · wants to merge 34 commits into base: main

Conversation

@burtenshaw (Collaborator) commented Feb 24, 2025

This PR adds a chapter on how to build R1 for students.

Remaining work is:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) left a comment


Partial review, I'll take a closer look later.

@burtenshaw (Collaborator, Author)

@qgallouedec Thanks so much for your review. I responded to your comments.

Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

@burtenshaw (Collaborator, Author)

@edbeeching Thanks for the review!

I rewrote the GRPO section on the RL page so that it's a simpler comparison of GRPO with DPO and PPO.

@burtenshaw burtenshaw self-assigned this Feb 25, 2025
@qgallouedec (Member) commented Feb 25, 2025

When you learn GRPO, there are questions that come up naturally, and there are tons of issues on open-r1 and TRL that are always the same. I think the course is a really good place to address them. It'll also make our lives easier later on: we'll just have to redirect the people who ask these questions:

  • Why is the loss zero initially?
  • Why does the loss increase? Is that normal?
  • Rewards are normalized (which gives the advantages), and in the loss formula we take the average of the advantages, i.e. 0.0. How come it's learning? (see the sketch just below)
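
To make the last question concrete, here is a minimal numerical sketch (illustrative only, not TRL's implementation) of group-relative advantages: they are zero-mean within each group, so their average is ~0.0, but each completion still carries its own non-zero weight, which is what produces a learning signal.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative, not TRL's code).
# The rewards below are hypothetical scores for one group of 4 completions.
import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 0.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages)         # ~[ 1.51, -0.90,  0.30, -0.90]: non-zero per completion
print(advantages.mean())  # ~0.0: averaging hides the per-completion signal
```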


This chapter is a crash course in paper reading. We will walk through the paper in simple terms, and then break down the key concepts and takeaways.

DeepSeek R1 represents a significant advancement in language model training, particularly in developing reasoning capabilities through reinforcement learning. The paper introduces a new reinforcement learning algorithm called Generalized Policy Rectification Optimization (GRPO).
Member

Technically, it's not DeepSeek R1 that introduced GRPO; it's DeepSeek Math: https://arxiv.org/abs/2402.03300

@burtenshaw (Collaborator, Author)

Thanks for the review @qgallouedec, and sorry about the dodgy acronyms; I was moving fast.

  • Why is the loss zero initially?
  • Why does the loss increase? Is that normal?
  • Rewards are normalized (which gives the advantages), and in the loss formula we take the average of the advantages, i.e. 0.0. How come it's learning?

Do you think we should address these questions more directly somewhere in the material?

@qgallouedec (Member)

@qgallouedec Thanks so much for your review. I responded to your comments.

Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

Nice! Where's this notebook?

@qgallouedec (Member)

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better.
The first two questions will come up naturally when readers see the first learning curves.

@burtenshaw (Collaborator, Author)

@qgallouedec Thanks so much for your review. I responded to your comments.
Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

Nice! Where's this notebook?

@mlabonne is working on it right now. I'll ping you when it needs a review, if that's OK?

burtenshaw and others added 7 commits February 26, 2025 09:32
Co-authored-by: Quentin Gallouédec <[email protected]>
* add marimo example of length based reward function with a slider

* move demo into TRL page

* experiment with marimo outside of prose

* update TOC with marimo example

* use marimo table for representation

* remove more snippet returns

* try with simple strings

* drop styling from marimo box

* try pure iframe component

* try without return values

* fall back to hello world marimo example

* try snippet after marimo

* define marimo in python script

* add real marimo example

* add real marimo example with length reward

* hide code and headers for tidyness

* add markdown for explanaition

* add markdown for explanaition

* move markdown up

* fix missing slider

* add notebooks to real locations

* remove experimentation page

* use correct urls and add todos

* update all image links due to hub org rename

* fix query params in notebook urls

* reorder images to match prose
Co-authored-by: Maxime Labonne <[email protected]>
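
The commits above add a marimo demo built around a length-based reward function. As a rough sketch of what such a reward function might look like, in roughly the shape TRL's GRPOTrainer expects (one score per completion; the fixed target length below is an illustrative stand-in for the demo's slider value):

```python
# Hypothetical length-based reward function: it receives the generated completions
# and returns one score per completion (higher is better, peaking at the target length).
def reward_len(completions, **kwargs):
    target_len = 20  # illustrative target; the marimo demo would drive this with a slider
    return [-abs(target_len - len(completion)) for completion in completions]
```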
@Vaibhavs10 (Member) left a comment


Looks very dope! Just left some nits/suggestions!


**Enter Reinforcement Learning!** RL gives us a way to fine-tune these pre-trained LLMs to better achieve these desired qualities. It's like giving our LLM dog extra training to become a well-behaved and helpful companion, not just a dog that knows how to bark fluently!

## Reinforcement Learning from Human Feedback (RLHF)
Member

Do you think we can make a diagram here to showcase these too: RLHF, DPO, PPO?

@qgallouedec (Member)

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better. The first two questions will come up naturally when readers see the first learning curves.

This is the comment I refer to every time I have a question. Check the number of reactions.
huggingface/open-r1#239 (comment)

![Reward from reward function](https://huggingface.co/reasoning-course/images/resolve/main/grpo/13.png)

<!-- @qgallouedec @mlabonne could you review this section please!? -->
You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and follows directly from the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence, which measures how far the current policy has drifted from the original policy. As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.
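
As a rough numerical illustration of this (a simplified sketch, not the exact TRL loss; the KL coefficient below is a hypothetical value): on the first step the policy matches the reference, so the importance ratio is 1 and the KL term is 0, and because the group-relative advantages are zero-mean the logged loss starts at ~0. As the policy drifts away from the reference, the KL term grows and the logged loss rises.

```python
# Simplified sketch of the logged GRPO loss (illustrative, not the exact TRL code).
import numpy as np

def grpo_loss_value(advantages, kl, beta=0.04):      # beta: hypothetical KL coefficient
    ratio = np.ones_like(advantages)                 # on-policy update: pi_theta / pi_old == 1
    return -(ratio * advantages).mean() + beta * kl  # clipping omitted for brevity

advantages = np.array([1.5, -0.9, 0.3, -0.9])        # zero-mean group-relative advantages

print(grpo_loss_value(advantages, kl=0.0))           # step 0: KL is 0 -> loss ~ 0.0
print(grpo_loss_value(advantages, kl=2.5))           # later: policy has diverged -> loss ~ 0.1
```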
@burtenshaw (Collaborator, Author) commented Feb 27, 2025

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better. The first two questions will come up naturally when readers see the first learning curves.

This is the comment I refer to every time I have a question. Check the number of reactions. huggingface/open-r1#239 (comment)

@qgallouedec I added this section here to interpret the loss in a very simplified way.

This comment of yours is amazing! 🤯 But I think it may be too much information for this course.

What do you think of this approach? I improved this paragraph here to give a minimal explanation. Then I added another page which expands on your comment in detail.

Member

I think it's nicely explained this way!

Member

Where is this other page?
You can add something like "for more details, including the underlying math, see ..."

@burtenshaw (Collaborator, Author) commented Feb 28, 2025

It's not written yet. I'm going to release new pages based on engagement. So if this takes off, I'll add the deeper dive page.
