[CHAPTER] Add 'Open R1 for Students' chapter #799
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Partial review, I'll take a closer look later.
@qgallouedec Thanks so much for your review. I responded to your comments. Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review that together.
@edbeeching Thanks for the review! I rewrote the section on GRPO in the RL page so that it's a simpler comparison of GRPO to DPO and PPO.
When you learn GRPO, there are questions that come naturally, and there are tons of issues on open-r1 and TRL that are always the same. I think the course is a really good place to address them. And it'll make our lives easier later on, since we'll just have to redirect the people who ask these questions:
chapters/en/chapter13/3.mdx
Outdated
This chapter is a crash course in paper reading. We will walk through the paper in simple terms, and then we will break down the key concepts and takeaways.
DeepSeek R1 represents a significant advancement in language model training, particularly in developing reasoning capabilities through reinforcement learning. The paper introduces a new reinforcement learning algorithm called Generalized Policy Rectification Optimization (GRPO).
Technically, it's not DeepSeek R1 that introduced GRPO; it's DeepSeek Math: https://arxiv.org/abs/2402.03300
Thanks for the review @qgallouedec, and sorry about the dodgy acronyms; I was moving fast.
Nice! Where's this notebook?
@mlabonne is working on it right now. I'll ping you when it needs a review, if that's ok?
Co-authored-by: Quentin Gallouédec <[email protected]>
* add marimo example of length based reward function with a slider
* move demo into TRL page
* experiment with marimo outside of prose
* update TOC with marimo example
* use marimo table for representation
* remove more snippet returns
* try with simple strings
* drop styling from marimo box
* try pure iframe component
* try without return values
* fall back to hello world marimo example
* try snippet after marimo
* define marimo in python script
* add real marimo example
* add real marimo example with length reward
* hide code and headers for tidiness
* add markdown for explanation
* add markdown for explanation
* move markdown up
* fix missing slider
* add notebooks to real locations
* remove experimentation page
* use correct urls and add todos
* update all image links due to hub org rename
* fix query params in notebook urls
* reorder images to match prose
Co-authored-by: Maxime Labonne <[email protected]>
Looks very dope! Just left some nits/suggestions!
**Enter Reinforcement Learning!** RL gives us a way to fine-tune these pre-trained LLMs to better achieve these desired qualities. It's like giving our LLM dog extra training to become a well-behaved and helpful companion, not just a dog that knows how to bark fluently!
## Reinforcement Learning from Human Feedback (RLHF)
Do you think we can make a diagram here to showcase these too: RLHF, DPO, PPO?
 | ||
|
||
<!-- @qgallouedec @mlabonne could you review this section please!? -->
You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and is directly related to the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence from the original (reference) policy, which caps how far the model can drift from it. As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.
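To make this concrete, here is a toy numerical sketch of the idea (this is not TRL's actual implementation; the KL estimator and the `beta`/`eps` values are illustrative assumptions):

```python
# Toy sketch of a GRPO-style per-token loss: clipped policy term + beta * KL penalty.
import torch

def grpo_loss_sketch(logp_new, logp_old, logp_ref, advantages, beta=0.04, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_term = -torch.min(ratio * advantages, clipped * advantages)
    # k3 estimator of KL(new || ref): exp(ref - new) - (ref - new) - 1, always >= 0
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return (policy_term + beta * kl).mean()

# At the start of training the current, old, and reference policies coincide,
# and the group-normalized advantages average to zero, so the loss is ~0.
logp = torch.log(torch.tensor([0.3, 0.5, 0.2]))
advantages = torch.tensor([1.0, -0.5, -0.5])  # zero-mean within the group
print(grpo_loss_sketch(logp, logp, logp, advantages))  # tensor(0.)

# As training progresses, logp_new drifts away from logp_ref, the KL term grows,
# and the reported loss rises even though the rewards are improving.
```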
Do you think we should address these questions more directly somewhere in the material?
I didn't yet get to the bottom of the course, so maybe you know better. The first two questions will come naturally when readers see the first learning curves.
This is the comment I refer to every time I have a question. Check the number of reactions. huggingface/open-r1#239 (comment)
@qgallouedec I added this section here to interpret the loss in a very simplified way.
This comment of yours is amazing! 🤯 But I think it may be too much information for this course.
What do you think of this approach? I improved this paragraph here to give a minimal explanation. Then, I added another page which expands on your comment in detail.
I think it's nicely explained this way!
Where is this other page?
You can add something like "for more details, including the underlying math, see ..."
It's not written yet. I'm going to release new pages based on engagement. So if this takes off, I'll add the deeper dive page.
This PR adds a chapter on how to build R1 for students.
Remaining work:
- GRPOTrainer
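For reference, a minimal sketch of what a GRPOTrainer exercise with a length-based reward could look like, following the TRL quickstart pattern (the model name, dataset, target length, and hyperparameters below are placeholders, not the chapter's final code):

```python
# Hypothetical sketch: fine-tune a small model with TRL's GRPOTrainer
# using a simple length-based reward function.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Reward completions whose length is close to a 50-character target.
    return [-abs(50 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```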