
[CHAPTER] Add 'Open R1 for Students' chapter #799

Open · wants to merge 34 commits into base: main

Conversation

@burtenshaw (Collaborator) commented Feb 24, 2025

This PR adds a chapter on how to build R1 for students.

Remaining work is:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) left a comment


Partial review, I'll take a closer look later.

@burtenshaw (Collaborator, Author)

@qgallouedec Thanks so much for your review. I responded to your comments.

Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

@burtenshaw (Collaborator, Author)

@edbeeching Thanks for the review!

I rewrote the GRPO section on the RL page so that it's a simpler comparison of GRPO with DPO and PPO.

@burtenshaw burtenshaw self-assigned this Feb 25, 2025
@qgallouedec (Member) commented Feb 25, 2025

When you learn GRPO, there are questions that come up naturally, and there are tons of issues on open-r1 and TRL that are always the same. I think the course is a really good place to address them. It'll also make our lives easier later on: we'll just have to redirect the people who ask these questions:

  • Why is the loss zero initially?
  • Why does the loss increase? Is that normal?
  • Rewards are normalized (which gives the advantages), and in the loss formula we take the average of the advantages, i.e. 0.0. How come it's learning? (see the sketch just below)
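
To make the last question concrete, here is a minimal numerical sketch (illustrative only, not TRL's implementation) of group-relative advantages: they are zero-mean within each group, so their average is ~0.0, but each completion still carries its own non-zero weight, which is what produces a learning signal.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative, not TRL's code).
# The rewards below are hypothetical scores for one group of 4 completions.
import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 0.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages)         # ~[ 1.51, -0.90,  0.30, -0.90]: non-zero per completion
print(advantages.mean())  # ~0.0: averaging hides the per-completion signal
```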


This chapter is a crash course in paper reading. We will walk through the paper in simple terms, and then break down the key concepts and takeaways.

DeepSeek R1 represents a significant advancement in language model training, particularly in developing reasoning capabilities through reinforcement learning. The paper introduces a new reinforcement learning algorithm called Generalized Policy Rectification Optimization (GRPO).
Member

Technically, it's not DeepSeek R1 that introduced GRPO; it's DeepSeek Math: https://arxiv.org/abs/2402.03300

@burtenshaw (Collaborator, Author)

Thanks for the review @qgallouedec, and sorry about the dodgy acronyms; I was moving fast.

  • Why is the loss zero initially?
  • Why does the loss increase? Is that normal?
  • Rewards are normalized (which gives the advantages), and in the loss formula we take the average of the advantages, i.e. 0.0. How come it's learning?

Do you think we should address these questions more directly somewhere in the material?

@qgallouedec (Member)

@qgallouedec Thanks so much for your review. I responded to your comments.

Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

Nice! Where's this notebook?

@qgallouedec (Member)

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better.
The first two questions will come up naturally when readers see the first learning curves.

@burtenshaw (Collaborator, Author)

@qgallouedec Thanks so much for your review. I responded to your comments.
Just so you know, there are also notebook exercises coming that will use a lot of TRL, in case you want to wait and review them together.

Nice! Where's this notebook?

@mlabonne is working on it right now. I'll ping you when it needs a review, if that's OK?

burtenshaw and others added 7 commits February 26, 2025 09:32
Co-authored-by: Quentin Gallouédec <[email protected]>
* add marimo example of length based reward function with a slider

* move demo into TRL page

* experiment with marimo outside of prose

* update TOC with marimo example

* use marimo table for representation

* remove more snippet returns

* try with simple strings

* drop styling from marimo box

* try pure iframe component

* try without return values

* fall back to hello world marimo example

* try snippet after marimo

* define marimo in python script

* add real marimo example

* add real marimo example with length reward

* hide code and headers for tidyness

* add markdown for explanaition

* add markdown for explanaition

* move markdown up

* fix missing slider

* add notebooks to real locations

* remove experimentation page

* use correct urls and add todos

* update all image links due to hub org rename

* fix query params in notebook urls

* reorder images to match prose
Co-authored-by: Maxime Labonne <[email protected]>
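
The commits above add a marimo demo built around a length-based reward function. As a rough sketch of what such a reward function might look like, in roughly the shape TRL's GRPOTrainer expects (one score per completion; the fixed target length below is an illustrative stand-in for the demo's slider value):

```python
# Hypothetical length-based reward function: it receives the generated completions
# and returns one score per completion (higher is better, peaking at the target length).
def reward_len(completions, **kwargs):
    target_len = 20  # illustrative target; the marimo demo would drive this with a slider
    return [-abs(target_len - len(completion)) for completion in completions]
```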
@Vaibhavs10 (Member) left a comment


Looks very dope! Just left some nits/suggestions!


**Enter Reinforcement Learning!** RL gives us a way to fine-tune these pre-trained LLMs to better achieve these desired qualities. It's like giving our LLM dog extra training to become a well-behaved and helpful companion, not just a dog that knows how to bark fluently!

## Reinforcement Learning from Human Feedback (RLHF)
Member

Do you think we can make a diagram here to showcase these too: RLHF, DPO, PPO?

@qgallouedec (Member)

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better. The first two questions will come up naturally when readers see the first learning curves.

This is the comment I refer to every time I have a question. Check the number of reactions.
huggingface/open-r1#239 (comment)

![Reward from reward function](https://huggingface.co/reasoning-course/images/resolve/main/grpo/13.png)

<!-- @qgallouedec @mlabonne could you review this section please!? -->
You might notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and follows directly from the mathematical formulation of the algorithm. The loss in GRPO is proportional to the KL divergence, which measures how far the current policy has drifted from the original policy. As training progresses, the model learns to generate text that better matches the reward function, causing it to diverge more from its initial policy. This increasing divergence is reflected in the rising loss value, which actually indicates that the model is successfully adapting to optimize for the reward function.
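
As a rough numerical illustration of this (a simplified sketch, not the exact TRL loss; the KL coefficient below is a hypothetical value): on the first step the policy matches the reference, so the importance ratio is 1 and the KL term is 0, and because the group-relative advantages are zero-mean the logged loss starts at ~0. As the policy drifts away from the reference, the KL term grows and the logged loss rises.

```python
# Simplified sketch of the logged GRPO loss (illustrative, not the exact TRL code).
import numpy as np

def grpo_loss_value(advantages, kl, beta=0.04):      # beta: hypothetical KL coefficient
    ratio = np.ones_like(advantages)                 # on-policy update: pi_theta / pi_old == 1
    return -(ratio * advantages).mean() + beta * kl  # clipping omitted for brevity

advantages = np.array([1.5, -0.9, 0.3, -0.9])        # zero-mean group-relative advantages

print(grpo_loss_value(advantages, kl=0.0))           # step 0: KL is 0 -> loss ~ 0.0
print(grpo_loss_value(advantages, kl=2.5))           # later: policy has diverged -> loss ~ 0.1
```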
@burtenshaw (Collaborator, Author) commented Feb 27, 2025

Do you think we should address these questions more directly somewhere in the material?

I haven't yet gotten to the bottom of the course, so maybe you know better. The first two questions will come up naturally when readers see the first learning curves.

This is the comment I refer to every time I have a question. Check the number of reactions. huggingface/open-r1#239 (comment)

@qgallouedec I added this section here to interpret the loss in a very simplified way.

This comment of yours is amazing! 🤯 But I think it may be too much information for this course.

What do you think of this approach? I improved this paragraph here to give a minimal explanation. Then I added another page which expands on your comment in detail.

Member

I think it's nicely explained this way!

Member

Where is this other page?
You can add something like "for more details, including the underlying math, see ..."

@burtenshaw (Collaborator, Author) commented Feb 28, 2025

It's not written yet. I'm going to release new pages based on engagement. So if this takes off, I'll add the deeper dive page.
