
Option to clip logprobs rlhf.get_batch_log_probs #2470

Open
krammnic opened this issue Mar 9, 2025 · 4 comments

Comments

krammnic (Contributor) commented Mar 9, 2025

RLHF procedures with modern DPO-style objectives can lead to degenerate solutions, e.g., the EOS token being dropped during generation. For instance, consider the output of a Qwen2.5 model after SimPO training with torchtune:


Einstein's theory of relativity describes how gravity arises from the curvature of spacetime caused by mass and energy. Energy energy energy energy energy energy energy energy energy energy energy... (repeated 50 times)

This is a fairly common problem, and the root cause is the behavior of the logarithm near 0 when computing log-probs, which are used to compute the rewards whose difference is optimized in DPO (https://arxiv.org/abs/2405.14734).


In DPO we sum the per-token log-probs; in most other methods we average them. In both cases we are not protected from such outliers. To put it more simply: if some tokens in the rejected sequences make it easy to tell which response is chosen and which is rejected, the model will learn to push their log-probs toward $-\infty$ (i.e., their probability goes to zero), while in trickier sequences the log-probs may hardly be optimized at all, which is what drives $P(\text{EOS}) \rightarrow 0$. (This is an empirical observation for DPO, SimPO, ORPO-like, and similar objectives.)
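
To make the outlier effect concrete, here is a tiny standalone PyTorch snippet (not torchtune code, and the numbers are made up) showing how a single collapsed token dominates both the summed and the averaged sequence log-prob:

```python
import torch

# Per-token log-probs of a 6-token rejected sequence; one token's probability
# has collapsed toward zero, so its log-prob is a large-magnitude outlier.
token_logps = torch.tensor([-1.2, -0.8, -1.5, -0.9, -1.1, -40.0])

print(token_logps.sum().item())   # ~ -45.5: the sum is dominated by the outlier (DPO-style)
print(token_logps.mean().item())  # ~ -7.58: the average is also dragged far down (SimPO-style)
```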

Sometimes a smaller learning rate or a larger $\beta$ (in the case of DPO) solves the problem, but it cannot be cured in all cases. The proposed fix is simple: add a clip_log_probs: True option to our DPO configs. When it is True, log-probs are clipped; otherwise they are left untouched.
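
A minimal sketch of how the flag could be applied inside a get_batch_log_probs-style helper; the signature, the min_log_prob threshold, and its default are illustrative assumptions, not torchtune's actual API:

```python
import torch
import torch.nn.functional as F

def get_batch_log_probs(
    logits: torch.Tensor,          # (batch, seq_len, vocab)
    labels: torch.Tensor,          # (batch, seq_len), already shifted/masked
    clip_log_probs: bool = False,  # proposed config flag
    min_log_prob: float = -50.0,   # hypothetical clipping threshold
) -> torch.Tensor:
    """Return per-sequence summed log-probs, optionally clipping each token's log-prob."""
    per_token_logps = torch.gather(
        F.log_softmax(logits.float(), dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    if clip_log_probs:
        # Prevent any single token from dragging the sequence log-prob toward -inf.
        per_token_logps = per_token_logps.clamp(min=min_log_prob)
    return per_token_logps.sum(-1)
```

With clip_log_probs=False nothing changes, so existing configs would keep their current behavior.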

krammnic (Contributor, Author) commented Mar 9, 2025

cc: @SalmanMohammadi @ebsmothers

SalmanMohammadi (Collaborator) commented

Hi @krammnic. Thanks for raising this interesting issue.

Could you point to any empirical evidence for this issue for DPO/PPO?

krammnic (Contributor, Author) commented Mar 9, 2025

@SalmanMohammadi Sure, I can collect some just by logging the log-probs. The SimPO authors introduced an SFT loss component to mitigate several issues, including this one, but it is not a full solution: https://github.com/princeton-nlp/SimPO

The underlying reasons are fairly intuitive, though (that's why we sometimes try increasing $\beta$ or adding a KL-divergence term).

krammnic (Contributor, Author) commented Mar 9, 2025

A small correction: we want to do three things, each controlled by the user (see the sketch below):

  1. Winsorize extreme values for the chosen log-probs
  2. Winsorize extreme values for the rejected log-probs
  3. Clip the minimum log-prob
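
A rough sketch of these three knobs, assuming we already have per-token log-prob tensors for the chosen and rejected sequences (all function names, flags, and thresholds here are hypothetical):

```python
from typing import Optional

import torch

def winsorize(per_token_logps: torch.Tensor, lower_q: float = 0.01) -> torch.Tensor:
    """Replace extremely low per-token log-probs with the lower_q quantile value."""
    lower = torch.quantile(per_token_logps, lower_q)
    return per_token_logps.clamp(min=lower.item())

def process_logps(
    chosen_logps: torch.Tensor,
    rejected_logps: torch.Tensor,
    winsorize_chosen: bool = False,        # 1. winsorize extremals for the chosen
    winsorize_rejected: bool = False,      # 2. winsorize extremals for the rejected
    min_log_prob: Optional[float] = None,  # 3. clip minimum logprob
):
    if winsorize_chosen:
        chosen_logps = winsorize(chosen_logps)
    if winsorize_rejected:
        rejected_logps = winsorize(rejected_logps)
    if min_log_prob is not None:
        chosen_logps = chosen_logps.clamp(min=min_log_prob)
        rejected_logps = rejected_logps.clamp(min=min_log_prob)
    return chosen_logps, rejected_logps
```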
