MPO #2544
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Here is a rough colab notebook that I have created. In the notebook:
Here are my queries:
trl/trainer/dpo_trainer.py
Outdated
losses, chosen_rewards, rejected_rewards = self.dpo_loss(
    model_output["chosen_logps"], model_output["rejected_logps"], ref_chosen_logps, ref_rejected_logps
)
if "," in self.loss_type:
Ideally, at this point, we would have:

- `self.loss_type`: a list of strings (e.g. `["sigmoid", "bco_pair"]`)
- `self.loss_type_to_weights`: a dict of `str` to `float` which associates a weight with each loss type.
Parsing for loss type could be done directly in the config, as here:
trl/trl/trainer/nash_md_config.py
Lines 34 to 46 in fe4b5ef
mixture_coef: list[float] = field(
    default_factory=lambda: [0.5],
    metadata={
        "help": "Logit mixture coefficient for the model and reference model. If a list of floats is provided "
        "then the mixture coefficient is selected for each new epoch and the last coefficient is used for the "
        "rest of the epochs."
    },
)

def __post_init__(self):
    super().__post_init__()
    if hasattr(self.mixture_coef, "__len__") and len(self.mixture_coef) == 1:
        self.mixture_coef = self.mixture_coef[0]
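A minimal sketch of what that could look like for the DPO config (this is not the actual DPOConfig; the `loss_weights` field and the derived `loss_type_to_weights` attribute follow the discussion above and are assumptions about the PR's API):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LossConfigSketch:
    # In the real DPOConfig these fields would sit alongside the existing ones.
    loss_type: list[str] = field(
        default_factory=lambda: ["sigmoid"],
        metadata={"help": "Loss type(s): a single string, a comma-separated string, or a list of strings."},
    )
    loss_weights: Optional[dict[str, float]] = field(
        default=None,
        metadata={"help": "Optional weight per loss type; keys must appear in `loss_type`."},
    )

    def __post_init__(self):
        # Normalize a single (possibly comma-separated) string into a list,
        # mirroring the mixture_coef handling shown above.
        if isinstance(self.loss_type, str):
            self.loss_type = [t.strip() for t in self.loss_type.split(",")]
        # Default every configured loss to weight 1.0 unless overridden.
        weights = self.loss_weights or {}
        self.loss_type_to_weights = {t: weights.get(t, 1.0) for t in self.loss_type}

For example, `LossConfigSketch(loss_type="sigmoid,bco_pair", loss_weights={"bco_pair": 0.1})` would yield `loss_type_to_weights == {"sigmoid": 1.0, "bco_pair": 0.1}`.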
@qgallouedec Here is a rough colab notebook with the current MPO training. Let me know what you think.
Looks good! Some initial comments :)
trl/trainer/dpo_config.py
Outdated
loss_weights: Optional[Dict[str, float]] = field(
    default_factory=lambda: ["your_values"]
)
loss_type: List[str] | str = field(
I need to check if this works with the parser and the CLI.
If you could let me know how you might check it, I could do it and report back to you.
running this should work:
trl dpo --output_dir tmp_dir --model_name_or_path trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 --dataset_name trl-internal-testing/zen --dataset_config standard_preference --report_to none
I tried the above command, which resulted in an issue with `self.loss_type` being `None`.
Upon changing the command to
trl dpo --output_dir tmp_dir --model_name_or_path trl-internal-testing/tiny-Qwen2ForCausalLM-2.5 --dataset_name trl-internal-testing/zen --dataset_config standard_preference --report_to none --loss-type sigmoid
it started to train. I am not sure why this happens 🤔
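One way to reproduce the parsing in isolation, outside the full trl dpo command, is to run the config through HfArgumentParser directly (a sketch only; the root cause of the `None` default is not confirmed here, but the trl CLI parser is built on HfArgumentParser):

from transformers import HfArgumentParser
from trl import DPOConfig

parser = HfArgumentParser(DPOConfig)

# Parse once without --loss_type and once with it, and compare what the
# dataclass actually receives as the default.
(no_flag,) = parser.parse_args_into_dataclasses(args=["--output_dir", "tmp_dir"])
(with_flag,) = parser.parse_args_into_dataclasses(args=["--output_dir", "tmp_dir", "--loss_type", "sigmoid"])
print("default:", no_flag.loss_type, "| explicit:", with_flag.loss_type)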
Really nice!!! But don't do this (important):

# Apply the chat template
prompt = processor.apply_chat_template(prompt, tokenize=False)
chosen = processor.apply_chat_template(chosen, tokenize=False)
rejected = processor.apply_chat_template(rejected, tokenize=False)

The DPO Trainer handles applying the chat template. See #1930 for more info. This code snippet is present in so many examples online, it's a scourge. 😩
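For reference, a minimal sketch of the conversational format the trainer templates itself (the tiny test model from the command above is reused here as a placeholder; the toy dataset content is invented for illustration):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Conversational format: lists of {"role": ..., "content": ...} messages.
# No manual apply_chat_template call -- the trainer applies the template itself.
train_dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": "What color is the sky?"}]],
    "chosen": [[{"role": "assistant", "content": "It is blue."}]],
    "rejected": [[{"role": "assistant", "content": "It is green."}]],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="tmp_dir", report_to="none"),
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()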
trl/trainer/dpo_config.py
Outdated
@@ -15,7 +15,7 @@
 import warnings
 from dataclasses import dataclass, field
 from enum import Enum
-from typing import Any, Callable, Optional, Union
+from typing import Any, Callable, Optional, Union, List, Dict
you can use `list` and `dict` instead
loss_weights (`dict[str, float]` or `None`, *optional*, defaults to `None`):
    Use to weight a combination of losses. The keys must be in `loss_type`. By default (if not specified
    in the dict), the weight for a loss in `loss_type` is 1.0.
loss_type (`str` or `list`, *optional*, defaults to `"sigmoid"`):
can you also document that, when a list is passed, the loss is the sum of these values?
add "sft"
as well
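Putting both suggestions together, the docstring entries could read roughly like this (the wording is only a sketch; the exact list of supported loss types is the PR author's call):

loss_type (`str` or `list[str]`, *optional*, defaults to `"sigmoid"`):
    Type of loss to use. Can also be a list of loss types (e.g. `["sigmoid", "bco_pair", "sft"]`); in that
    case the final loss is the sum of the individual losses, each scaled by its entry in `loss_weights`.
loss_weights (`dict[str, float]` or `None`, *optional*, defaults to `None`):
    Weights for the losses in `loss_type`. Keys must be in `loss_type`; any loss type not present in the
    dict uses a weight of 1.0.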
curr_losses, curr_chosen_rewards, curr_rejected_rewards = self.dpo_loss(
    curr_loss_type, model_output["chosen_logps"], model_output["rejected_logps"], ref_chosen_logps, ref_rejected_logps
)
curr_loss_weight = getattr(self.loss_weights, curr_loss_type, 1.0)
nice!
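For reference, a sketch of how the weighted combination might be assembled. Note that `self.loss_weights` is a dict, so `dict.get` is presumably intended rather than `getattr`, which would always fall back to 1.0; variable names follow the diff above, and how rewards are aggregated across loss types is the PR's design choice (they are simply summed here):

losses, chosen_rewards, rejected_rewards = 0.0, 0.0, 0.0
for curr_loss_type in self.loss_type:
    curr_losses, curr_chosen_rewards, curr_rejected_rewards = self.dpo_loss(
        curr_loss_type,
        model_output["chosen_logps"],
        model_output["rejected_logps"],
        ref_chosen_logps,
        ref_rejected_logps,
    )
    # Look the weight up in the dict, defaulting to 1.0 when the loss type is not listed.
    curr_loss_weight = (self.loss_weights or {}).get(curr_loss_type, 1.0)
    losses = losses + curr_loss_weight * curr_losses
    chosen_rewards = chosen_rewards + curr_chosen_rewards
    rejected_rewards = rejected_rewards + curr_rejected_rewards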
@qgallouedec I have resolved the merge conflicts and have also worked on the review suggestions. Could you help me with another round of review? If this looks good, I can start a small training run on VLM with MPO. WDYT?
@qgallouedec a gentle ping here!
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.