[WIP]Implement llama4 HF format to DCP converter #1104

Open · wants to merge 1 commit into main

Conversation

@fegin (Contributor) commented Apr 15, 2025

**Why do we need this?**
There have been many asks to get HF checkpoints to work with TorchTitan. There are already workarounds for this problem, but the converted DCP checkpoints generally result in slow loading when used for subsequent training.

This PR addresses the problem by resharding the full weights into the sharding that will actually be used for training. In other words, we perform an offline resharding up front to avoid the long loading time of an online resharding later.
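
To make the offline-resharding idea concrete, here is a minimal sketch (not the actual converter code): each rank loads HF weights, distributes every tensor into the placement the training job will use, and saves a DCP checkpoint. The helper `hf_to_titan_fqn` and the `Shard(0)` placement are illustrative assumptions, not the real mapping or sharding.

```python
# Minimal sketch of offline resharding, assuming torch.distributed is already
# initialized and every rank participates in the conversion.
# `hf_to_titan_fqn` is a hypothetical mapping function; Shard(0) is only an
# illustrative placement -- the real placement must match the training setup.
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from safetensors.torch import load_file
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor


def convert_hf_to_dcp(safetensor_files, output_dir, hf_to_titan_fqn):
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    state_dict = {}
    for path in safetensor_files:
        for hf_name, full_tensor in load_file(path).items():
            fqn = hf_to_titan_fqn(hf_name)
            # Reshard offline: place the full tensor exactly the way the
            # training job will shard it, so the later load is a fast resume.
            state_dict[fqn] = distribute_tensor(full_tensor, mesh, [Shard(0)])
    dcp.save(state_dict, checkpoint_id=output_dir)
```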

The converter also performs concurrent file loads across multiple trainers.
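
The concurrency can be as simple as each trainer reading a disjoint slice of the HF shard files; a hedged illustration (the round-robin split below is an assumption, not necessarily what this PR does):

```python
# Hedged illustration: spread the HF shard files across ranks so reads happen
# concurrently instead of serially on one trainer.
import torch.distributed as dist


def files_for_this_rank(all_files):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    return sorted(all_files)[rank::world_size]  # round-robin assignment
```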

While this PR should perform reasonably well, the converter requires exactly the same machines/GPUs and sharding for the conversion as for training. The main blocker to converting on CPU machines is that torchtitan cannot run on CPU-only machines.

An alternative is to use fewer machines for the conversion than for training. This works, but an additional resharding will happen when the training job loads the checkpoint, which may not perform well depending on the resharding pattern.

**Future extensions**

  1. Directly read from Hugging Face without downloading the checkpoint first (will come in the next PR).
  2. While this converter is written for llama4, the logic can be generalized to other models with a few customized functions (e.g., FQN mapping; see the sketch after this list).
  3. Explore the possibility of performing the conversion with GPUs and still getting the correct sharding scheme.
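
As an example of the per-model customization in item 2, an FQN-mapping function could translate HF parameter names into torchtitan names. The patterns below are purely illustrative, not the real llama4 mapping:

```python
# Hypothetical FQN mapping; the real HF -> torchtitan name patterns differ.
import re


def hf_to_titan_fqn(hf_name: str) -> str:
    # e.g. "model.layers.3.self_attn.q_proj.weight" -> "layers.3.attention.wq.weight"
    name = hf_name.removeprefix("model.")
    name = re.sub(r"self_attn\.q_proj", "attention.wq", name)
    return name
```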

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 15, 2025
@fegin requested a review from wwwjn on Apr 15, 2025 at 23:29
@tianyu-l (Contributor) left a comment

I have merged something similar lol
https://github.com/pytorch/torchtitan/tree/main/torchtitan/experiments/llama4/scripts

(I realized that the REAME.md there is missing a "D")

@fegin (Contributor, Author) commented Apr 16, 2025

lol, okay, do we want to keep the one in experiments or actually have the ones in the main scripts?

@tianyu-l (Contributor) commented

I think we should move them to the local folder, as we are adding more models.
I even think we should move the current "main scripts" into the llama3 folder, lol

@fegin (Contributor, Author) commented Apr 16, 2025

Okay, since you already merged them, I'll turn this PR into one that fixes the issues. But I'll keep the PR description, since I would like to track the converter development in case we want to generalize it to other models.
