[WIP] Implement llama4 HF format to DCP converter #1104
Why do we need this?
There have been many asks to make HF checkpoints work with TorchTitan. Workarounds for this already exist, but the converted DCP checkpoints generally result in slow loading when they are used for subsequent training.
This PR addresses the problem by resharding the full weights into the exact sharding that will be used for training. In other words, we perform the resharding offline, at conversion time, to avoid the long loading time of resharding online, at training time.
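To make the idea concrete, here is a minimal sketch of what offline resharding can look like with DCP and DTensor. This is illustrative rather than the PR's actual code: it assumes the training model is already built with its final parallelism, and `convert_hf_to_dcp` / `hf_to_dcp_name` (a HF-to-DCP key mapping) are hypothetical names.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.tensor import DTensor, distribute_tensor
from safetensors.torch import load_file


def convert_hf_to_dcp(hf_files, output_dir, model):
    # The target state dict already carries the DTensor sharding that
    # training will use, so no resharding is needed at load time.
    state_dict = model.state_dict()
    for path in hf_files:
        for hf_name, full_tensor in load_file(path).items():
            dcp_name = hf_to_dcp_name(hf_name)  # hypothetical name mapping
            param = state_dict[dcp_name]
            if isinstance(param, DTensor):
                # Carve the full HF tensor into this rank's local shard,
                # matching the training-time placement.
                state_dict[dcp_name] = distribute_tensor(
                    full_tensor, param.device_mesh, param.placements
                )
            else:
                state_dict[dcp_name] = full_tensor
    # Each rank writes only its local shards to the DCP checkpoint.
    dcp.save(state_dict, checkpoint_id=output_dir)
```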
The converter also performs concurrent file loads across multiple trainers.
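For example, the concurrent loading could assign a disjoint subset of the HF safetensors files to each trainer rank. This is a hedged sketch of that idea; the PR's actual distribution strategy may differ.

```python
import torch.distributed as dist


def files_for_rank(all_files):
    # Round-robin assignment: rank r loads files r, r + world_size, ...
    # so file I/O happens concurrently across ranks. Tensors loaded by
    # one rank would still need to be made visible to the ranks that
    # own shards of them (e.g., via broadcast).
    rank, world_size = dist.get_rank(), dist.get_world_size()
    return all_files[rank::world_size]
```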
While this PR should perform reasonably well, the converter requires exactly the same machines/GPUs and sharding as the subsequent training run. The main blocker to converting on CPU machines is that torchtitan currently cannot run on CPU-only machines.
An alternative is to run the conversion on fewer machines than training uses. This works, but an additional resharding then happens when the training job loads the checkpoint, which may not perform well depending on the resharding pattern.
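For reference, the online resharding that this alternative falls back to is just a plain DCP load into the training-time state dict; DCP redistributes shards on the fly, which is where the extra load-time cost comes from (sketch, with a hypothetical checkpoint path):

```python
import torch.distributed.checkpoint as dcp

# Loading a checkpoint saved under a different sharding: dcp.load
# reshards on the fly ("online resharding"), trading load time for
# flexibility in the number of conversion machines.
state_dict = model.state_dict()  # training-time sharding
dcp.load(state_dict, checkpoint_id="outputs/converted_ckpt")  # hypothetical path
```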
Future extensions