
Mitigation to HuggingFace Trainer #824

Open
huyiwen opened this issue Feb 6, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

huyiwen commented Feb 6, 2025

A lot of people in the community use the HuggingFace Trainer for training, but sometimes it isn't flexible enough or is missing certain features (native TP/PP/EP, etc.). Migrating to Megatron-LM comes with a steep learning curve, and while TorchTitan is lighter, it still takes some effort to learn and doesn't yet fully support features like Flash Attention and Liger Kernel (correct me if I'm wrong).

One way to make TorchTitan more accessible could be to let some of its features, such as the different parallelisms, work with existing HuggingFace Trainer code with only minor tweaks. That way, more users might give it a try even if it doesn't yet support every training feature.
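A rough sketch of what such "minor tweaks" could look like: applying PyTorch's DTensor tensor-parallel API (the same building blocks torchtitan composes) to a Hugging Face model before training. The model name, mesh size, and submodule names below are assumptions for a Llama-style model, and HF Trainer's own wrapping would still need to cooperate with the TP mesh, which is exactly the integration gap being discussed here.

```python
# Hypothetical sketch (assumes 8 GPUs, a Llama-style module layout, and a
# launch via `torchrun --nproc_per_node=8 sketch.py`). Not a supported
# HF Trainer workflow today -- just an illustration of the idea.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from transformers import AutoModelForCausalLM

# One-dimensional tensor-parallel mesh over all 8 ranks.
tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tp",))

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Shard each decoder block's projections across the TP mesh.
# The submodule names are assumptions based on the Llama implementation.
for block in model.model.layers:
    parallelize_module(
        block,
        tp_mesh,
        {
            "self_attn.q_proj": ColwiseParallel(),
            "self_attn.k_proj": ColwiseParallel(),
            "self_attn.v_proj": ColwiseParallel(),
            "self_attn.o_proj": RowwiseParallel(),
            "mlp.gate_proj": ColwiseParallel(),
            "mlp.up_proj": ColwiseParallel(),
            "mlp.down_proj": RowwiseParallel(),
        },
    )

# In principle the sharded model could then go into the usual
# Trainer(model=model, args=...) loop; in practice Trainer's data-parallel
# wrapping would need to be aware of the TP mesh.
```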

tianyu-l commented Feb 6, 2025

@huyiwen Thanks for asking. This is something we are considering.

> One way to make TorchTitan more accessible could be to let some of its features, such as the different parallelisms, work with existing HuggingFace Trainer code with only minor tweaks.

I wonder if you could provide a more concrete list of requirements / to-do items in order to integrate with "HF Trainer"?

tianyu-l added the enhancement label on Feb 6, 2025

huyiwen commented Feb 7, 2025

I'm currently working on an MoE model and looking to implement expert parallelism. Writing EP/EP+TP/EP+DP from scratch with raw torch.distributed communication is pretty challenging, especially if I want good training speed. That's why using the parallelism support provided by DTensor and torchtitan seems like a solid option.

Beyond EP, it might also be a good idea to let people try out PP and TP without relying on Megatron-LM. That way, more people could train larger models with limited resources.
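A minimal sketch of what EP on top of DTensor could look like: stack the per-expert weights and shard them along the expert dimension of a 2D (ep, tp) device mesh. The mesh sizes and tensor shapes below are assumptions for illustration.

```python
# Hypothetical sketch (assumes 8 GPUs arranged as 4 EP x 2 TP groups;
# launch with `torchrun --nproc_per_node=8 sketch.py`). Shapes are made up.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("ep", "tp"))

num_experts, d_model, d_ff = 8, 1024, 4096
# Stacked first-layer weights of all experts: (experts, d_model, d_ff).
experts_w1 = torch.randn(num_experts, d_model, d_ff)

# Shard dim 0 (experts) across the "ep" mesh dim and dim 2 (hidden) across
# "tp": every rank holds 2 experts, each split column-wise in two.
dist_w1 = distribute_tensor(experts_w1, mesh, placements=[Shard(0), Shard(2)])
print(dist_w1.to_local().shape)  # torch.Size([2, 1024, 2048]) on every rank

# The hard part -- token routing and all-to-all dispatch between EP ranks --
# is what raw torch.distributed makes tedious to hand-roll, and is where
# reusing torchtitan/DTensor machinery would help.
```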

tianyu-l commented Feb 7, 2025

For the most part, torchtitan doesn't implement the parallelisms themselves -- the core parallelism code lives in PyTorch.

Is your request that "PyTorch should make its parallelisms easily usable in HF Trainer"? In fact, AFAIK HF already integrates several of PyTorch's parallelisms. cc: @kwen2501
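As one concrete example of an existing integration point (a sketch, with the wrapped layer class assumed for a Llama-style model), HF Trainer already exposes PyTorch FSDP through TrainingArguments:

```python
# Sketch of an existing integration point: PyTorch FSDP via HF Trainer flags.
# The layer class to wrap is an assumption for a Llama-style model.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fsdp="full_shard auto_wrap",  # PyTorch FullyShardedDataParallel
    fsdp_config={"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},
)
```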

Can you be more specific about how you would hope torchtitan to adapt?
