feat: add support for tensor parallel training workflow with accelerate #34194
Conversation
Such timing! I have a similar thought here. Shall we collaborate?
@kwen2501 Absolutely, please let me know how you want to take this forward. Thank you.
@ArthurZucker @muellerzr Since the accelerate PR (huggingface/accelerate#3173) is merged, requesting review and merge of this PR, which would allow for a complete e2e training workflow using tensor parallelism. Thank you.
SGTM! Could you also update the docs (Efficient Training on Multiple GPUs doc / Trainer doc)?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@SunMarc Have added some documentation. I will also add a documentation page in accelerate and then reference it back here in a separate PR. For now, I have kept the documentation in this PR self-contained for merge and use. Thanks. cc: @muellerzr
@SunMarc The failing tests look unrelated to this PR and seem to happen after I rebased with the main branch. Thank you.
@SunMarc @muellerzr I have a follow-up PR as well: #36132, FYA. Thank you.
@SunMarc For some reason, even the recently merged commits are failing these two CI tests.
This looks great! One small comment I'd like addressed so users aren't confused by things breaking (and so we can detect issues early), and then it's good to go on my end.
src/transformers/training_args.py (outdated)

```python
if self.tp_size > 1:
    os.environ["ACCELERATE_USE_TP"] = "true"
    os.environ["TP_SIZE"] = str(self.tp_size)
```
Here we should also guard the accelerate version required to use TP, can you add this please? :)
Have added this guard. Thank you for noticing it.
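For reference, a minimal sketch of what such a guard could look like, using importlib.metadata and packaging; the exact helper used in the PR may differ:

```python
import importlib.metadata

from packaging import version

# Assumed minimum version, matching the "accelerate>1.3.0" requirement discussed in this thread.
MIN_ACCELERATE_VERSION_FOR_TP = "1.3.0"


def _check_accelerate_supports_tp() -> None:
    """Raise early if the installed accelerate is too old for tensor parallelism."""
    installed = version.parse(importlib.metadata.version("accelerate"))
    if installed <= version.parse(MIN_ACCELERATE_VERSION_FOR_TP):
        raise ImportError(
            f"tp_size > 1 requires accelerate > {MIN_ACCELERATE_VERSION_FOR_TP}, "
            f"but found {installed}. Please upgrade accelerate."
        )
```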
Can we push for a merge? Thank you.
```python
if self.args.tp_size > 1:
    self.is_tp_enabled = True
    if version.parse(accelerate_version) > version.parse("1.3.0"):
        args["torch_tp_plugin"] = TorchTensorParallelPlugin(tp_size=self.args.tp_size)
```
I am not familiar with this API, so we need some documentation about what this uses under the hood! Also, we could check if the model supports TP, or is it not even needed?
> Also, we could check if the model supports TP, or is it not even needed?
Hi @ArthurZucker, we check this in accelerate here: https://github.com/huggingface/accelerate/blob/526925b48c07d997cdd9bf5911f659ca45778915/src/accelerate/accelerator.py#L1511
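For illustration, a transformers-side check along those lines could look roughly like this, assuming the model's config exposes its plan under an attribute like `base_model_tp_plan` (a sketch, not the exact accelerate code linked above):

```python
from transformers import AutoConfig


def model_supports_tp(model_name_or_path: str) -> bool:
    """Return True if the model's config declares a tensor-parallel plan."""
    config = AutoConfig.from_pretrained(model_name_or_path)
    return getattr(config, "base_model_tp_plan", None) is not None


# e.g. model_supports_tp("meta-llama/Llama-3.1-8B")  # Llama-family configs define a TP plan
```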
> I am not familiar with this API, so we need some documentation about what this uses under the hood!
Do you want me to add that as a comment above this piece of code, or could you point me to a place in the HF docs where you want me to add it? Thank you.
It can be added to the documentation for the TP feature!
But interesting, I did not know accelerate already supported it!
@ArthurZucker We added the support to accelerate in this PR: huggingface/accelerate#3173
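For readers wondering what sits under the hood, this is roughly the raw PyTorch tensor-parallel API that such support builds on: a standalone sketch with a toy module (not the accelerate implementation), meant to be launched under torchrun with one process per GPU:

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module


# Toy two-layer MLP to shard across GPUs (placeholder model for illustration).
class ToyMLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


# Build a 1-D device mesh over all local GPUs (here tp_size equals the world size).
tp_mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))

# Shard the first linear column-wise and the second row-wise, as in a typical TP plan.
model = ToyMLP().cuda()
model = parallelize_module(
    model,
    tp_mesh,
    {"up": ColwiseParallel(), "down": RowwiseParallel()},
)
```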
Very much welcome otherwise!!!
LGTM otherwise. A note in the TP documentation saying that training with it is supported would be nice as well! Documenting features is key!
src/transformers/training_args.py (outdated)

```python
tp_size (`int`, *optional*):
    Use tp_size to enable PyTorch tensor parallelism. Set a value greater than 1 to activate TP. The same is
    used to prepare device mesh internally. Requires accelerate>1.3.0.
```
We should explain that this uses the model's `base_tp_plan` and is not available for all models, etc.!
@ArthurZucker Added this fix, thank you.
@ArthurZucker I have added and updated docs wherever needed for TP; requesting your review. Thank you.
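For anyone landing here, a minimal sketch of the end-user flow this PR enables: the script name, model, dataset, and hyperparameters are placeholders, and the launch command in the comment assumes a standard torchrun setup:

```python
# Launch with, e.g.:  torchrun --nproc-per-node=2 train_tp.py   (hypothetical script name)
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; any model that declares a TP plan
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    tp_size=2,  # the new argument from this PR: a value > 1 enables tensor parallelism
)

# Tiny text dataset, tokenized for causal LM training (placeholder choice).
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()
```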
@ArthurZucker The failing tests do not seem to be related to this PR.
@ArthurZucker @muellerzr @SunMarc Can we go ahead with the merge?
Hi @kmehant, thank you very much for the much-needed PR! Just want to clarify: aside from setting the …
@bursteratom Setting …
@kmehant Thank you for your response! I am asking because I ran into …
@bursteratom Right, apologies for not covering the limitations in this PR. Essentially this is happening due to gradient clipping, which is not supported yet (a PR is in place to fix this: #36132). So a workaround for this would be to set …
@kmehant Thank you so much for the tip and your amazing work! Looking forward to your fix getting merged!
@kmehant Apologies for the repeated ping. I was able to train with tensor parallel successfully in accelerate, but ran into …
feat: add support for tensor parallel training workflow with accelerate (huggingface#34194)

* feat: add support for tensor parallel flow using accelerate
* fix: add tp degree to env variable
* fix: add version check for accelerate to allow TP
* docs: tensor parallelism
* nit: rename plugin name
* fix: guard accelerate version before allow tp
* docs: add more docs and updates related to TP

Signed-off-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Marc Sun <[email protected]>
What does this PR do?
1. Add the `apply_tensor_parallel` API to apply a TP plan to Llama and Granite models (already merged into HF/transformers).
2. Introduce the `tp_size` user-facing argument, to be further consumed by accelerate (see huggingface/accelerate#3173).
3. Allows for e2e TP training.
Please review in conjunction with huggingface/accelerate#3173
Fixes #32470
Results
We see significant improvements in both memory and throughput compared against single-GPU training and FSDP, across different settings (checkpointing on/off) and context lengths.
Note: Please be aware that the effective TPS for FSDP is multiplicative in the parallel factor (the number of GPUs/devices engaged in distributed training), whereas that is not the case with TP. For example, if each of 4 GPUs reports 1,000 TPS under FSDP, the effective throughput is roughly 4,000 TPS, while under TP the 4 GPUs jointly process a single stream, so the reported figure is already the effective one. Therefore, when effective throughput is considered, FSDP can be better than TP in terms of throughput. However, that may be compensated for by increasing the batch size to utilize the memory gains, etc.
Benchmarks were done on two models. The tables below show the max CUDA memory and throughput for various configurations, demonstrating the potential of the TP support contributed in this PR. There are gains in both memory and throughput.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
I have cycles to bring more improvements on top of this PR to add full PyTorch TP support to HF. Looking forward to it. Thank you.