
feat: Add Hyperpod Optimum-neuron LoRA example #631


Merged
11 commits merged into aws-samples:main on May 5, 2025

Conversation

Captainia
Contributor

Issue #, if available: N/A, new feature

Description of changes:

This PR adds an example to the test_cases, using the Hugging Face optimum-neuron library for PEFT fine-tuning. optimum-neuron is the interface between the Hugging Face Transformers library and AWS accelerators, including AWS Trainium and AWS Inferentia. It provides a set of tools enabling easy model loading, training, and inference in single- and multi-accelerator settings for different downstream tasks.

The example uses the HyperPod EKS environment, and the code will be used in the HyperPod workshop.
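For readers who have not used the library before, here is a minimal sketch of what LoRA (PEFT) fine-tuning with optimum-neuron looks like. It is not the code added by this PR: the model ID, dataset, and hyperparameters are placeholders, and recent optimum-neuron releases also provide an SFT trainer that accepts a LoRA config directly instead of wrapping the model with peft by hand.

```python
# Minimal, illustrative sketch of LoRA (PEFT) fine-tuning with optimum-neuron.
# NOT the training script added by this PR: the model ID, dataset,
# hyperparameters, and output path below are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA: train small low-rank adapter matrices instead of the full parameter set.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder dataset; the actual example may use a different one.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"{example['instruction']}\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

train_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# NeuronTrainingArguments / NeuronTrainer mirror the transformers Trainer API
# but compile and run the training loop on Trainium (trn1) devices.
training_args = NeuronTrainingArguments(
    output_dir="./llama3-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    bf16=True,
)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```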

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@mhuguesaws mhuguesaws left a comment


Thank you for your contribution.

Could you please:

  1. Join the AWS organization on GitHub.
  2. Organize the folder as pytorch/fine-tuning/optimum-neuron.
  3. Add the Kubernetes manifest into a kubernetes folder. The repo structure is organized by scheduler.

@Captainia
Contributor Author


Thank you for the quick review!

  1. I am in the AWS organizations https://github.com/aws and https://github.com/awslabs. Is there another one I should join?
  2, 3. There will be another continuous pre-training example under pytorch/optimum-neuron/llama3/kubernetes/continuous-pre-training, so I currently organized this one as pytorch/optimum-neuron/llama3/kubernetes/fine-tuning, following neuronx-distributed. Does this sound good?

@mhuguesaws
Contributor


The aws-samples organization.

2 and 3 are good.


# # Update Neuron Compiler and Framework
RUN python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.1.* torchvision
RUN python -m pip install --upgrade neuronx-distributed neuronx-distributed-training
Contributor

Can we pin the library versions? A new release could break compatibility.

Contributor Author

Thanks Jianying, I have pinned the dependencies in this CR. Ideally, once this PR is merged and released, we will switch to the optimum-neuron 0.1.0 SageMaker image: https://github.com/aws/deep-learning-containers/pull/4670/files#diff-0f776bad437279bcc3d6005ec1b29170f0b4e53dfbc5c3c234fb202817f707a3. It has the required dependencies too.
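For illustration, pinning here means replacing the wildcard constraints in the Dockerfile with exact versions. The version numbers below are placeholders, not necessarily the ones pinned in this CR; take the actual versions from the Neuron SDK release notes that match the base image.

```dockerfile
# Illustrative only: the version numbers are placeholders; use the versions
# listed in the Neuron SDK release notes for the base image you build from.
RUN python -m pip install --no-cache-dir \
        neuronx-cc==2.15.128.0 \
        torch-neuronx==2.1.2.2.3.0 \
        torchvision==0.16.2
RUN python -m pip install --no-cache-dir \
        neuronx-distributed==0.9.0 \
        neuronx-distributed-training==1.0.1
```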

Collaborator

@KeitaW KeitaW left a comment


@Captainia
Contributor Author


Thank you, I have been added to the aws-samples organization.

@Captainia Captainia requested review from KeitaW and mhuguesaws April 14, 2025 13:59
@KeitaW KeitaW added the enhancement and New model labels Apr 19, 2025
@KeitaW KeitaW removed the enhancement label Apr 19, 2025
Contributor

@jianyinglangaws jianyinglangaws left a comment


Looks good to me!

Captainia

This comment was marked as duplicate.

@mhuguesaws mhuguesaws removed their request for review May 5, 2025 14:44
@mhuguesaws mhuguesaws dismissed their stale review May 5, 2025 14:45

Not needed.

@KeitaW KeitaW merged commit fbe5373 into aws-samples:main May 5, 2025
Captainia added a commit to Captainia/awsome-distributed-training that referenced this pull request May 7, 2025
* Add Hyperpod Optimum-neuron LoRA example

* fix README

* restructure files

* fix

* Update to use newer version of peft and llama 3.8

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* pin dependencies and address comments

* fix

* switch model, remove HF token, update compile steps

---------

Co-authored-by: Keita Watanabe <[email protected]>