Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2419

aishwaryaraimule21 · 2025-02-05T15:01:19Z

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

review-notebook-app · 2025-02-05T15:01:25Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich

Thank you for this effort @aishwaryaraimule21!
I am fine with merging this KFP example.
Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?

andreyvelich · 2025-02-15T00:20:02Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+    "    )\n",
+    "    \n",
+    "    # check the status of the job\n",
+    "    from kubeflow.pytorchjob import PyTorchJobClient\n",


Should you use TrainingClient here?

aishwaryaraimule21 · 2025-02-15

Updated the PR. It now uses TrainingClient().get_job_conditions() to fetch the job status.
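
A minimal sketch of how a job status could be derived from the conditions list, assuming each condition exposes `type` and `status` attributes and that `status` is the string "True" when the condition holds. The helper names `latest_condition` and `fetch_job_status` and the default namespace are hypothetical and are not taken from the notebook. The `kubeflow.training` import is deferred so the helper can be tried without a cluster.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def latest_condition(conditions: Iterable) -> Optional[str]:
    """Return the type of the last condition whose status is "True", or None."""
    active = [c.type for c in conditions if str(c.status) == "True"]
    return active[-1] if active else None


def fetch_job_status(name: str, namespace: str = "default") -> Optional[str]:
    """Look up a PyTorchJob's conditions (needs kubeflow-training and cluster access)."""
    # Imported lazily: only required when actually talking to a cluster.
    from kubeflow.training import TrainingClient

    conditions = TrainingClient().get_job_conditions(
        name=name, namespace=namespace, job_kind="PyTorchJob"
    )
    return latest_condition(conditions)


# Offline illustration with stand-in condition objects (hypothetical data).
example = [
    SimpleNamespace(type="Created", status="True"),
    SimpleNamespace(type="Running", status="False"),
    SimpleNamespace(type="Succeeded", status="True"),
]
print(latest_condition(example))
```

Keeping the condition parsing separate from the client call makes the status logic easy to check before running the pipeline.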

Electronic-Waste · 2025-02-15T06:25:32Z

I have no objections:)

google-oss-prow · 2025-02-15T16:03:33Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y · 2025-02-15T16:23:47Z

> Thank you for this effort @aishwaryaraimule21! I am fine with merging this KFP example. Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?

In that case, what is the relationship to the Training examples in the KFP repository, such as https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher?

andreyvelich · 2025-02-15T19:40:22Z

>> Thank you for this effort @aishwaryaraimule21! I am fine with merging this KFP example. Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?
>
> In that case, what is the relationship to the Training examples in the KFP repository, such as https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher?

I think you can either use the PyTorch launcher or use the kubeflow-training SDK directly in a lightweight KFP component.
It is up to the user to decide.

tenzen-y · 2025-02-15T19:53:46Z

>>> Thank you for this effort @aishwaryaraimule21! I am fine with merging this KFP example. Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?
>>
>> In that case, what is the relationship to the Training examples in the KFP repository, such as https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher?
>
> I think you can either use the PyTorch launcher or use the kubeflow-training SDK directly in a lightweight KFP component. It is up to the user to decide.

SGTM.
It would be great if we could provide comprehensive examples after we release the consolidated SDK (I know the first version of the SDK will contain only Katib and Trainer features).

andreyvelich · 2025-02-17T15:18:09Z

@aishwaryaraimule21 Can you please sign the DCO?

Signed-off-by: aishwarya.raimule <[email protected]>

aishwaryaraimule21 · 2025-02-18T07:34:57Z

@andreyvelich I have signed the DCO. Please check. Thanks.

astefanutti · 2025-02-18T08:25:34Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+    "\n",
+    "In this component, use TrainingClient() to create PyTorchJob which will fine-tune Llama3 model on 1 worker with 1 GPU.\n",
+    "\n",
+    "Specify the required packages in the *dsl.component* decorator. We would need kubeflow-pytorchjob, kubeflow-training[huggingface] and numpy packages in this Kubeflow component.\n",


Is kubeflow-pytorchjob really necessary since TrainingClient is used now?

astefanutti · 2025-02-18T08:26:00Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@dsl.component(packages_to_install=['kubeflow-pytorchjob', 'kubeflow-training[huggingface]','numpy<1.24'])\n",


Ditto, is kubeflow-pytorchjob really necessary since TrainingClient is used now?
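
For illustration, the decorator with the redundant package dropped might look like the sketch below. It assumes that `kubeflow-training[huggingface]` alone provides `TrainingClient`. The constant `PACKAGES_TO_INSTALL` and the wrapper `as_kfp_component` are hypothetical names, and `kfp` is imported lazily so the package list can be inspected without it installed.

```python
from typing import Callable, List

# Packages installed into the component's container at runtime.
# 'kubeflow-pytorchjob' is omitted on the assumption that TrainingClient
# from kubeflow-training covers both job creation and status checks.
PACKAGES_TO_INSTALL: List[str] = ["kubeflow-training[huggingface]", "numpy<1.24"]


def as_kfp_component(func: Callable):
    """Wrap func as a lightweight KFP component (requires the kfp package)."""
    from kfp import dsl  # lazy import: only needed when building the pipeline

    return dsl.component(packages_to_install=PACKAGES_TO_INSTALL)(func)
```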

astefanutti · 2025-02-18T08:27:28Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+    "       ),\n",
+    "       # it is assumed for text related tasks, you have 'text' column in the dataset.\n",
+    "       # for more info on how dataset is loaded check load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py\n",
+    "       dataset_provider_parameters=HuggingFaceDatasetParams(repo_id=\"aishwaryayyy/events_data\"),\n",


It would be better to remove the dependency on a user-specific repository.

aishwaryaraimule21 · 2025-02-20

@astefanutti may I ask why?
Do you recommend using something like https://huggingface.co/datasets/Yelp/yelp_review_full for the example?

astefanutti · 2025-02-18T08:29:16Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+    "       name=\"llama-3-1-8b-kubecon\",\n",
+    "       num_workers=1,\n",
+    "       num_procs_per_worker=1,\n",
+    "       # specify the storage class if you don't want to use the default one for the storage-initializer PVC\n",


It would be useful to mention that a provisioner capable of provisioning RWX PVCs is needed when distributing the training across multiple nodes / workers.
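
As a rough sketch of that note, the storage settings passed to `train()` could be chosen from the worker count. The `storage_class` key appears in the notebook; the `size` and `access_modes` keys are assumed to be accepted by the SDK's `storage_config` argument and should be checked against the installed version. The helper name `build_storage_config`, the default size, and the `nfs-storage` class are illustrative only.

```python
from typing import Dict, List, Optional, Union

StorageConfig = Dict[str, Optional[Union[str, List[str]]]]


def build_storage_config(
    num_workers: int,
    storage_class: Optional[str] = "nfs-storage",
    size: str = "10Gi",
) -> StorageConfig:
    """Pick PVC access modes based on how many workers share the volume."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # A single worker can mount the PVC ReadWriteOnce. Several workers that may
    # land on different nodes need ReadWriteMany (RWX), which the storage
    # class's provisioner must support (an NFS-backed class, for example).
    access_modes = ["ReadWriteOnce"] if num_workers == 1 else ["ReadWriteMany"]
    return {
        "size": size,
        "storage_class": storage_class,
        "access_modes": access_modes,
    }


print(build_storage_config(num_workers=2))
```

The resulting dictionary would be passed as `storage_config=...` alongside `num_workers`, so the two settings cannot drift apart.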

astefanutti · 2025-02-18T08:30:23Z

examples/pytorch/text-classification/Fine-Tune-Llama3-LLM.ipynb

+    "           \"storage_class\": \"nfs-storage\",\n",
+    "       },\n",
+    "       model_provider_parameters=HuggingFaceModelParams(\n",
+    "           model_uri=\"hf://meta-llama/Llama-3.1-8B-Instruct\",\n",


Should we cover the distributed training case, and provide the configuration so the model does not get downloaded on each local node / worker?

coveralls · 2025-02-18T13:25:07Z

Pull Request Test Coverage Report for Build 13375853453

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 13314191840:	0.0%
Covered Lines:	85
Relevant Lines:	85

💛 - Coveralls

google-oss-prow bot requested review from jinchihe and kuizhiqing February 5, 2025 15:01

google-oss-prow bot added the size/L label Feb 5, 2025

andreyvelich reviewed Feb 15, 2025

aishwaryaraimule21 force-pushed the finetune-llama3-llm branch from 891bb0c to d62081a Compare February 17, 2025 17:32

aishwaryaraimule21 added 5 commits February 17, 2025 23:05

Added notebook to fine-tune llama3 llm

11c7bd5

Signed-off-by: aishwarya.raimule <[email protected]>

changed cell language to Python

d9ea540

Signed-off-by: aishwarya.raimule <[email protected]>

use TrainingClient to fetch job status

e26a729

Signed-off-by: aishwarya.raimule <[email protected]>

renamed storage_class

9664bed

Signed-off-by: aishwarya.raimule <[email protected]>

shortened pipeline name to avoid issues in pod naming

b65cfbf

Signed-off-by: aishwarya.raimule <[email protected]>

aishwaryaraimule21 force-pushed the finetune-llama3-llm branch from d62081a to b65cfbf Compare February 17, 2025 17:35

astefanutti reviewed Feb 18, 2025

removed kubeflow-pytorchjob dependency

eeb7b3b
