Adjust the README file to cater to users #2440

Open
wants to merge 8 commits into
base: master
from
23 changes: 19 additions & 4 deletions README.md
@@ -11,6 +11,8 @@

## Overview

> **Warning**: Kubeflow Trainer is currently in **alpha** status, and APIs may change. If you want to use stable release of Kubeflow Training Operator V1, please check [this section](#kubeflow-training-operator-v1).

Member

@tenzen-y, Feb 17, 2025

Suggested change

Original:

> **Warning**: Kubeflow Trainer is currently in **alpha** status, and APIs may change. If you want to use stable release of Kubeflow Training Operator V1, please check [this section](#kubeflow-training-operator-v1).

Suggested:

> [!WARNING]
> Kubeflow Trainer is currently in **alpha** status, and APIs may change. If you want to use stable release of Kubeflow Training Operator V1, please check [this section](#kubeflow-training-operator-v1).

This is the GitHub Markdown alert syntax: https://github.com/orgs/community/discussions/16925
Anyway, do we really want to call this out to users? @kubeflow/wg-training-leads @astefanutti
If yes, when can we remove this warning? @Electronic-Waste Do you have a specific schedule for removing this warning?
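
As a minimal sketch of the suggested change, the rewrite from the bold-label form to GitHub's alert syntax can be done mechanically. The function name `to_github_alert` is hypothetical and not part of this PR; it only handles the single-line `> **Warning**: ...` form shown in the diff, and it assumes a POSIX `awk` is available.

```bash
# Hypothetical helper: rewrite "> **Warning**: text" lines read from stdin
# into GitHub alert syntax; all other lines pass through unchanged.
to_github_alert() {
  awk '
    /^> \*\*Warning\*\*: / {
      sub(/^> \*\*Warning\*\*: /, "")   # drop the old bold-label prefix
      print "> [!WARNING]"              # alert marker on its own quoted line
      print "> " $0                     # original message as the alert body
      next
    }
    { print }
  '
}
```

It could be used as `to_github_alert < README.md > README.new.md`, followed by a manual review of the result.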

Member Author

@Electronic-Waste, Feb 17, 2025

Currently, we are implementing Trainer according to KEP-2170. It is unstable and not yet ready for production, so I think it's necessary to remind users of this. They can use the stable release of Training Operator V1 instead, at least until we release the first stable version of Trainer.

As for the removal timeline, I think probably we can remove this warning when the first stable release of Trainer is ready, maybe v2.0.0. It means that we are production-ready.

Member

> As for the removal timeline, I think probably we can remove this warning when the first stable release of Trainer is ready, maybe v2.0.0. It means that we are production-ready.

We will cut v2.0.0 after the MPI implementation, so I'm not sure why we could call v2.0.0 production-ready.

Contributor

Maybe we can turn the warning into a note. And also rephrase it in a way that says the v2 APIs are still subject to change.

Member Author

I'm not sure about the v2.0.0. I mean we can remove this warning when we are production ready.

Member

> Maybe we can turn the warning into a note. And also rephrase it in a way that says the v2 APIs are still subject to change.

+1

Member

> I'm not sure about the v2.0.0. I mean we can remove this warning when we are production ready.

Here, what counts as production-ready matters. If we do not decide on graduation criteria for production readiness, we lose the timeline for it, and we keep saying "this is not production ready, so any kind of breakage should be fine". And then users will leave this project.

At the very least, we should define production-readiness criteria.

Member Author

> Maybe we can turn the warning into a note. And also rephrase it in a way that says the v2 APIs are still subject to change.

SGTM.

Member Author

@Electronic-Waste, Feb 17, 2025

> Here, what counts as production-ready matters.

Yeah, that's true. We should discuss it further in a dedicated issue for graduation criteria. And maybe we could add it as an agenda item for the next WG AutoML/Training community call.

/cc @andreyvelich


Kubeflow Trainer is a Kubernetes-native project designed for large language models (LLMs)
fine-tuning and enabling scalable, distributed training of machine learning (ML) models across
various frameworks, including PyTorch, JAX, TensorFlow, and others.
@@ -35,9 +37,18 @@ The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflo

## Getting Started

Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/trainer/getting-started)
You can simply run these commands to install Kubeflow Trainer if your Kubernetes cluster is ready:

```bash
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master"
```

Comment on lines +44 to +45

Member

@Electronic-Waste Previously, we discussed with @rimolive, @varodrig, and @kubeflow/wg-training-leads that we want to have a single source of truth for the Kubeflow Trainer docs, since it is hard to keep all of these docs up to date.

Thus, we just redirect users to the Kubeflow website for the installation steps: https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager

Member Author

From the user's side, they want to install Kubeflow Trainer quickly. So we'd better put the installation guide in the README so that users can get started quickly, as many other OSS projects do, including Katib:

They also have detailed guidance on the website. So I think it is probably a better choice to put some installation commands in the README.

Member Author

Currently, users need to click "the official Kubeflow documentation" -> "the installation guide" -> scroll down to see the installation commands. I think the guide is buried too deep for users. They may prefer a straightforward way, and search the official documentation only if the straightforward way does not work.

Member

@andreyvelich, Feb 17, 2025

Not really, since we directly link the installation guide from the README: https://github.com/kubeflow/trainer?tab=readme-ov-file#getting-started.
Should we just provide another link to the operator guide from the README: https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager ?

Member Author

@Electronic-Waste, Feb 17, 2025

> Should we just provide another link to the operator guide from the README: https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager ?

It's a better choice than the current one. But from the user's perspective, they prefer to install Trainer directly with a few simple commands listed in the README, which is more straightforward, and to look up the details only if they want a customized installation (such as installing only the manager).

Member

Can we talk about it at the next Training WG call, please?
I want to see what the right solution is for us moving forward.
/hold

Member Author

> Can we talk about it at the next Training WG call please ?
> I want to see what is the right solution for us moving forward.

Yeah, of course.
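
As a sketch of the customized installation mentioned in this thread (for example, installing only the manager): the two kustomize URLs in the diff differ only in the overlay name and the `ref` query parameter. The helper name `trainer_kustomize_url` is hypothetical, the URL pattern is copied from the install commands in this diff, and passing a ref other than `master` is an assumption about how one might pin a version, not something this PR proposes.

```bash
# Hypothetical helper: build the kustomize URL for a Kubeflow Trainer overlay.
# The URL pattern is taken from the install commands in this diff.
trainer_kustomize_url() {
  local overlay="$1"        # "manager" or "runtimes", as used in the diff
  local ref="${2:-master}"  # Git ref; defaults to master, as in the diff
  printf 'https://github.com/kubeflow/trainer.git/manifests/overlays/%s?ref=%s\n' \
    "$overlay" "$ref"
}

# Install only the controller manager (requires kubectl and a ready cluster):
# kubectl apply --server-side -k "$(trainer_kustomize_url manager)"
```

Keeping the overlay and ref as parameters makes it explicit which parts of the command a user would change for a partial or pinned installation.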

For more details, please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/trainer/getting-started)
to install and get started with Kubeflow Trainer.

If you are using Kubeflow Training Operator V1, please refer [to this migration document](/docs/components/trainer/operator-guides/migration).

## Community

The following links provide information on how to get involved in the community:
@@ -56,12 +67,16 @@ Please refer to the [CHANGELOG](CHANGELOG.md).

## Kubeflow Training Operator V1

Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change.
If you are using Kubeflow Training Operator V1, please refer [to this migration document](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).
Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change. You can install the stable release of the Kubeflow Training Operator V1 with:

```bash
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.0"
```

For more details, please check [this guide](https://www.kubeflow.org/docs/components/trainer/legacy-v1/installation/) to install and get started with Kubeflow Training Operator V1.

Kubeflow Community will maintain the Training Operator V1 source code at
[the `release-1.9` branch](https://github.com/kubeflow/training-operator/tree/release-1.9).

You can find the documentation for Kubeflow Training Operator V1 in [these guides](https://www.kubeflow.org/docs/components/trainer/legacy-v1).

## Acknowledgement