
Create GitHub Repository for Kubeflow Trainer #2402

Open
andreyvelich opened this issue Jan 23, 2025 · 33 comments

Comments

@andreyvelich
Member

andreyvelich commented Jan 23, 2025

At the latest AutoML and Training WG call, we discussed how we can create a new GitHub repository and release the Kubeflow Trainer. We want to keep the v1alpha1 API version for TrainJob and TrainingRuntime, and introduce a new kubeflow Python SDK starting from version 0.1.0.

(Updated 2025-01-27.) After discussions, we decided to move forward with Option 4.

We explored four options for the Kubeflow Trainer project:

Option 1. Migrate the kubeflow/training-operator to a new repository.

Steps:

  • Migrate all branches and tags from kubeflow/training-operator to kubeflow/training-operator-lts
  • [Breaking Change] Tell users to add this line to their Go projects if they use the kubeflow/training-operator Go modules:
replace github.com/kubeflow/training-operator v1.9.0 => github.com/kubeflow/training-operator-lts v1.9.0
  • Rename kubeflow/training-operator to kubeflow/trainer.
  • Remove V1 code from the master branch of kubeflow/trainer.
  • Release the first version of kubeflow/trainer on the release-0.1 branch with the v0.1.0 tag (we should delete the existing v0.1 branch and v0.1.0 tag).
  • After a grace period, delete all V1 GitHub releases, tags, and branches from the kubeflow/trainer repository.

Pros:

  • Preserves repository stars and history.
  • Starts Kubeflow Trainer releases with the v0.1.0 version (after deleting the existing v0.1 branch and v0.1.0 tag, as noted above).
  • We can make major and minor releases for the Kubeflow Training Operator project in the kubeflow/training-operator-lts repository, for example a v1.10.0 release.

Cons:

  • Breaking change for users who depend on the kubeflow/training-operator Go modules.

Option 2. Create a new repository kubeflow/trainer

Steps:

  • Create a brand-new GitHub repository for the Kubeflow Trainer project.
  • Release the first version of kubeflow/trainer on the release-0.1 branch with the v0.1.0 tag.

Pros:

  • No breaking change for users who depend on kubeflow/training-operator Go modules.

Cons:

  • Loses the 7.5-year history of the Kubeflow Training Operator project.
  • May confuse users about whether Kubeflow Trainer is a new or updated version of the Kubeflow Training Operator project.

Option 3. Start kubeflow/trainer with v1.10.0 release.

Steps:

  • Rename kubeflow/training-operator to kubeflow/trainer
  • Remove the V1 code from the master branch of kubeflow/trainer.
  • Release the first version of kubeflow/trainer using the release-1.10 branch with the v1.10.0 tag.

Pros:

  • No breaking change for users who depend on the kubeflow/training-operator Go modules.

Cons:

  • Users may be confused about why Kubeflow Trainer and the kubeflow SDK start with version v1.10.
  • We can't release a new minor version for the Kubeflow Training Operator (e.g. v1.10.0).

Option 4 (updated 2025-01-27). Separate Kubeflow Trainer control plane from client SDK.

Steps:

  • Rename kubeflow/training-operator to kubeflow/trainer
  • Remove the V1 code from the master branch of kubeflow/trainer.
  • Release the first version of the Kubeflow Trainer components as v2.0.0 on the release-2.0 branch:
docker.io/kubeflow/trainer-controller-manager:v2.0.0
docker.io/kubeflow/dataset-initializer:v2.0.0
docker.io/kubeflow/model-initializer:v2.0.0
docker.io/kubeflow/llm-trainer:v2.0.0
  • Update the SDK version on GitHub to v0.1.0 as an initial release.
  • Tell users to install the kubeflow SDK from the kubeflow/trainer GitHub repository temporarily (see the sketch after this list).
  • Begin drafting a design document for a new GitHub repository: kubeflow/sdk. This repository will host the Kubeflow Python client and potentially expand to include clients for other languages in the future (e.g., Rust, Swift, Java). The design document should outline how Kubernetes versions and other project dependencies will be managed effectively.
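As a rough illustration of the last two steps, installing and using the SDK could look like the sketch below. The pip command is the one proposed later in this thread; the TrainerClient().train() call mirrors the SDK example further down, and any parameters it takes are still to be defined, so treat this as an assumption rather than the final interface.

# Temporary install path until kubeflow/sdk exists (assumes the SDK stays in the sdk/ subdirectory):
#   pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

from kubeflow.trainer import TrainerClient

# Submit a TrainJob through the v1alpha1 API; the exact parameters of train()
# are illustrative only and may change once the SDK design doc is finalized.
TrainerClient().train()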

Pros:

  • We don't introduce any Breaking Changes to users of kubeflow/training-operator.
  • We can release new minor versions for Training Operator (e.g. release-1.10).
  • It is clear that V2 is the next major version for existing users of Kubeflow Training Operator.
  • The Kubeflow Trainer control plane and the Kubeflow client SDK have different release lifecycles. Thus, we can release the SDK to PyPI without releasing the Kubeflow Trainer control plane.
  • We can expand the Kubeflow SDK to support other Kubeflow APIs (e.g. Optimizer, Evaluator, etc.)

Cons:

  • It is challenging to manage test infra between kubeflow/trainer and kubeflow/sdk.
    However, I think we can deal with that if we keep examples in the Kubeflow Trainer GitHub repository and use these examples to run E2E tests in both kubeflow/trainer and kubeflow/sdk.
    Also, we can explore other options to improve our test coverage.

Personally, I prefer Option 4 or Option 1.

Please let us know if we have other ideas @kubeflow/wg-training-leads @kubeflow/release-team @astefanutti @kannon92 @ahg-g @kubeflow/kubeflow-steering-committee @thesuperzapper @kubeflow/wg-manifests-leads @franciscojavierarceo @Electronic-Waste @seanlaii @deepanker13 @saileshd1402 @vsoch @shravan-achar @akshaychitneni @helenxie-bit @kubeflow/release-managers @zijianjoy @james-jwu

@andreyvelich
Member Author

cc @akgraner @chasecadet

@Electronic-Waste
Member

I would also support Option 1, since it preserves the history of the project and makes the v1alpha1 TrainJob API straightforward for users.

However, the v0.1.0 release conflicts with an existing release: https://github.com/kubeflow/tf-operator/tree/v0.1.0. We may need to deal with it.

@andreyvelich
Member Author

However, the v0.1.0 release conflicts with an existing release: https://github.com/kubeflow/tf-operator/tree/v0.1.0. We may need to deal with it.

Good point! I added this to the Option 1 steps.

@chasecadet

I like option 1.

@franciscojavierarceo
Contributor

franciscojavierarceo commented Jan 24, 2025

For Option 1, can we migrate the releases to the kubeflow/training-operator-lts repo? My preference would be to retain the release history on GitHub (even if we redirect users to the new repo). We should also be aware that we'll want to update PyPI.

@astefanutti
Contributor

astefanutti commented Jan 24, 2025

I'd also favor Option 1 as it brings the best outcomes. There might be an additional impact / con for projects that maintain a downstream fork of the kubeflow/training-operator repository, as they'll have to "re-route" it to kubeflow/training-operator-lts.

Just for completeness of the solution space, there could be an option 4 where the repository would be renamed to kubeflow/trainer and the operator and SDK versions would start at 2.x (the TrainJob and TrainingRuntime APIs would still start at v1alpha1 and follow the usual graduation). The only con would then mainly be that "2.x" start. Given that both the "v2" operator and SDK build on v1, that could be deemed acceptable.

@astefanutti
Contributor

For option 1, it seems it might be possible to copy/transfer the v1 releases over to the new kubeflow/training-operator-lts repository using the GitHub API: https://blog.madkoo.net/2024/03/13/migrate-releases/.
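For reference, a minimal sketch of such a migration using the GitHub REST API might look like the following. This is only an illustration under the assumptions that a kubeflow/training-operator-lts repository exists, that the tags have already been pushed there, and that the token has write access to it; release assets are not copied here.

import requests

SRC = "kubeflow/training-operator"        # source repository
DST = "kubeflow/training-operator-lts"    # destination repository (assumed to exist)
TOKEN = "<personal-access-token>"         # needs write access to the destination
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}

# List releases in the source repository (first page only; follow the Link header for more).
releases = requests.get(
    f"https://api.github.com/repos/{SRC}/releases",
    headers=HEADERS,
    params={"per_page": 100},
).json()

# Re-create each release in the destination; the corresponding tags must already exist there.
for rel in releases:
    resp = requests.post(
        f"https://api.github.com/repos/{DST}/releases",
        headers=HEADERS,
        json={
            "tag_name": rel["tag_name"],
            "name": rel["name"] or rel["tag_name"],
            "body": rel["body"] or "",
            "prerelease": rel["prerelease"],
        },
    )
    resp.raise_for_status()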

@andreyvelich
Member Author

Great find @astefanutti! We can use this tool to migrate GitHub releases as well.

@Priyansh-jsk

option 1

@thesuperzapper
Member

What's good about renaming a repo is that you can still reference it with the old name (in pulls and links).

But I want to say that tags must be immutable, and similarly releases, i.e. we can never delete them.

To keep things clean, we can prefix the new version tags/branches, e.g. trainer-v0.1.0, or just start the trainer at v2.0.0.

@franciscojavierarceo
Contributor

Why don't we start trainer at v2.0.0? 🤔

@andreyvelich
Member Author

Why don't we start trainer at v2.0.0? 🤔

We discussed this at the call. We want to keep the v1 version for the CRDs and the kubeflow SDK since these are brand-new entities.

But I want to say that tags must be immutable, and similarly releases, i.e. we can never delete them.

@thesuperzapper Can you please share what we lose if we delete tags from the repository?
I guess it will break GitOps or Go modules for folks who depend on the upstream GitHub repository, but ideally you should have an internal fork to avoid such problems.
Anything else I am missing?

@thesuperzapper
Member

This is really a case of "you can't have your cake and eat it too": either you are making a new project (which needs a new repo), or you are making a new version of an existing project (which can use the same repo, possibly renamed).

Put another way, how are users meant to see this change: is "Kubeflow Trainer" the V2 version of "Kubeflow Training Operator", or is it a fully new project?

I think the best outcome is to have a clear path of continuity from "Kubeflow Training Operator", and make it clear that:

  1. We are renaming the project to "Kubeflow Trainer", not creating a new project.
  2. We are releasing a backwards-incompatible version at the same time as we are renaming, so its new version is 2.0.0.
  3. We have a plan to keep patching 1.X.X for Y period of time to give people time to transition.

PS: About deleting tags, it's simply never acceptable to change a tag once it's created. It's so foundational to the idea of a tag that people don't even discuss it. Violating this principle is similar to burning history books.

@andreyvelich
Member Author

Put another way, how are users meant to see this change: is "Kubeflow Trainer" the V2 version of "Kubeflow Training Operator", or is it a fully new project?

It is a fully new project, but it has similar goals to the Kubeflow Training Operator.

We are releasing a backwards-incompatible version at the same time as we are renaming, so its new version is 2.0.0.

The challenging part is that we want to keep the CRD version as v1alpha1, which means the CRD and the image versions (e.g. trainer-controller-manager, model-initializer, dataset-initializer, etc.) will be inconsistent.

@andreyvelich
Member Author

Based on @thesuperzapper's feedback, I propose Option 4.
Please let me know what you think @kubeflow/wg-training-leads @franciscojavierarceo @thesuperzapper @Electronic-Waste @chasecadet @astefanutti @Priyansh-jsk

@thesuperzapper
Member

@andreyvelich just to clarify, your current proposal is:

  1. We are renaming "Kubeflow Training Operator" to "Kubeflow Trainer"
  2. We are renaming kubeflow/training-operator to kubeflow/trainer
  3. The first version of "Kubeflow Trainer" is backwards compatible with "Kubeflow Training Operator" and so will be called v1.10.0
  4. We are creating a new repo for the "Trainer SDKs", which will have its own version scheme, starting at v0.1.0

I want to clarify a few things:

  • Are you 100% sure that there will be no breaking changes from v1.9.0 to v1.10.0?
    • If there are, why not use v2.0.0?
    • If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2
  • I suggest we use kubeflow/trainer-sdk rather than kubeflow/sdk:
    • Otherwise it could be confused with the numerous other SDKs maintained by other components (pipelines, model registry, etc.)

@andreyvelich
Member Author

andreyvelich commented Jan 25, 2025

The first version of "Kubeflow Trainer" is backwards compatible with "Kubeflow Training Operator" and so will be called v1.10.0

No, the v1.10.0 version will not be compatible with the Training Operator. We only keep this version to make sure that the CRD APIs and the major version of Kubeflow Trainer are consistent.

If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2

It can give the false impression that the new control plane components of Kubeflow Trainer (the controller manager, dataset and model initializers, and LLM trainer) are in their second version, which is incorrect.

I suggest we use kubeflow/trainer-sdk rather than kubeflow/sdk:

Please check this recording for the context: https://youtu.be/zOsRKCEcMeo?t=1275

Over the past 3 years, we've discussed extensively with @kubeflow/wg-training-leads, @franciscojavierarceo, @astefanutti, and other contributors that we need to provide a simple Python interface for ML engineers and Data Scientists to interact with Kubeflow APIs.
Our roadmap includes adding support for the Katib CRDs (with a potential renaming of CRDs) alongside the existing Kubeflow Trainer CRDs.

Beyond that, this SDK is designed to integrate seamlessly with other tools like Model Registry, Feast, and Spark, delivering a unified and intuitive user experience.

With this approach, users will be able to effortlessly develop AI models using Kubeflow, by simply doing something like this in their Kubeflow Notebooks/Workspace:

$ pip install kubeflow

from kubeflow.spark import SparkClient
from kubeflow.trainer import TrainerClient
from kubeflow.optimizer import OptimizerClient

SparkClient().process_data()
TrainerClient().train()
OptimizerClient().tune()

KFP integrates seamlessly with the Kubeflow SDK to orchestrate end-to-end ML pipelines, if users want to perform E2E MLOps/LLMOps.
We believe that this provides users with an intuitive and efficient AI/ML experience with enough flexibility.

cc @kubeflow/wg-data-leads @ChenYi015 @bigsur0 @shravan-achar @chasecadet

@varodrig

I like option 1

@gaocegege
Member

Personally, I prefer option 1.

@astefanutti
Contributor

The challenging part is that we want to keep the CRD version as v1alpha1, which means the CRD and the image versions (e.g. trainer-controller-manager, model-initializer, dataset-initializer, etc.) will be inconsistent.

CRD versions are independent from their control plane components. It happens quite often that a controller / operator introduces a new CRD starting at v1alpha1, and that CRD gradually graduates independently from the controller version. Kubernetes is a good example, and it also comes with v2 APIs like autoscaling/v2.

I second @thesuperzapper opinion:

If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2

So with Option 4, the TrainJob CRD could start at v1alpha1 and graduate towards v1 when it's ready. Data scientists would primarily interact with the SDK v2, and that CRD would be more of a hidden detail to them, while the platform engineer persona would be familiar with that decoupling of CRD versions from their controller versions.

On the other hand, having the new SDK and operator start at v2 may send the signal that these are built on v1, improve on the lessons learnt, and do not start over from scratch.

@andreyvelich
Member Author

andreyvelich commented Jan 27, 2025

Yes, these are good points @astefanutti.
We synced with @kubeflow/wg-training-leads and discussed the same concerns.
We want to preserve history as @thesuperzapper mentioned, and we want to make it clear that V2 is the next iteration of the Kubeflow Training Operator for existing users.

We propose the following (I updated Option 4 above):

  1. Remove the Training Operator code from the master branch.
  2. Release the Kubeflow Trainer control plane components as V2 in the release-2.0 branch:
docker.io/kubeflow/trainer-controller-manager:v2.0.0
docker.io/kubeflow/dataset-initializer:v2.0.0
docker.io/kubeflow/model-initializer:v2.0.0
docker.io/kubeflow/llm-trainer:v2.0.0
  3. Keep the version for the kubeflow SDK as v0.1.0 for now.
  4. Prepare a design doc to explain how we are going to version the client SDK against other Kubeflow project versions (e.g. Trainer, Katib, etc.)
  5. For now, ask users to install the SDK from GitHub:
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

@thesuperzapper
Member

@andreyvelich if you are using semantic versioning for the SDK, the spec explicitly lists v0.1.0 as the recommended first development version, so we should not start at v0.0.1.

@andreyvelich
Member Author

Good point @thesuperzapper, updated the issue.

@franciscojavierarceo
Contributor

I think Option 4 increases the scope of this issue quite a bit to propose the kubeflow/sdk in general, yeah?

FWIW I am in favor of Option 4 but just want to be explicit about the increase in scope.

@andreyvelich
Member Author

andreyvelich commented Jan 27, 2025

I think Option 4 increases the scope of this issue quite a bit to propose the kubeflow/sdk in general, yeah?

@franciscojavierarceo Yes, but we will work towards establishing this repo and publishing the first release to PyPI:
https://pypi.org/project/kubeflow/

The concern that @johnugeorge has is how we can make it easier to control the dependencies between the client, Kubernetes, and the Kubeflow CRDs (TrainJob, OptimizeJob (e.g. Katib), etc.).
For now, we will ask users to install SDK directly from kubeflow/trainer GitHub:

pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

@StefanoFioravanzo
Member

I love that you are marching towards the creation of an official and maintained Kubeflow SDK with kubeflow/sdk. This is definitely a great step towards providing a better user experience, with unified access to the various components' SDKs. Love it!

cc @ederign

@ederign
Member

ederign commented Jan 28, 2025

Great initiative towards having a unified Kubeflow Python SDK. This will greatly improve the user experience in notebooks.

Model Registry also has its own SDK, and I just want to add @tarilabs @rareddy and @dhirajsb to the loop.

@andreyvelich
Member Author

Model Registry also has its own SDK

Yeah, that is a good point, similar to Feast as we discussed with @franciscojavierarceo.
That should give us an opportunity to consolidate effort and provide seamless integration between Kubeflow projects over the long term.

@franciscojavierarceo
Contributor

Yeah, that is a good point, similar to Feast as we discussed with @franciscojavierarceo.
That should give us an opportunity to consolidate effort and provide seamless integration between Kubeflow projects over the long term.

💯

Once we discuss kubeflow/community#804, I would be happy to incorporate it into the SDK as well. I think a unified SDK that makes Kubeflow easier to work with across the products would be quite wonderful, indeed.

@thesuperzapper
Member

I worry that if we version all of our SDKs at the same time, it could be very difficult to create a sensible versioning strategy.

For example, what if we want to make a breaking change in the model registry SDK but not in any other part, would we need to make a new major version for the overall SDK?

@franciscojavierarceo
Contributor

franciscojavierarceo commented Jan 29, 2025

For example, what if we want to make a breaking change in the model registry SDK but not in any other part, would we need to make a new major version for the overall SDK?

Yeah and I think that's a reasonable trade-off for a quality user experience.

@andreyvelich
Member Author

andreyvelich commented Jan 29, 2025

For example, what if we want to make a breaking change in the model registry SDK but not in any other part

We should discuss this in the proposal, and we should identify what kinds of breaking changes we are talking about.

For example, if a breaking change is introduced into a Kubernetes CRD, we should create a new API version, v1alpha2, and we can have a conversion webhook that allows users to keep submitting the v1alpha1 version.

The control plane and the clients should be independent from each other. For example, we can say that the kubeflow SDK at version v0.1.0 works with specific versions of the control plane components (e.g. Kubeflow Trainer, Kubeflow Optimizer, Model Registry), as sketched below.
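As a purely hypothetical illustration of that idea (the component names and version ranges below are placeholders I made up, not a committed design), the SDK could ship a small compatibility matrix and check installed control plane versions against it:

# Hypothetical compatibility matrix bundled with a kubeflow SDK release.
# All component names and version ranges are placeholders for illustration only.
SUPPORTED_CONTROL_PLANES = {
    "trainer": ("2.0.0", "2.1.0"),     # min / max tested Kubeflow Trainer control plane
    "optimizer": ("1.0.0", "1.1.0"),   # min / max tested Kubeflow Optimizer control plane
}

def is_supported(component: str, installed: str) -> bool:
    """Return True if the installed control plane version falls in the tested range."""
    if component not in SUPPORTED_CONTROL_PLANES:
        return False
    low, high = SUPPORTED_CONTROL_PLANES[component]
    as_tuple = lambda version: tuple(int(part) for part in version.split("."))
    return as_tuple(low) <= as_tuple(installed) <= as_tuple(high)

print(is_supported("trainer", "2.0.0"))  # True under these placeholder ranges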

In the end, it is the cluster admin's responsibility to make sure that the correct version of the control plane is installed in their k8s clusters, and that the correct version of the kubeflow SDK is installed in their Kubeflow Workspaces/Notebooks.

Users just do: train(), optimize(), deploy().
