
Create GitHub Repository for Kubeflow Trainer #2402

Open
andreyvelich opened this issue Jan 23, 2025 · 33 comments

Comments

@andreyvelich
Member

andreyvelich commented Jan 23, 2025

At the latest AutoML and Training WG call, we discussed how we can create a new GitHub repository and release the Kubeflow Trainer. We want to keep the v1alpha1 API version for TrainJob and TrainingRuntime, and introduce a new kubeflow Python SDK starting from version 0.1.0.

(Updated 2025-01-27.) After discussions, we decided to move forward with Option 4.

We explored four options for the Kubeflow Trainer project:

Option 1. Migrate the kubeflow/training-operator to a new repository.

Steps:

  • Migrate all branches and tags from kubeflow/training-operator to kubeflow/training-operator-lts
  • [Breaking Change] Tell users to add this line to their Go projects if they use the kubeflow/training-operator Go modules:
replace github.com/kubeflow/training-operator v1.9.0 => github.com/kubeflow/training-operator-lts v1.9.0
  • Rename kubeflow/training-operator to kubeflow/trainer.
  • Remove V1 code from the master branch of kubeflow/trainer.
  • Release the first version of kubeflow/trainer on the release-0.1 branch with the v0.1.0 tag (we should delete the existing v0.1 branch and v0.1.0 tag).
  • After a grace period, delete all V1 GitHub releases, tags, and branches from the kubeflow/trainer repository.

Pros:

  • Preserves repository stars and history.
  • Starts Kubeflow Trainer releases with the v0.1.0 version (after deleting the existing v0.1 branch and v0.1.0 tag, as noted above).
  • We can make major and minor releases for the Kubeflow Training Operator project in the kubeflow/training-operator-lts repository, for example a v1.10.0 release.

Cons:

  • Breaking change for users who depend on the kubeflow/training-operator Go modules.

Option 2. Create a new repository kubeflow/trainer

Steps:

  • Create a brand-new GitHub repository for the Kubeflow Trainer project.
  • Release the first version of kubeflow/trainer on the release-0.1 branch with the v0.1.0 tag.

Pros:

  • No breaking change for users who depend on kubeflow/training-operator Go modules.

Cons:

  • Loses the 7.5-year history of the Kubeflow Training Operator project.
  • May confuse users about whether Kubeflow Trainer is a new or updated version of the Kubeflow Training Operator project.

Option 3. Start kubeflow/trainer with v1.10.0 release.

Steps:

  • Rename kubeflow/training-operator to kubeflow/trainer
  • Remove the V1 code from the master branch of kubeflow/trainer.
  • Release the first version of kubeflow/trainer using the release-1.10 branch with the v1.10.0 tag.

Pros:

  • No breaking change for users who depend on the kubeflow/training-operator Go modules.

Cons:

  • Users may be confused about why Kubeflow Trainer and the kubeflow SDK start with version v1.10.
  • We can't release a new minor version for the Kubeflow Training Operator (e.g. v1.10.0).

Option 4 (updated 2025-01-27). Separate Kubeflow Trainer control plane from client SDK.

Steps:

  • Rename kubeflow/training-operator to kubeflow/trainer
  • Remove the V1 code from the master branch of kubeflow/trainer.
  • Release the first version of the Kubeflow Trainer components as v2.0.0 on the release-2.0 branch:
docker.io/kubeflow/trainer-controller-manager:v2.0.0
docker.io/kubeflow/dataset-initializer:v2.0.0
docker.io/kubeflow/model-initializer:v2.0.0
docker.io/kubeflow/llm-trainer:v2.0.0
  • Update the SDK version on GitHub to v0.1.0 as an initial release.
  • Tell users to install the kubeflow SDK from the kubeflow/trainer GitHub repository temporarily (see the sketch after this list).
  • Begin drafting a design document for a new GitHub repository: kubeflow/sdk. This repository will host the Kubeflow Python client and potentially expand to include clients for other languages in the future (e.g., Rust, Swift, Java). The design document should outline how Kubernetes versions and other project dependencies will be managed effectively.
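As a rough illustration of the last two steps, installing and using the SDK could look like the sketch below. The pip command is the one proposed later in this thread; the TrainerClient().train() call mirrors the SDK example further down, and any parameters it takes are still to be defined, so treat this as an assumption rather than the final interface.

# Temporary install path until kubeflow/sdk exists (assumes the SDK stays in the sdk/ subdirectory):
#   pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

from kubeflow.trainer import TrainerClient

# Submit a TrainJob through the v1alpha1 API; the exact parameters of train()
# are illustrative only and may change once the SDK design doc is finalized.
TrainerClient().train()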

Pros:

  • We don't introduce any Breaking Changes to users of kubeflow/training-operator.
  • We can release new minor versions for Training Operator (e.g. release-1.10).
  • It is clear that V2 is the next major version for existing users of Kubeflow Training Operator.
  • The Kubeflow Trainer control plane and the Kubeflow client SDK have different release lifecycles. Thus, we can release the SDK to PyPI without releasing the Kubeflow Trainer control plane.
  • We can expand the Kubeflow SDK to support other Kubeflow APIs (e.g. Optimizer, Evaluator, etc.)

Cons:

  • It is challenging to manage test infra between kubeflow/trainer and kubeflow/sdk.
    However, I think we can deal with that if we keep examples in the Kubeflow Trainer GitHub repository and use these examples to run E2E tests in both kubeflow/trainer and kubeflow/sdk.
    Also, we can explore other options to improve our test coverage.

Personally, I prefer Option 4 or Option 1.

Please let us know if we have other ideas @kubeflow/wg-training-leads @kubeflow/release-team @astefanutti @kannon92 @ahg-g @kubeflow/kubeflow-steering-committee @thesuperzapper @kubeflow/wg-manifests-leads @franciscojavierarceo @Electronic-Waste @seanlaii @deepanker13 @saileshd1402 @vsoch @shravan-achar @akshaychitneni @helenxie-bit @kubeflow/release-managers @zijianjoy @james-jwu

@andreyvelich
Member Author

cc @akgraner @chasecadet

@Electronic-Waste
Member

I would also support Option 1, since it preserves the history of the project and makes the v1alpha1 TrainJob API straightforward for users.

However, the v0.1.0 release conflicts with an existing release: https://github.com/kubeflow/tf-operator/tree/v0.1.0. We may need to deal with it.

@andreyvelich
Member Author

However, the v0.1.0 release conflicts with an existing release: https://github.com/kubeflow/tf-operator/tree/v0.1.0. We may need to deal with it.

Good point! I added this to the Option 1 steps.

@chasecadet

I like option 1.

@franciscojavierarceo
Contributor

franciscojavierarceo commented Jan 24, 2025

For Option 1, can we migrate the releases to the kubeflow/training-operator-lts repo? My preference would be to retain the release history on GitHub (even if we redirect users to the new repo). We should also be aware that we'll want to update PyPI.

@astefanutti
Contributor

astefanutti commented Jan 24, 2025

I'd also favor Option 1 as it brings the best outcomes. There might be an additional impact / con for projects that maintain a downstream fork of the kubeflow/training-operator repository, as they'll have to "re-route" it to kubeflow/training-operator-lts.

Just for completeness of the solution space, there could be an option 4 where the repository would be renamed to kubeflow/trainer and the operator and SDK versions would start at 2.x (the TrainJob and TrainingRuntime APIs would still start at v1alpha1 and follow the usual graduation). The only con would then mainly be that "2.x" start. Given that both the "v2" operator and SDK build on v1, that could be deemed acceptable.

@astefanutti
Contributor

For option 1, it seems it might be possible to copy/transfer the v1 releases over to the new kubeflow/training-operator-lts repository using the GitHub API: https://blog.madkoo.net/2024/03/13/migrate-releases/.
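For reference, a minimal sketch of such a migration using the GitHub REST API might look like the following. This is only an illustration under the assumptions that a kubeflow/training-operator-lts repository exists, that the tags have already been pushed there, and that the token has write access to it; release assets are not copied here.

import requests

SRC = "kubeflow/training-operator"        # source repository
DST = "kubeflow/training-operator-lts"    # destination repository (assumed to exist)
TOKEN = "<personal-access-token>"         # needs write access to the destination
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}

# List releases in the source repository (first page only; follow the Link header for more).
releases = requests.get(
    f"https://api.github.com/repos/{SRC}/releases",
    headers=HEADERS,
    params={"per_page": 100},
).json()

# Re-create each release in the destination; the corresponding tags must already exist there.
for rel in releases:
    resp = requests.post(
        f"https://api.github.com/repos/{DST}/releases",
        headers=HEADERS,
        json={
            "tag_name": rel["tag_name"],
            "name": rel["name"] or rel["tag_name"],
            "body": rel["body"] or "",
            "prerelease": rel["prerelease"],
        },
    )
    resp.raise_for_status()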

@andreyvelich
Member Author

Great find @astefanutti! We can use this tool to migrate GitHub releases as well.

@Priyansh-jsk

option 1

@thesuperzapper
Member

What's good about renaming a repo is that you can still reference it with the old name (in pulls and links).

But I want to say that tags must be immutable, and similarly releases, i.e. we can never delete them.

To keep things clean, we can prefix the new version tags/branches, e.g. trainer-v0.1.0, or just start the trainer at v2.0.0.

@franciscojavierarceo
Contributor

Why don't we start trainer at v2.0.0? 🤔

@andreyvelich
Member Author

Why don't we start trainer at v2.0.0? 🤔

We discussed this at the call. We want to keep the v1 version for the CRDs and the kubeflow SDK since these are brand-new entities.

But I want to say that tags must be immutable, and similarly releases, i.e. we can never delete them.

@thesuperzapper Can you please share what we lose if we delete tags from the repository?
I guess it will break GitOps or Go modules for folks who depend on the upstream GitHub repository, but ideally you should have an internal fork to avoid such problems.
Anything else I am missing?

@thesuperzapper
Member

This is really a case of "you can't have your cake and eat it too": either you are making a new project (which needs a new repo), or you are making a new version of an existing project (which can use the same repo, possibly renamed).

Put another way, how are users meant to see this change: is "Kubeflow Trainer" the V2 version of "Kubeflow Training Operator", or is it a fully new project?

I think the best outcome is to have a clear path of continuity from "Kubeflow Training Operator", and make it clear that:

  1. We are renaming the project to "Kubeflow Trainer", not creating a new project.
  2. We are releasing a backwards-incompatible version at the same time as we are renaming, so its new version is 2.0.0.
  3. We have a plan to keep patching 1.X.X for Y period of time to give people time to transition.

PS: About deleting tags, it's simply never acceptable to change a tag once it's created. It's so foundational to the idea of a tag that people don't even discuss it. Violating this principle is similar to burning history books.

@andreyvelich
Member Author

Put another way, how are users meant to see this change: is "Kubeflow Trainer" the V2 version of "Kubeflow Training Operator", or is it a fully new project?

It is a fully new project, but it has similar goals to the Kubeflow Training Operator.

We are releasing a backwards-incompatible version at the same time as we are renaming, so its new version is 2.0.0.

The challenging part is that we want to keep the CRD version as v1alpha1, which means the CRD and the image versions (e.g. trainer-controller-manager, model-initializer, dataset-initializer, etc.) will be inconsistent.

@andreyvelich
Member Author

Based on @thesuperzapper's feedback, I propose Option 4.
Please let me know what you think @kubeflow/wg-training-leads @franciscojavierarceo @thesuperzapper @Electronic-Waste @chasecadet @astefanutti @Priyansh-jsk

@thesuperzapper
Member

@andreyvelich just to clarify, your current proposal is:

  1. We are renaming "Kubeflow Training Operator" to "Kubeflow Trainer"
  2. We are renaming kubeflow/training-operator to kubeflow/trainer
  3. The first version of "Kubeflow Trainer" is backwards compatible with "Kubeflow Training Operator" and so will be called v1.10.0
  4. We are creating a new repo for the "Trainer SDKs", which will have its own version scheme, starting at v0.1.0

I want to clarify a few things:

  • Are you 100% sure that there will be no breaking changes from v1.9.0 to v1.10.0?
    • If there are, why not use v2.0.0?
    • If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2
  • I suggest we use kubeflow/trainer-sdk rather than kubeflow/sdk:
    • Otherwise it could be confused with the numerous other SDKs maintained by other components (pipelines, model registry, etc.)

@andreyvelich
Member Author

andreyvelich commented Jan 25, 2025

The first version of "Kubeflow Trainer" is backwards compatible with "Kubeflow Training Operator" and so will be called v1.10.0

No, the v1.10.0 version will not be compatible with the Training Operator. We only keep this version to make sure that the CRD APIs and the major version of Kubeflow Trainer are consistent.

If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2

It can give the false impression that the new control plane components of Kubeflow Trainer (the controller manager, dataset and model initializers, and LLM trainer) are in their second version, which is incorrect.

I suggest we use kubeflow/trainer-sdk rather than kubeflow/sdk:

Please check this recording for the context: https://youtu.be/zOsRKCEcMeo?t=1275

Over the past 3 years, we've discussed extensively with @kubeflow/wg-training-leads, @franciscojavierarceo, @astefanutti, and other contributors that we need to provide a simple Python interface for ML engineers and Data Scientists to interact with Kubeflow APIs.
Our roadmap includes adding support for the Katib CRDs (with a potential renaming of CRDs) alongside the existing Kubeflow Trainer CRDs.

Beyond that, this SDK is designed to integrate seamlessly with other tools like Model Registry, Feast, and Spark, delivering a unified and intuitive user experience.

With this approach, users will be able to effortlessly develop AI models using Kubeflow, by simply doing something like this in their Kubeflow Notebooks/Workspace:

$ pip install kubeflow

from kubeflow.spark import SparkClient
from kubeflow.trainer import TrainerClient
from kubeflow.optimizer import OptimizerClient

SparkClient().process_data()
TrainerClient().train()
OptimizerClient().tune()

KFP integrates seamlessly with the Kubeflow SDK to orchestrate end-to-end ML pipelines, if users want to perform E2E MLOps/LLMOps.
We believe that this provides users with an intuitive and efficient AI/ML experience with enough flexibility.

cc @kubeflow/wg-data-leads @ChenYi015 @bigsur0 @shravan-achar @chasecadet

@varodrig

I like option 1

@gaocegege
Member

Personally, I prefer option 1.

@astefanutti
Contributor

The challenging part is that we want to keep the CRD version as v1alpha1, which means the CRD and the image versions (e.g. trainer-controller-manager, model-initializer, dataset-initializer, etc.) will be inconsistent.

CRD versions are independent from their control plane components. It happens quite often that a controller / operator introduces a new CRD starting at v1alpha1, and that CRD gradually graduates independently from the controller version. Kubernetes is a good example, and it also comes with v2 APIs like autoscaling/v2.

I second @thesuperzapper opinion:

If it's because the CRD versions might not be aligned, I don't think that's a problem, and it's more important that we indicate we are breaking/removing something by bumping to v2

So with Option 4, the TrainJob CRD could start at v1alpha1 and graduate towards v1 when it's ready. Data scientists would primarily interact with the SDK v2, and that CRD would be more of a hidden detail to them, while the platform engineer persona would be familiar with that decoupling of CRD versions from their controller versions.

On the other hand, having the new SDK and operator start at v2 may send the signal that these are built on v1, improve on the lessons learnt, and do not start over from scratch.

@andreyvelich
Member Author

andreyvelich commented Jan 27, 2025

Yes, these are good points @astefanutti.
We synced with @kubeflow/wg-training-leads and discussed the same concerns.
We want to preserve history as @thesuperzapper mentioned, and we want to make it clear that V2 is the next iteration of the Kubeflow Training Operator for existing users.

We propose the following (I updated Option 4 above):

  1. Remove the Training Operator code from the master branch.
  2. Release the Kubeflow Trainer control plane components as V2 in the release-2.0 branch:
docker.io/kubeflow/trainer-controller-manager:v2.0.0
docker.io/kubeflow/dataset-initializer:v2.0.0
docker.io/kubeflow/model-initializer:v2.0.0
docker.io/kubeflow/llm-trainer:v2.0.0
  3. Keep the version for the kubeflow SDK as v0.1.0 for now.
  4. Prepare a design doc to explain how we are going to version the client SDK against other Kubeflow project versions (e.g. Trainer, Katib, etc.)
  5. For now, ask users to install the SDK from GitHub:
pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

@thesuperzapper
Member

@andreyvelich if you are using semantic versioning for the SDK, the spec explicitly lists v0.1.0 as the recommended first development version, so we should not start at v0.0.1.

@andreyvelich
Member Author

Good point @thesuperzapper, updated the issue.

@franciscojavierarceo
Contributor

I think Option 4 increases the scope of this issue quite a bit to propose the kubeflow/sdk in general, yeah?

FWIW I am in favor of Option 4 but just want to be explicit about the increase in scope.

@andreyvelich
Member Author

andreyvelich commented Jan 27, 2025

I think Option 4 increases the scope of this issue quite a bit to propose the kubeflow/sdk in general, yeah?

@franciscojavierarceo Yes, but we will work towards establishing this repo and publishing the first release to PyPI:
https://pypi.org/project/kubeflow/

The concern that @johnugeorge has is how we can make it easier to control the dependencies between the client, Kubernetes, and the Kubeflow CRDs (TrainJob, OptimizeJob (e.g. Katib), etc.).
For now, we will ask users to install SDK directly from kubeflow/trainer GitHub:

pip install git+https://github.com/kubeflow/trainer.git@master#subdirectory=sdk

@StefanoFioravanzo
Member

I love that you are marching towards the creation of an official and maintained Kubeflow SDK with kubeflow/sdk. This is definitely a great step towards providing a better user experience, with unified access to the various components' SDKs. Love it!

cc @ederign

@ederign
Member

ederign commented Jan 28, 2025

Great initiative towards having a unified Kubeflow Python SDK. This will greatly improve the user experience in notebooks.

Model Registry also has its own SDK, and I just want to add @tarilabs @rareddy and @dhirajsb to the loop.

@andreyvelich
Member Author

Model Registry also has its own SDK

Yeah, that is a good point, similar to Feast as we discussed with @franciscojavierarceo.
That should give us an opportunity to consolidate effort and provide seamless integration between Kubeflow projects over the long term.

@franciscojavierarceo
Contributor

Yeah, that is a good point, similar to Feast as we discussed with @franciscojavierarceo.
That should give us an opportunity to consolidate effort and provide seamless integration between Kubeflow projects over the long term.

💯

Once we discuss kubeflow/community#804, I would be happy to incorporate it into the SDK as well. I think a unified SDK that makes Kubeflow easier to work with across the products would be quite wonderful, indeed.

@thesuperzapper
Member

I worry that if we version all of our SDKs at the same time, it could be very difficult to create a sensible versioning strategy.

For example, what if we want to make a breaking change in the model registry SDK but not in any other part, would we need to make a new major version for the overall SDK?

@franciscojavierarceo
Contributor

franciscojavierarceo commented Jan 29, 2025

For example, what if we want to make a breaking change in the model registry SDK but not in any other part, would we need to make a new major version for the overall SDK?

Yeah and I think that's a reasonable trade-off for a quality user experience.

@andreyvelich
Member Author

andreyvelich commented Jan 29, 2025

For example, what if we want to make a breaking change in the model registry SDK but not in any other part

We should discuss this in the proposal, and we should identify what kinds of breaking changes we are talking about.

For example, if a breaking change is introduced into a Kubernetes CRD, we should create a new API version, v1alpha2, and we can have a conversion webhook that allows users to keep submitting the v1alpha1 version.

The control plane and the clients should be independent from each other. For example, we can say that the kubeflow SDK at version v0.1.0 works with specific versions of the control plane components (e.g. Kubeflow Trainer, Kubeflow Optimizer, Model Registry), as sketched below.
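As a purely hypothetical illustration of that idea (the component names and version ranges below are placeholders I made up, not a committed design), the SDK could ship a small compatibility matrix and check installed control plane versions against it:

# Hypothetical compatibility matrix bundled with a kubeflow SDK release.
# All component names and version ranges are placeholders for illustration only.
SUPPORTED_CONTROL_PLANES = {
    "trainer": ("2.0.0", "2.1.0"),     # min / max tested Kubeflow Trainer control plane
    "optimizer": ("1.0.0", "1.1.0"),   # min / max tested Kubeflow Optimizer control plane
}

def is_supported(component: str, installed: str) -> bool:
    """Return True if the installed control plane version falls in the tested range."""
    if component not in SUPPORTED_CONTROL_PLANES:
        return False
    low, high = SUPPORTED_CONTROL_PLANES[component]
    as_tuple = lambda version: tuple(int(part) for part in version.split("."))
    return as_tuple(low) <= as_tuple(installed) <= as_tuple(high)

print(is_supported("trainer", "2.0.0"))  # True under these placeholder ranges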

In the end, it is the cluster admin's responsibility to make sure that the correct version of the control plane is installed in their k8s clusters, and that the correct version of the kubeflow SDK is installed in their Kubeflow Workspaces/Notebooks.

Users just do: train(), optimize(), deploy().
