📖 Add In-place updates proposal #11029
Conversation
thanks for the write up.
i left some comments, but i did not do a detailed review of the controller interaction (diagrams) part.
> An External Update Extension implementing custom update strategies will report the subset of changes they know how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.
>
> If the totality of the required changes cannot be covered by the defined extensions, Cluster API will allow falling back to the current behavior (rolling update).
i think this might be only what some users want. IMO, if an in-place update fails, it should fail and give the signal for it. there could be a "fallback" option with default value "false", but it also opens up some questions - what if the external update tampered with objects in a way that the fallback is no longer possible? i think that in-place upgrades should be a "hard-toggle" i.e. it's either replace or in-place. no fallbacks from CAPI's perspective.
The logic could also use a fallback scenario in case of a timeout or some general condition. It might not scale well with multiple upgraders, but having options here would seem beneficial.
Since the changes are constrained to a single machine, machine replace should still work?
> what if the external update tampered with objects in a way that the fallback is no longer possible

You mean there is (or there will be) a case where an external update can do something that a rollout update can't? If that happens, we can introduce some verification logic to determine whether it can fall back.
I'd be in favor of having a possibility to disable the fallback to rollout updates. In some cases, users would want only certain fields to be handled in-place, for example instance tags; if any other fields were changed it should be ok to do a rollout update.
@neolit123 A couple of clarifications:
- The fallback strategy is not meant for the scenario where the in-place update starts and fails. In this case, the update will remain in a "failed" state until either the user manually intervenes or remediation (if configured) kicks in and deletes the failed machine. The fallback strategy is meant for when the external updaters cannot handle the desired update. In other words, when capi detects the need for an update, it queries the external updaters and decides to either start an in-place update or a rolling update (fallback strategy). But once it makes that decision and the update starts, it doesn't switch strategies.
- We were thinking that the fallback strategy would be optional. TBD if opt-in or opt-out, pending the discussion on the API changes.
> As this proposal is an output of the In-place updates Feature Group, ensuring that the rollout extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.
>
> Please note that the practical consequence of focusing on in-place rollout strategies, is that the possibility to implement different types of custom rollout strategies, even if technically possible, won’t be validated in this first iteration (future goal).
by 'validated', do you mean something CAPI will maintain e2e tests for?
i would think there could be some community owned e2e tests for this.
@neolit123 can you take a look at the "Test Plan" section at the end of the proposal? The initial plan was to have it in CAPI CI.
What this paragraph tries to say is that although the concept of "external updater" theoretically allows implementing different types of update strategies (other than in-place), our focus here is to ensure that it can be used to implement in-place updates, and that's what we will validate.
Makes sense. Is it even a future goal though? (not sure what other update strategies might be and how additional test coverage for them would look like)
Basically I wonder if it makes sense to have additional test coverage beyond what we need to validate in-place & the core CAPI functionality.
We should validate that our core CAPI implementation works. We don't have to validate in core CAPI CI that various ways of implementing an updater works (similar to how we test core CAPI today with CAPD, but not with an implementation for AWS, Google Cloud, ...)
> ### Non-Goals
>
> - To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
to the earlier points, if in-place fails, how would the controllers know to leave it to the user for a manual fix vs rolling out the machine?
Controllers will never rollout the machine in case of in-place update failure. At most, MHC might mark the machine for remediation. But that's a separate process.
To @neolit123's point, this should be configurable; not everyone will want to fall back.
Yes, the idea is for the update fallback strategy to be optional, just as MHC remediation already is.
> (end of the sequence diagram, closing with `mach->>apiserver: Mark Machine as updated`)
i think the diagram is missing the feedback signal from the external updater to the CAPI controllers, i.e. whether the update has passed and what the follow-up for them is?
Yeah, that's correct. This is a high level flow that simplifies certain things. The idea is to help get a high level understanding of the flow with subsequent sections digging into the details of each part of the flow.
> If this set is reduced to zero, then CAPI will determine that the update can be performed using the external strategy. CAPI will define the update plan as a list of sequential external updaters in a particular order and proceed to execute it. The update plan will be stored in the Machine object as an array of strings (the names of the selected external updaters).
>
> If after iterating over all external updaters the remaining set still contains uncovered changes, CAPI will determine the desired state cannot be reached through external updaters. If a fallback rolling update strategy has been configured (this is optional), CAPI will replace the machines. If no fallback strategy is configured, we will surface the issue in the resource status. Machines will remain unchanged and the desired state won't be reached unless remediated by the user. Depending on the scenario, users can: amend the desired state to something that the registered updaters can cover, register additional updaters capable of handling the desired changes, or simply enable the fallback strategy.
How can the order of external upgraders be defined? There will be implicit requirements which will make them dependent on each other.
Since the idea is to iterate over an array of upgraders, this should support multiple iterations and a more clever mechanism than subtraction. One iteration will not be enough to mark the desired state unreachable.
With the current proposal, updaters need to be independent in order to be scheduled in the same upgrade plan. Updaters just look at the set of required changes and tell capi what subset of changes they can take care of. And they need to be capable of updating those fields regardless of how many other updaters are scheduled and no matter if they run before or after.
If for some reason an updater needs certain fields to be updated first before being able to execute its update, then two update plans will be needed, hence the change would need to be performed by the user in two phases.
We could (probably in future iterations) add a "priority" property to the updaters that would help order updaters when they have overlapping functions. However, this would be a global priority and not relative between updaters.
Now, all that said, this is what we are proposing, which might not cover all use cases. Do you have a particular use case where order matters and updaters must be dependent on each other?
> Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates: they first detect if an update is required and then, based on the configured strategy, follow the appropriate update logic (note that today there is only one valid strategy, `RollingUpdate`).
>
> With the `ExternalUpdate` strategy, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them, or an empty set, signaling it cannot handle any of the desired changes.
If we're falling back to rolling update, to @neolit123's point, it doesn't make sense to me that ExternalUpdate is a rollout strategy on its own; rather, it should be a field, or set of fields, within rolling update that controls its behavior.
Note that technically a rolling update doesn't have to be a replace operation; it can be done in place, so imo it can be expanded.
That's an interesting point. I'm not against representing external updates as a subtype of the rolling update strategy. You are right that with what we are proposing here, CAPI follows a rolling update process, except it delegates the machine update instead of replacing the machine by itself. But capi orchestrates the rolling process.
As long as we can represent the fallback as optional, I'm ok with this if folks think it makes more sense.
> CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update reaches a terminal state.
>
> Once the update completes, the Machine controller will remove the name of the updater that has finished from the list of updaters and will start the next one. If the update fails, this will be reflected in the Machine status.
It sounds like we're tracking state and keeping this state in the Machine controller itself. This is usually a common source of issues given that the state can drift from reality. Have we considered having the set of hooks only ever be present on the MachineDeployment object, with the Machine object only containing its status, so that every updater has to be 1) re-entrant and 2) track where it "left off"?
This way, the status can be calculated from scratch at every iteration, rather than relying on sync calls and other means of strict operations.
I don't think I follow. What state are you referring to? The list of updaters to be run?
Answering your other question: yeah, we opted to have the set of hooks at the Machine level because that allows reusing the same mechanism for both KCP and MD machines.
Regarding re-entrance for updaters: yeah, that is the idea here (it might need more clarification in the doc). CAPI will continue to call the `/UpdateMachine` endpoint of an updater until it either returns success or failure. It's up to the updater to track the "update progress". Or maybe I didn't understand your comment correctly?
From my understanding reading the proposal, it sounds like we're building a plan and tracking it in the Machine spec itself, which can be error prone. I'd suggest instead to find an approach that's ultimately declarative: declare the plan somewhere else and reflect the status of that plan in the Machine status.
Ah I understand now, thanks for the clarification. Yeah this sounds very reasonable, let me give it a thought and I'll come back to it.
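A minimal sketch of the re-entrant flow discussed in this thread, with hypothetical names (the real hook types and where the plan is persisted are still TBD in the proposal): the remaining plan is re-read on every pass, `/UpdateMachine` is safe to call repeatedly, and a finished updater is simply popped from the plan.

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// updateMachineResponse loosely mirrors the JSON example shown later in the proposal.
type updateMachineResponse struct {
	Status     string // "InProgress", "Done" or "Failed"
	RetryAfter time.Duration
	Error      string
}

// updater abstracts one registered external updater's /UpdateMachine endpoint.
type updater interface {
	UpdateMachine(ctx context.Context, machineName string) (updateMachineResponse, error)
}

// runPlan drives the remaining updaters for one Machine. It keeps no state of its own:
// it can be re-entered at any time, and each updater tracks its own progress.
func runPlan(ctx context.Context, machineName string, plan []string, updaters map[string]updater) ([]string, error) {
	for len(plan) > 0 {
		u, ok := updaters[plan[0]]
		if !ok {
			return plan, fmt.Errorf("updater %q is not registered", plan[0])
		}
		resp, err := u.UpdateMachine(ctx, machineName)
		if err != nil {
			return plan, err
		}
		switch resp.Status {
		case "InProgress":
			time.Sleep(resp.RetryAfter) // a real controller would requeue instead of sleeping
		case "Failed":
			// Surfaced in the Machine status; no rollback and no automatic switch to rolling update.
			return plan, fmt.Errorf("updater %q failed: %s", plan[0], resp.Error)
		case "Done":
			plan = plan[1:] // persist the shortened plan, then move on to the next updater
		default:
			return plan, fmt.Errorf("unknown status %q", resp.Status)
		}
	}
	return plan, nil
}
```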
> * A way to define different rules for Machines undergoing an update. This might involve new fields in the MHC object. We will decouple these API changes from this proposal. For the first implementation of in-place updates, we might decide to just disable remediation for Machines that are undergoing an update.
>
> ### API Changes
It'd be great to start the proposal with an example of how we envision the end state to look, from defining the state, and provide an in-depth example with KCP, the kubeadm bootstrap provider, and an example infra provider (like AWS or similar).
Are you suggesting we do this before we define the API changes? or as part of that work?
We purposefully left the API design for later so we can focus the conversation on the core ideas and high level flow and make sure we are aligned there first.
Without seeing the API changes we're proposing it's generally hard to grasp the high level concept. I would like to see it from a user/operator perspective:
- How will we set up this feature in yaml?
- What are the required pieces that we need to install?
- Are there any assumptions we're making?
Added a section with examples of what we believe are 3 of the most common scenarios.
> We propose a pluggable update strategy architecture that allows External Update Extension to handle the update process. The design decouples core CAPI controllers from the specific extension implementation responsible for updating a machine. The External Update Strategy will be configured by reusing the existing strategy field in KCP and MD resources and introducing a new type of strategy called `ExternalUpdate`. This allows us to provide a consistent user experience: the interaction with the CAPI resources is the same as in rolling updates.
>
> This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
The proposal is missing details on how the external updater logic would work, and how the "kind of changes they can handle" part is handled. How is that going to work?
I think it'd be good for the proposal to include a reference external updater implementation and shape it around one common/trivial driving use case, e.g. perform an in-place rolling update of the kubernetes version for a pool of Nodes. Then we can grasp and discuss design implications for RBAC, drain...
@enxebre In the 'test plan' section we mention a "CAPD Kubeadm Updater", which will be a reference implementation and also used for testing.
What do you mean by "how is that going to work?"? Are you referring to how the external updater knows what the desired changes are? Or how the external updater computes which changes it can perform and which it can't?
Trying to give a generic answer here: the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the CanUpdateRequest. Then it will respond with something like an array of fields for those objects (kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]), which would signal the subset of fields that it can update.
@enxebre
The idea of opening the draft at this stage for review is to get feedback on the core ideas and high level flow before we invest more time in this direction. Unless you think that a reference implementation is necessary to have these discussions, I would prefer to avoid that work.
That said, I totally get that it's possible that the lack of detail in certain areas is making it difficult to have the high level discussion. If that's the case, we are happy to add that detail wherever needed.
> Trying to give a generic answer here, the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the CanUpdateRequest. Then it will respond with something like an array of fields for those objects (kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]), which would signal the subset of fields that it can update.

These details must be part of the proposal. The details of how the entire flow goes from the MachineDeployment, to the external request, back to the Machine, and how status is reflected are not present, which makes it hard to understand the technical flow and/or propose alternative solutions.
> * More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
> * Single node cluster without extra hardware available.
> * `TODO: looking for more real life usecases here`
can we include certificate rotation in the use case?
That's a great use case. However, I'm not sure if we should add it because what we have in this doc doesn't really solve that problem.
The abstractions/ideas we present can totally be used for cert rotation. However, what we have only covers changes triggered by updates to the KCP/MD specs. If I'm not mistaken, in-place cert rotation would be a separate process, similar to what capi does today, where the expiration date of certs is tracked in the background and handled separately from machine rollouts.
Opinions?
You could use credential rotation instead, e.g. authorized keys for ssh (which can be configured via KubeadmConfig)
Hey folks 👋 @g-gaston Dropping by from the Flatcar Container Linux project - we're a container optimised Linux distro; we joined the CNCF a few weeks ago (incubating). We've been driving implementation spikes of in-place OS and Kubernetes updates in ClusterAPI for some time - at the OS level. Your proposal looks great from our point of view.

While progress has been slower in the recent months due to project resource constraints, Flatcar has working proof-of-concept implementations for both in-place updating the OS and Kubernetes - independently. Our implementation is near production ready on the OS level, update activation can be coordinated via kured, and the worker cluster control plane picks up the correct versions. We do lack any signalling to the management cluster as well as more advanced features like coordinated roll-backs (though this would be easy to implement on the OS level). In theory, our approach of in-place Kubernetes updates is distro agnostic (given the "mutable sysext" changes in recent versions of systemd starting with release 256).

We presented our work in a CAPZ office hours call earlier this year: https://youtu.be/Fpn-E9832UQ?feature=shared&t=164 (slide deck: https://drive.google.com/file/d/1MfBQcRvGHsb-xNU3g_MqvY4haNJl-WY2/view). We hope our work can provide some insights that help to further flesh out this proposal. Happy to chat if folks are interested. (CC: @tormath1 for visibility)

EDIT after initial feedback from @neolit123: in-place updates of Kubernetes in CAPI are in "proof of concept" stage. Just using sysexts to ship Kubernetes (with and without CAPI) has been in production on (at least) Flatcar for quite some time. Several CAPI providers (OpenStack, Linode) use sysexts as preferred mechanism for Flatcar worker nodes.
i don't think i've seen usage of sysext with k8s. its provisioning of image extensions seems like something users can do, but they might as well stick to the vanilla way of using the k8s package registries and employing update scripts for e.g. containerd. the kubeadm upgrade docs just leverage the package manager upgrade way. one concern that i think i have with systemd-sysext is that you still have an intermediate build process for the extension, while the k8s package build process is already done by the k8s release folks.
On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros. I'd argue that the overhead is negligible: download release binaries into a sub-directory and run a single command.

Drawbacks of the packaging process are:

Sysexts are already used by the ClusterAPI OpenStack and the Linode providers with Flatcar (though without in-place updates).
> If after iterating over all external updaters the remaining set still contains uncovered changes

How do we envision this to take place? Diffing each field?
I envision something like: capi would generate a set with all the fields that are changing for an object by diffing current state with desired state. Then, as it iterates over the updaters, it would remove fields from the set. If it finishes iterating over the updaters and there are still fields left in the set, then the update can't be performed in-place.
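A minimal sketch of that subtraction, under the same assumptions (hypothetical interface and function names; a single pass over independent updaters, as the current proposal describes):

```go
package sketch

import "context"

// canUpdater wraps one registered external updater's CanUpdate call: given the set of
// changed field paths, it returns the subset it can handle in place.
type canUpdater interface {
	Name() string
	CanUpdate(ctx context.Context, changes []string) (covered []string, err error)
}

// planInPlaceUpdate assumes the diff has already been computed: changedFields is the
// set of fields differing between current and desired state. It returns the ordered
// list of updater names (the update plan) and whether every change is covered; if not,
// the caller either falls back to a rolling update (when enabled) or surfaces the
// problem in the resource status.
func planInPlaceUpdate(ctx context.Context, changedFields []string, updaters []canUpdater) (plan []string, covered bool, err error) {
	remaining := make(map[string]struct{}, len(changedFields))
	for _, f := range changedFields {
		remaining[f] = struct{}{}
	}

	for _, u := range updaters {
		if len(remaining) == 0 {
			break
		}
		pending := make([]string, 0, len(remaining))
		for f := range remaining {
			pending = append(pending, f)
		}
		handled, err := u.CanUpdate(ctx, pending)
		if err != nil {
			return nil, false, err
		}
		if len(handled) == 0 {
			continue // this updater cannot help with any remaining change
		}
		plan = append(plan, u.Name())
		for _, f := range handled {
			delete(remaining, f) // subtract the covered fields
		}
	}
	return plan, len(remaining) == 0, nil
}
```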
the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. are sysexts capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?
Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons; I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker. (Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)
while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without

i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.
Shipping this file in a sysext is straightforward. In fact, the kubernetes sysexts we publish in our "sysext bakery" include it.

That's what originally motivated me to speak up: the proposal appears to discuss the control plane "upper half" that our proof-of-concept implementation lacks. As stated, we're OS folks :) And we're very happy to see this gets some traction.
@t-lo thanks for reaching out! really appreciated. +1 from me to keep the discussion on this PR focused on the first layer. But great to see things are moving for the Flatcar Container Linux project; let's make sure the design work that is happening here does not prevent using Flatcar's in-place upgrade capabilities (but at the same time, we should make sure it could work with other OSes as well, even the ones less "cloud native").
It would also be nice to ensure the process is compatible with, or at least gears well with, talos.dev, which is managed completely by a set of controllers that expose just an API. Useful for single-node long-lived clusters. As far as I read, I see no complications yet for it.
Hello folks,

We've briefly discussed systemd-sysext and its potential uses for ClusterAPI in the September 25, 2024 ClusterAPI meeting (https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#heading=h.s6d5g3hqxxzt). Summarising the points made here so you don't need to watch the recording 😉.

Let's wrap up the sysext discussion in this PR so we can get the focus back to in-place updates. If there's more interest in this technology from ClusterAPI folks I'm happy to have a separate discussion (here: #11227).
Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...

Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.

OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.
Yes, that should be doable. That said, although I'm not familiar with the rebuild functionality, that sounds like something that the infra provider could implement today without the in-place update functionality.
> ## Motivation
>
> Cluster API by default performs rollouts by deleting a machine and creating a new one.
Suggested change: "Cluster API by default performs rollouts by creating a new machine and deleting the old one."

Isn't the flow the other way around? This has implications for example on bare-metal when you don't have a +1 spare machine to start your rollouts with, which is the reason why in-place updates would be needed.
A few lines down:

> Support for delete first strategy, thus making it easier to do immutable rollouts on bare metal / environments with constrained resources.

But I'm having trouble thinking of compact wording for "(creating and deleting) or (deleting and creating)", so maybe it's fine to leave this line as it stands and hope folks get a few lines down and find the delete-first line before getting too concerned?
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: md-1-2
```
The name should probably change here to indicate this is a new resource, not editing an existing one.
> The MachineDeployment controller updates machines in place in a very similar way to rolling updates: by creating a new MachineSet and moving the machines from the old MS to the new one. We want to stress that the Machine objects won't be deleted and recreated like in the current rolling strategy. The MachineDeployment will just update the OwnerRefs, effectively moving the existing Machine object from one MS to another. The number of machines moved at once might be made configurable on the MachineDeployment in the same way `maxSurge` and `maxUnavailable` control this for rolling updates.
nit: in addition to the OwnerRef, a label on the Machine will need to be updated to match the selector in the new MachineSet.
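For illustration, a rough Go sketch of that "move", assuming controller-runtime and the CAPI v1beta1 API types; the real controller would use patch helpers, keep non-controller owner references, and manage its own unique template-hash label, none of which is shown here.

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// moveMachineToMachineSet re-parents an existing Machine under a new MachineSet without
// deleting or recreating it: only the controller owner reference and the labels matched
// by the new MachineSet's selector change.
func moveMachineToMachineSet(ctx context.Context, c client.Client, scheme *runtime.Scheme, m *clusterv1.Machine, newMS *clusterv1.MachineSet) error {
	// For simplicity this sketch drops all existing owner references before pointing
	// the Machine at the new MachineSet as its controller.
	m.OwnerReferences = nil
	if err := controllerutil.SetControllerReference(newMS, m, scheme); err != nil {
		return err
	}

	// Make sure the Machine carries the labels the new MachineSet selects on.
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	for k, v := range newMS.Spec.Selector.MatchLabels {
		m.Labels[k] = v
	}

	return c.Update(ctx, m)
}
```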
> 2. CP/MD Controller: query defined update extensions, and based on the subset of changes each of them supports, defines the final update plan.
> 3. CP/MD Controller: mark machines for update.
> 4. Machine Controller: invoke all the updaters included in the plan, sequentially, one by one.
> 5. Machine Controller: make machine is updated.
Suggested change: "5. Machine Controller: mark machine as updated."
/lgtm
@anmazzotti: changing LGTM is restricted to collaborators.
Would it make sense to move this PR out of draft status?
> - Enable the implementation of in-place update strategies.
> - Allow users to update Kubernetes clusters using pluggable External Update Extension.
> - Maintain a coherent user experience for both rolling and in-place updates.
What about the OnDelete strategy of MDs?
Probably we just shouldn't try in-place if OnDelete is configured? (so maybe it's a non-goal?)
+1 to add this as a non-goal
> - To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
> - Introduce any changes to KCP (or any other control plane provider), MachineDeployment, MachineSet, Machine APIs.
> - Amend the desired state to something that the registered updaters can cover or register additional updaters capable of handling the desired changes.
Sorry, I don't understand what that means, can you expand / rephrase?
> We propose a pluggable update strategy architecture that allows External Update Extension to handle the update process.
>
> Initially, this feature will be implemented without making API changes in the current core Cluster API objects. It will follow Kubernetes' feature gate mechanism and be contained within the experimental package. This means that any changes in behavior are controlled by the feature gate `InPlaceUpdates`, which must be enabled by users for the new in-place updates workflow to be available. It is disabled unless explicitly configured.
> be contained within the experimental package

Not sure if that is desirable. In general I would like to challenge the idea of putting experimental code into explicit exp packages (we stopped doing this in some cases already, and I'm not aware that Kubernetes is doing that at all).
Also, how would we modify KCP/MD code if every piece of code we touch has to be moved into an exp package? I think this has a lot more downsides than benefits.
@sbueringer the runtime extensions are still in the exp folder. I agree the KCP/MD code for in-place updates should live in the same place as all controller code.
+1 to drop from the proposal details about code organization, we can discuss this during/before implementation (but I agree with Stefan, most of the code changes will be in existing controllers)
I'm definitely fine with modifying existing code where it is (both the code in regular packages as well as the ones in exp package).
I just don't want us moving part of the KCP/MD/Machine controllers out of their current locations and into an exp package. I don't see any benefit of doing that.
In general fine to just not mention that here in the proposal. My general stance is that we should simply use the feature gate wherever we need it and not additionally move the code also into exp packages. (we don't have to cleanup existing exp packages, we'll do that eventually).
@sbueringer that totally makes sense
```json
{
  "error": null,
  "status": "InProgress",
  "tryAgain": "5m0s"
}
```
Suggested change: `"retryAfterSeconds": "5m0s"` instead of `"tryAgain": "5m0s"`.

nit: This should reuse the existing CommonRetryResponse in my opinion (i.e. embed it).
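A possible shape following that nit, assuming CAPI's existing runtime hooks API package; the actual response type for this hook is still TBD in the proposal.

```go
package sketch

import (
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// UpdateMachineResponse embeds the existing CommonRetryResponse, which already carries
// a status, a message and a RetryAfterSeconds field, instead of introducing a new
// "tryAgain" duration field.
type UpdateMachineResponse struct {
	runtimehooksv1.CommonRetryResponse `json:",inline"`
}
```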
> `type: UpToDate`
>
> This process is repeated a third time with the last KCP machine, finally marking the KCP object as up to date.
nit: the above only shows it for one Machine, right? So this here should be "is repeated for the second and third KCP Machine"
> The request is also made to `kcp-version-upgrade` but it responds with an empty array, indicating it cannot handle any of the changes:
nit: I think the diagram above is missing the "kcp-version-upgrade" upgrader
```diff
+ cluster.x-k8s.io/cluster-name: cluster1
+ cluster.x-k8s.io/deployment-name: md-1
```
None of these labels should change; there's another one for MS that should change.
(I think some of this is missing in the first example)
```json
{
  "changes": ["machine.spec.version", "bootstrap.spec.clusterConfiguration.kubernetesVersion"],
```
Feel free to ignore if the details are TBD.
This only covers that these fields are changed, but not from which value to which value.
Does this mean that if an updater supports in-place updating a specific field they have to be able to handle all possible value changes?
(e.g. let's say they can update v1.28=>v1.29, but don't support upgrading v1.29=>v1.30 or v1.28=>v1.30, there are more complex cases of course)
I think the disclaimer about message content not being fully spec-ed out applies here as well.

I think this goes in the right direction.
> As a cluster service provider, I want guidance/documentation on how to write an external update extension for my own use case.
>
> #### Story 7
>
> As a bootstrap/controlplane provider developer, I want guidance/documentation on how to reuse some parts of this pluggable external update mechanism.
Q: Which parts could be reused by bootstrap/control plane providers?
Is this referring to the changes in KCP? Let's be slightly more specific here.
I would drop this user story for a couple of reasons:
- It is too early; we don't know what this logic will look like and if it can be reused in some way
- I think we should preserve the possibility to change the KCP implementation as we see fit, including the upgrade workflow
- We don't have deep knowledge of other control plane providers
Please note that this doesn't mean we should not try to factor stuff into a sort of library during implementation, but it is something we should figure out along the way.
Sgtm
Awesome work, in-place WG team!
Mostly a few nits and cleanup from my side. If we keep iterating quickly on feedback I really think we can get this merged after KubeCon.
> __External Update Lifecycle Hook__: CAPI Lifecycle Runtime Hook to invoke external update extensions.
>
> __External Update Extension__: a Runtime Extension (implementation), i.e. a component responsible for performing in-place updates when the `External Update Lifecycle Hook` is invoked.
nit: What about dropping External (all the Extensions are External)
> Even if the project continues to improve immutable rollouts, most probably there are and there will always be some remaining use cases where it is complex for users to perform immutable rollouts, or where users perceive immutable rollouts to be too disruptive to how they are used to manage machines in their organization:
> * More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
> * Single node cluster without extra hardware available.
I would suggest to move this to Future goals (it should be covered by the design, but AFAIK we are not really focusing on this use case)
> With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems, allowing them to implement a custom rollout strategy to be triggered via a new external update extension point implemented using the existing runtime extension framework.
>
> With the implementation of custom rollout strategy, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.
Suggested change:

> With this proposal, Cluster API provides a new extensibility point for users willing to implement their own specific solution for these problems by implementing an Update extension.
>
> With the implementation of an update extension, users can take ownership of the rollout process and embrace in-place rollout strategies, intentionally trading off some of the benefits that you get from immutable infrastructure.

I will drop references to the custom upgrade strategy for now.
> ### Divide and conquer
>
> As this proposal is an output of the In-place updates Feature Group, ensuring that the external update extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.
>
> Please note that the practical consequence of focusing on in-place rollout strategies, is that the possibility to implement different types of custom rollout strategies, even if technically possible, won’t be validated in this first iteration (future goal).
>
> Another important point to surface, before digging into implementation details of the proposal, is the fact that this proposal is not tackling the problem of improving CAPI to embrace all the possibilities that external update extensions are introducing. E.g. If an external update extension introduces support for in-place updates, using “BootstrapConfig” (emphasis on bootstrap) as the place where most of the machine configurations are defined seems not ideal.
>
> However, at the same time we would like to make it possible for Cluster API users to start exploring this field, gain experience, and report back so we can have concrete use cases and real-world feedback to evolve our API.
In line with recent discussions I would propose to update this paragraph into:

> ### Divide and conquer
>
> Considering the complexity of this topic, a phased approach is required to design and implement the solution for in-place upgrades.
>
> The main goal of the first iteration of this proposal is to make it possible for Cluster API users to start experimenting with in-place upgrades, so we can gather feedback and evolve to the next stage.
>
> This iteration will focus on implementing the machinery required to interact with upgrade extensions, while user facing changes in the API types are deferred to follow up iterations.
> ### Goals
>
> - Enable the implementation of in-place update strategies.
Suggested change: "- Enable the implementation of pluggable update extensions"
> Remediation can be used as the solution to recover a machine when an in-place update fails on it. The remediation process stays the same as today: the MachineHealthCheck controller monitors machine health status and marks it to be remediated based on pre-configured rules, then the ControlPlane/MachineDeployment replaces the machine or calls external remediation.
>
> However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, we require:
Suggested change: "However, in-place updates might cause Nodes to become unhealthy while the update is in progress. In addition, an in-place update might take more (or less) time than a fresh machine creation. Hence, in order to successfully use MHC to remediate in-place updated Machines, in a future iteration of this proposal we will consider:"
> ### Examples
>
> *All functionality related to In-Place Updates will be available only if the `InPlaceUpdates` feature flag is set to true.*
I think this sentence should be moved into the proposal paragraph.
> Since the fallback to machine replacement is a default strategy and always enabled, the MachineDeployment controller proceeds with the rollout process as it does today, replacing the old machines with new ones.
>
> ### API Changes
I will drop this section; we will get back to it at a later stage.
> ### Risks and Mitigations
>
> 1. One of the risks for this process could be that during a single node cluster in-place update, extension implementation might decline the update and that would result in falling back to rolling update strategy by default, which could possibly lead to breaking a cluster. For the first iteration, users must ensure that the changes they make will be accepted by their updater.
Let's drop this if single node cluster becomes a future goal.
> However, each external updater should define their own security model. Depending on the mechanism used to update machines in-place, different privileges might be needed, from scheduling privileged pods to SSH access to the hosts. Moreover, external updaters might need RBAC to read CAPI resources.
>
> ### Risks and Mitigations
I will add a short description about the risks due to the complexity of this change, which are mitigated by implementing this feature in incremental steps and by avoiding user facing changes in the first iteration.
discussing this today at the office hours, the plan is to merge by lazy consensus 1 or 2 weeks after kubecon.
What this PR does / why we need it:
Proposal doc for In-place updates written by the In-place updates feature group.
Starting this as a draft to collect early feedback on the main ideas and high level flow. APIs and some other lower level details are left purposefully as TODOs to focus the conversation on the rest of the doc, speed up consensus and avoid rework.
Fixes #9489
/area documentation