AIS-Operator: new proxy replicas cannot join the cluster after deleting the original master replica #208
Thanks for opening, will take a look.

Right now a new node starts off trying to connect to proxy-0 and, ideally, if proxy-0 is not primary, it updates the cluster map provided to the new node, including the current primary. But since in your case proxy-0 is not ready, this fails. To address this, we could have the init container query the proxy service to set the correct primary in the initial config.

However, if I understand correctly, this situation comes up because you're asking it to scale up when proxy-0 can't be scheduled onto a running node. There's no real reason proxies need to be a StatefulSet rather than a Deployment, so we could look into updating that and removing any volume bindings that restrict a proxy to a specific node. That way proxy-0 would simply be rescheduled, and by the time proxy-2 comes up, proxy-0 would be ready to receive requests. (Targets are another issue: they are inherently very stateful, so cordoning and setting up new PVs is a riskier, more manual process.)
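A rough sketch of the init-container idea mentioned above: fetch the cluster map from the proxy service and use its primary's URL as the initial `primary_url`. Note this is a sketch only: the endpoint path (`/v1/daemon?what=smap`) and the JSON field names (`proxy_si`, `public_net.direct_url`) are assumptions for illustration and should be checked against the actual AIS API; a hand-written sample map is used here so the parsing step runs anywhere.

```shell
# In a real init container you would fetch the map from the proxy service, e.g.:
#   SMAP_JSON=$(curl -s "http://aistore-proxy:8080/v1/daemon?what=smap")
# Sample cluster map (hypothetical shape) standing in for the live response:
SMAP_JSON='{"version":3,"proxy_si":{"daemon_id":"p1","public_net":{"direct_url":"http://aistore-proxy-1:8080"}}}'

# Extract the current primary's URL from the map.
PRIMARY_URL=$(python3 -c '
import json, sys
smap = json.loads(sys.argv[1])
print(smap["proxy_si"]["public_net"]["direct_url"])
' "$SMAP_JSON")

# This value would then be written into the initial config as primary_url.
echo "primary: $PRIMARY_URL"
```

The point is that the init container would trust whatever the service reports as primary, instead of hard-coding proxy-0.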
Thanks for your reply @aaronnw.
Agreed. That is more reasonable than updating the primary URL in the global configuration.
In fact, I want to mock a node-failure scenario to test the election process of the aistore proxy and the impact of the intermediate state on file reads and writes.
I am curious: is the data synchronized between proxies just the list of AIS nodes (proxies and targets)?
Just a quick reaction to something that was said earlier: there's no reason, real or imaginary. Proxies can run anywhere, with no restrictions or expectations other than low-latency intra-cluster connectivity.
@eahydra
I was referring to the state PVs we use for caching data, which includes several types of metadata, including the configuration and the cluster map. This can all be synced when a proxy first joins the cluster, so there is no need for long-term storage or any StatefulSet. I believe we originally did this for consistency with the target nodes, which do need to be stateful. AIS already supports the idea of a "discovery" URL; we may be able to simply set this to the headless service as a fallback. Looking into it...
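For context on the "headless service as a fallback" idea: a headless Service resolves, via DNS, to all ready proxy pods rather than a single virtual IP, so any ready proxy can answer a joining node. A minimal sketch, where the service name, namespace, labels, and port are assumptions and would need to match the operator's actual manifests:

```yaml
# Sketch of a headless Service covering all proxy pods (names/labels assumed).
apiVersion: v1
kind: Service
metadata:
  name: aistore-proxy
  namespace: ais
spec:
  clusterIP: None          # headless: DNS returns the ready proxy pod IPs
  selector:
    app: aistore
    component: proxy
  ports:
    - name: http
      port: 8080
```

With a discovery URL pointing at such a service, a new node is not pinned to proxy-0 and can reach whichever proxies are currently ready.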
Is there an existing issue for this?
Describe the bug
Hi guys, I created an aistore cluster with ais-operator; the cluster has two proxy replicas. I tried to simulate a cluster failure to determine whether the cluster would keep working, so I performed the following steps:

1. Deleted the original primary replica, `aistore-proxy-0`. The remaining replica, `aistore-proxy-1`, becomes the primary server.
2. Set `spec.proxySpec.size=3` in the AIStore CRD object to try to increase the scale.
3. The new replica `aistore-proxy-2` failed to join the cluster. From the log, it tried to connect to the original primary replica, because the primary url in the global config is still `aistore-proxy-0`.

Expected Behavior

The new replica `aistore-proxy-2` should connect to the new primary replica `aistore-proxy-1` and successfully join the cluster.

Current Behavior

The new replica `aistore-proxy-2` failed to join the cluster.

Steps To Reproduce

As described in "Describe the bug" above.
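The reproduction steps roughly translate to the kubectl sketch below. The namespace, CR name, and `aistore` resource kind are assumptions and must be adjusted to the actual deployment; the kubectl invocations are shown as comments since they require a live cluster, while the patch JSON itself is validated locally.

```shell
# Hypothetical namespace and custom-resource name; adjust to your deployment.
NS=ais
CR=aistore
PATCH='{"spec":{"proxySpec":{"size":3}}}'

# Sanity-check the patch JSON locally before applying it.
python3 -c 'import json, sys; json.loads(sys.argv[1]); print("patch ok")' "$PATCH"

# Step 1: delete the current primary proxy pod to force an election.
#   kubectl -n "$NS" delete pod aistore-proxy-0
# Step 2: scale the proxies from 2 to 3 through the AIStore CR.
#   kubectl -n "$NS" patch aistore "$CR" --type merge -p "$PATCH"
```

After step 2, watching the logs of the new `aistore-proxy-2` pod shows the failed join attempts described above.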
Possible Solution
I think it would be a good idea to update the global config with the latest primary URL on the next reconcile.
Additional Information/Context
No response
AIStore build/version
latest, ais-operator/latest
Environment details (OS name and version, etc.)
Ubuntu 22.04, K8s v1.30