Etcd sends a corrupt snapshot, or a snapshot with a missing hash, in response to a snapshot API call, which causes restoration to fail. #18340
Comments
There are two possible reasons,
|
it was the
we tried but restoration failed with fatal error:
Unfortunately I don't have logs. I have seen this occur twice. The first time was in one of our test clusters, which doesn't have an observability stack, so I'm unable to get logs. The other occurrence was reported by one of our community users: gardener/etcd-backup-restore#749 |
It means the snapshot operation actually failed, so the received snapshot isn't a complete snapshot. |
Yes, it seems so ... Is there a way to verify the integrity of the snapshot, either on the etcd side before sending it or on the etcd client side? For verifying the integrity of the snapshot on the client side, I thought of this. It is similar to how restoration verifies the snapshot before restoring.
What do you think ? |
You need to use the client side error to detect such failure. Lines 216 to 226 in 9a55333
Also I just had a quick read on the server side implementation, it seems that there is a minor issue on the etcd/server/etcdserver/api/v3rpc/maintenance.go Lines 148 to 198 in d6c0127
|
We do handle the error on the client side while taking the etcd snapshot, but I guess the client side didn't return any error. |
Why is this issue not caused by this? To me it looks like the cause: the server sends the snapshot but fails to send the sha256 checksum, so no client-side error is detected and the snapshot appears to have been taken successfully, yet restoration later fails the hash check/validation. |
Bug report criteria
What happened?
While restoring an etcd cluster from an etcd snapshot (taken via the snapshot API call), we observed that the snapshot was either missing its hash value or was corrupted, which caused the restoration to fail.
What did you expect to happen?
Is there a way to detect corruption or a missing hash in a snapshot (taken via the snapshot API call) early, rather than waiting until restoration?
I have the following methods in mind:
This method will work, but starting an embedded etcd and waiting for the restoration to complete can be a time-consuming and costly process.
x revision and compare it with the hash of the snapshot (after removing the appended hash). If they match, the snapshot's integrity is intact; otherwise, retry taking the snapshot until the hashes match.
But I'm not sure how to calculate the hash of the db up to revision x. Is there an API call available for that?
I guess the HashKV API call won't work here: the value returned by HashKV up to revision x can't be equal to the hash of a snapshot taken up to revision x (after removing the appended hash), because HashKV calculates the hash of all MVCC key-values, whereas a snapshot is a snapshot of the etcd db, which also contains cluster information. Hence the hashes will not be the same.
How can we reproduce it (as minimally and precisely as possible)?
Not sure. It should be a rare scenario, but it might occur more frequently than we think, because we only learn about this error during restoration, which is itself rare: we don't restore frequently (due to persistent volumes).
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
We don't have etcd logs.