Fix: catastrophic data loss on NATS connectivity issues #3449
Problem
Closes #3448
We discovered an issue where the current argo-events webhook implementation drops data when it experiences connectivity issues with NATS, while still returning HTTP 200 status codes to the client. This breaks common webhook specifications, including CloudEvents, and will result in catastrophic data loss for those who use argo-events as part of their core infrastructure.
This impacts all of the following EventSources:
Changes
This PR introduces a few changes to address the immediate problem. The goal was to keep the changes as small as possible while also ensuring the problem is thoroughly addressed for all of the impacted EventSources.
- `DataCh` is now a `DispatchChan` which takes `Dispatch` structs. The `Dispatch` struct is a wrapper around the original `[]byte` data that also includes a bundled `SuccessChan` so that the dispatch caller can indicate to the `Dispatch` sender whether the dispatch was successful or not.
- A `DispatchEvent` function has been introduced that is called at the end of the `HandleRoute` methods of the above EventSource types. The logic was already essentially the same for all of them, so it made sense to abstract that out to ensure this critical code path remains correct for all EventSource types.
- `DispatchEvent` now waits until the caller of `dispatch` (in `manageRouteChannels`) sends either `true` or `false` on `Dispatch.SuccessChan`. If `false` is sent, `SendInternalErrorResponse` is called, which sends a 500 HTTP status code to the webhook client to correctly indicate that the saving of the webhook event data failed.
Testing
I have deployed this in a live Kubernetes system and verified that if the EventSources are cut off from NATS (e.g., NATS is spun down, network partition, etc.) a `500` response code is returned (instead of the original `200`).
I have also performed some basic load testing to ensure that this did not hinder the overall system performance (though I would argue correctness around data integrity is significantly more important).
When running the new argo-events build on a very small instance (50m cpu, 50Mi memory), I was able to achieve 400 reqs / sec with a very low client parallelism number (5). This was similar enough to my load testing prior to these changes that I do not believe this impacted overall system performance for the majority of use cases.