Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: catastrophic data loss on NATS connectivity issues #3449

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fullykubed
Copy link
Contributor

@fullykubed fullykubed commented Jan 29, 2025

Problem

Closes #3448

We discovered an issue where the current argo-events webhook implementation will drop data when experiencing connectivity issues with NATS while still returning HTTP 200 status codes to the client. This breaks common webhook specifications including CloudEvent's and will result in catastrophic data loss for those who are using argo-events as a part of their core infrastructure.

This impacts all of the following EventSources:

  • awssns
  • bitbucket
  • bitbucketserver
  • github
  • gitlab
  • slack
  • gerrit
  • slack
  • storagegrid
  • stripe
  • webhook

Changes

This PR introduces a few changes to address the immediate problem. The goal was to keep the changes as small as possible while also ensuring the problem is thoroughly addressed for all of the impacted EventSources.

  1. Instead of using a unidirectional channel for sending event payloads to the dispatch function, the DataCh is now a DispatchChan which takes Dispatch structs. The Dispatch struct is a wrapper around the original []byte data that also includes a bundled SuccessChan so that the dispatch caller can indicate to Dispatch sender whether the dispatch was successful or not.
  2. A new DispatchEvent function has been introduced that is called at the end of the HandleRoute methods of the above EventSource types. The logic was already essentially the same for all of them, so it made sense to abstract that out to ensure this critical code path remains correct for all EventSource types.
  3. DispatchEvent now waits until the caller of dispatch (in manageRouteChannels) sends either true or false on Dispatch.SuccessChan. If false is sent, SendInternalErrorResponse is called which sends a 500 HTTP status code to the webhook client to correctly indicate that the saving of the webhook event data failed.
  4. I have updated the unit tests and some of the log messages to reflect the new names and data structures.

Testing

I have deployed this in a live Kubernetes system and verified that if the EventSources are cut off from NATS (e.g., NATS is spun down, network partition, etc.) a 500 response code is returned (instead of the original 200).

I have also performed some basic load testing to ensure that this did not hinder the overall system performance (though I would argue correctness around data integrity is significantly more important).

When running the new argo-event build on a very small instance (50m cpu, 50Mi memory), I was able to achieve 400 reqs / sec with a very slow client parallelism number (5). This was similar enough to my load testing prior to these changes that I do not believe this impacted overall system performance for the majority of use cases.

Signed-off-by: fullykubed <[email protected]>
@fullykubed fullykubed requested a review from whynowy as a code owner January 29, 2025 04:06
@fullykubed fullykubed changed the title Fix: catastrophic data loss on NATs connectivity issues Fix: catastrophic data loss on NATS connectivity issues Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data Loss: Webhook EventSource drops data and returns 200 when failing to publish to NATS
1 participant