Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPA: Fail fast on discovery or bundle download errors #3120

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mjungsbluth
Copy link
Collaborator

@mjungsbluth mjungsbluth commented Jun 20, 2024

As part of #3119, this PR changes the behaviour on starting up OPA instances inside the OPA based Skipper filters.

Currently even in the face of download errors of bundles, the default timeout of 30s is applied.

By using listeners both for bundle download and discovery bundle download, we can detect a failure early despite the respective plugins still waiting for the bundles to become available.

Still a draft because tests are missing, there are some cleanups in the code that need to be done.

Pending: Unregister discovery and bundle plugin listeners.

@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch from 73bdf5a to 583fc58 Compare June 28, 2024 16:07
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch 5 times, most recently from c838d65 to 0b106c2 Compare July 2, 2024 13:13
@Pushpalanka Pushpalanka added optimization bugfix Bug fixes and patches and removed optimization labels Jul 3, 2024
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch 2 times, most recently from 0f696aa to 6f69e3c Compare July 3, 2024 21:41
@Pushpalanka
Copy link
Collaborator

Pushpalanka commented Jul 4, 2024

Addressing the race condition in the upstream as a part of this reported issue.

@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch 3 times, most recently from d32f5f3 to 45183e6 Compare July 17, 2024 07:19
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch from 4a1d25e to 0fe55f1 Compare July 22, 2024 16:02
@Pushpalanka Pushpalanka marked this pull request as ready for review July 22, 2024 16:51
@Pushpalanka Pushpalanka requested review from szuecs and ZunairaSid July 22, 2024 16:52
filters/openpolicyagent/openpolicyagent.go Outdated Show resolved Hide resolved
filters/openpolicyagent/openpolicyagent.go Outdated Show resolved Hide resolved
filters/openpolicyagent/openpolicyagent.go Outdated Show resolved Hide resolved
filters/openpolicyagent/openpolicyagent.go Show resolved Hide resolved
filters/openpolicyagent/openpolicyagent.go Outdated Show resolved Hide resolved
filters/openpolicyagent/openpolicyagent.go Outdated Show resolved Hide resolved
failed chan error,
prefix string,
) {
if status.Code == "bundle_error" {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels like we are only handling a subset of errors, is this really reliable? My understanding was that the Code only gives information about what went wrong and that length(status.Errors) essentially gives yoj if anything goes wrong…

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That indeed is more intuitive, but had below concerns.

When looking at the OPA implementation below, status.Errors it is set only in 1 occasion out of the 3. But the other 2 occasions set status.Errors to nil even though it is setting an error. Based on the existing implementation status.Code is most reliable as I see.

I also understand if the bundle_error constant is ever modified then we are in trouble. But then again, just relying on code to be non-empty is also not good if the code got ever set in an non-errornous scenario.

Status Error code segment, /plugins/bundle/status.go#L64-L97
const (
	errCode = "bundle_error"
)

func (s *Status) SetError(err error) {
	var (
		astErrors ast.Errors
		httpError download.HTTPError
	)
	switch {
	case err == nil:
		s.Code = ""
		s.HTTPCode = ""
		s.Message = ""
		s.Errors = nil

	case errors.As(err, &astErrors):
		s.Code = errCode
		s.HTTPCode = ""
		s.Message = types.MsgCompileModuleError
		s.Errors = make([]error, len(astErrors))
		for i := range astErrors {
			s.Errors[i] = astErrors[i]
		}

	case errors.As(err, &httpError):
		s.Code = errCode
		s.HTTPCode = json.Number(strconv.Itoa(httpError.StatusCode))
		s.Message = err.Error()
		s.Errors = nil

	default:
		s.Code = errCode
		s.HTTPCode = ""
		s.Message = err.Error()
		s.Errors = nil
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like we want to reach out to styra to make a proper error handling possible.
I also wonder about the typing here. Status.HTTPCode being a string and httpError not being an error, what the hack is this code?

Copy link
Collaborator

@Pushpalanka Pushpalanka Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will raise this in the OPA community and get back.

PS: open-policy-agent/opa#6983 raised with minimum changes that will help us here. Will take it forward based on review.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went ahead with the bundle_error constant as a change there will be communicated in OPA release notes.

filters/openpolicyagent/openpolicyagent_test.go Outdated Show resolved Hide resolved
@Pushpalanka Pushpalanka marked this pull request as draft August 1, 2024 15:00
@Pushpalanka
Copy link
Collaborator

Pushpalanka commented Aug 1, 2024

Moving to draft as using OPA 0.67.0, which is yet to be taken by Skipper dependency upgrades.

@MustafaSaber
Copy link
Member

Moving to draft as using OPA 0.67.0, which is yet to be taken by Skipper dependency upgrades.

OPA 0.67.1 is merged now.

@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch 2 times, most recently from f2b5ca6 to 47d31f2 Compare August 18, 2024 22:05
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch from 6849365 to 07ebb8c Compare October 11, 2024 09:35
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch from 07ebb8c to ed756e6 Compare October 11, 2024 09:54
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch 2 times, most recently from 1bfa347 to e67d998 Compare October 18, 2024 13:03
@Pushpalanka Pushpalanka force-pushed the opa_fail_fast_startup branch from e67d998 to 0b857aa Compare October 18, 2024 13:17
@Pushpalanka Pushpalanka marked this pull request as ready for review October 21, 2024 08:18
@Pushpalanka
Copy link
Collaborator

Requesting a review as the PR goes green without the 2 ToDo's to be addressed to unregister the listeners.
Unregistering is removed as it prematurely unregistering the listeners without handling the errors. Appreciate inputs.

@Pushpalanka Pushpalanka marked this pull request as draft October 22, 2024 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix Bug fixes and patches
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants