vpc-resource-validating-webhook causing pods to fail to create sporadically even though we're not using it #38

Open
Chili-Man opened this issue Jun 23, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@Chili-Man

Describe the Bug:

We don't use the security group for pods feature, so we should not get errors creating pods.

We tried to create a regular pod but received the following error message from the webhook:

https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/webhooks/core/pod_webhook.go#L94

kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b1fea4b3-b577-4845-a4ff-9bd167448adf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 23 Jun 2021 20:00:19 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"mpod.vpc.k8s.aws\" denied the request: Webhood encountered error to Get or List object from k8s cache.","reason":"Webhood encountered error to Get or List object from k8s cache.","code":403}
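
For anyone who wants to tell this specific denial apart from other 403 responses, here is a minimal sketch that inspects the status code and response body shown above. The function name `is_vpc_webhook_denial` is hypothetical; the only details taken from this report are the 403 code and the webhook name `mpod.vpc.k8s.aws` in the `message` field.

```python
import json

# Webhook name as it appears in the error message reported in this issue.
WEBHOOK_NAME = "mpod.vpc.k8s.aws"


def is_vpc_webhook_denial(status_code, body):
    """Return True if an API error looks like a denial from the VPC pod webhook.

    Hypothetical helper: it only checks the HTTP status and whether the
    Status object's "message" field mentions the webhook name.
    """
    if status_code != 403:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Body is missing or is not valid JSON.
        return False
    if not isinstance(payload, dict):
        return False
    message = payload.get("message") or ""
    return WEBHOOK_NAME in message


# Response body copied from the report above (raw string keeps the JSON escapes).
body = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"mpod.vpc.k8s.aws\" denied the request: Webhood encountered error to Get or List object from k8s cache.","reason":"Webhood encountered error to Get or List object from k8s cache.","code":403}'
print(is_vpc_webhook_denial(403, body))
```

With the Python client, the status code and body would typically come from the `status` and `body` attributes of the caught `ApiException`.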

I didn't even know that this admission webhook was installed by default on the EKS clusters until we got this error message.

Observed Behavior:

We got an error from this webhook when trying to create a pod.

Expected Behavior:

I expect the admission webhook not to cause any issues, especially since we're not using the security groups for pods feature.

How to reproduce it (as minimally and precisely as possible):

We're not sure how to reproduce it. The issue happens rarely, after creating lots of pods over time.

Additional Context:

Environment:

  • Kubernetes version (use kubectl version): v1.19.6-eks-49a6c0
  • CNI Version: v1.7.5-eksbuild.1
  • OS (Linux/Windows): Amazon Linux 2
@Chili-Man Chili-Man added the bug Something isn't working label Jun 23, 2021
@abhipth
Contributor

abhipth commented Jun 25, 2021

@Chili-Man dynamically enabling/disabling the webhook based on feature flags is not supported as it would require installing and uninstalling MutatingWebhookConfiguration and ValidatingWebhookConfiguration dynamically on toggling the SGP feature.

Would it be possible to share your cluster ARN, along with the time frame when you saw this issue, at [email protected]? I would like to find out whether the root cause is a bug in the webhook or whether the issue manifested because some other dependency was unhealthy.

One enhancement we can evaluate in the webhook is to admit all Pods without any checks when ENABLE_POD_ENI is set to False, i.e. when Security Group for Pods is disabled. Alternatively, if the issue is caused by, say, the API Server being unhealthy, the webhook should bubble up the underlying error so that a dependency failure is not masked; the current error message doesn't specify the exact reason for the failure.

@Chili-Man
Author

hey @abhipth thanks for the response; I've sent you a follow up email with the requested information. we appreciate the help!

@abhipth
Contributor

abhipth commented Jul 9, 2021

Based on the offline discussion with @Chili-Man we discovered that the failure mode can be triggered in the following scenario.

  1. Create a Service Account.
  2. Immediately create Pod using that Service Account.

If the cache has not been updated with the new Service Account when a request to create a new Pod is intercepted by the Webhook, then the user could see this error.

For any other users who may have been seeing this error: as a short-term resolution, you could add a small delay between the SA creation and the creation of the Pod that uses this SA. We are evaluating simply allowing all Pods, bypassing the SA check, when the SGP feature is disabled.
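
As a rough illustration of this workaround, the sketch below retries Pod creation with a growing delay when the denial comes from the `mpod.vpc.k8s.aws` webhook, which gives the webhook's cache time to pick up the new Service Account. The names `create_pod_with_retry` and `is_webhook_cache_denial`, the attempt count, and the delays are all assumptions made for illustration. The exception check is duck-typed on the `status` and `body` attributes that the Python client's `ApiException` exposes, so the block itself does not import the client.

```python
import time


def is_webhook_cache_denial(exc):
    """Hypothetical check: does this exception look like the webhook denial?

    Duck-typed on the .status and .body attributes of the Python
    client's ApiException; any other exception returns False.
    """
    status = getattr(exc, "status", None)
    body = getattr(exc, "body", None) or ""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return status == 403 and "mpod.vpc.k8s.aws" in body


def create_pod_with_retry(create_pod, attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call create_pod(), retrying only on the webhook denial.

    create_pod is any zero-argument callable that creates the Pod.
    The delay doubles after each failed attempt (assumed values).
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return create_pod()
        except Exception as exc:
            # Re-raise unrelated errors, and give up after the last attempt.
            if attempt == attempts or not is_webhook_cache_denial(exc):
                raise
            sleep(delay)
            delay *= 2
```

With the official Python client, `create_pod` could be something like `lambda: v1.create_namespaced_pod(namespace, pod_manifest)`, where `v1` is a `CoreV1Api` instance; `namespace` and `pod_manifest` are placeholders here.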

@stijndehaes

We are also running into this issue. It introduces flakes in our own controller loop; luckily, we have automatic retries.
@abhipth, if you need someone to test the solution, I am willing to help.

@haouc
Contributor

haouc commented Jun 9, 2022

@stijndehaes thanks for reaching out to us with another case. We haven't been able to finalize our attempt to dynamically check whether the SGP feature is enabled, because the feature can be enabled from VPC CNI in various customized ways. We will keep looking for a reliable way to keep the webhook from interfering with non-SGP pod creation.
