Changes to attach probes at pod start #3188
base: master
Conversation
Could you add the details on these to the PR description? The cluster setup, and how/what was verified?
cmd/routed-eni-cni-plugin/cni.go
Outdated
K8S_POD_NAMESPACE: string(k8sArgs.K8S_POD_NAMESPACE),
})

// NP agent will never return an error if its not able to delete ebpf probes
This comment is slightly offset from the code below.
- Under what other conditions would err not be nil, or npr not be Success?
- We do seem to error out on a behavior from the NP agent on Del IP, which is a change from the previous behavior.
Today the NP agent won't return an error on delete-probe or bpf program deletion failures: https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go. Probes attached to the veth will be deleted anyway when the pod and veth are deleted. The rest of the cleanup is to avoid a memory leak. Erroring out on this cleanup is a rare case (any eBPF SDK failures), so a very small memory leak should be acceptable if the cleanup fails. When the NP agent restarts (any upgrade or aws-node restart) this memory leak would go away too.
--> Under what other conditions would err not be nil, or npr not be Success?
The NP agent does not return any error for the DeletePodNp request; it logs failures and returns success. I just added the check here for any future changes. It seemed safe to have this check even though we don't throw any errors today.
--> We do seem to error out on a behavior from the NP agent on Del IP, which is a change from the previous behavior.
Even today we don't error out in NP when we are not able to clean up bpf programs: https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go. Other than the memory leak this won't affect any behavior. Throwing an error for that and having kubelet retry the delete is not required.
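For clarity, a minimal sketch of the delete-path behavior described above; the client interface and request/reply types here are illustrative stand-ins, not the exact rpc definitions in this PR:

package cni

import (
	"context"
	"log"
)

// Illustrative stand-ins for the real CNI / NP agent rpc types.
type DeletePodNpRequest struct {
	K8sPodName      string
	K8sPodNamespace string
}

type DeletePodNpReply struct {
	Success bool
}

type npDeleter interface {
	DeletePodNp(ctx context.Context, req *DeletePodNpRequest) (*DeletePodNpReply, error)
}

// cleanupPodNetworkPolicy mirrors the behavior discussed above: the NP agent
// logs cleanup failures and still returns success, so a failure here is only
// logged and never fails the pod delete. Probes on the veth go away with the
// veth; the remaining cleanup only avoids a small memory leak.
func cleanupPodNetworkPolicy(ctx context.Context, c npDeleter, podName, podNamespace string) {
	reply, err := c.DeletePodNp(ctx, &DeletePodNpRequest{
		K8sPodName:      podName,
		K8sPodNamespace: podNamespace,
	})
	if err != nil || reply == nil || !reply.Success {
		log.Printf("DeletePodNp did not succeed for %s/%s: err=%v reply=%+v",
			podNamespace, podName, err, reply)
	}
}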
Removing this error in 41aa1a1 didn't require a mock change?
Hope we covered the rest.
The testing done in the description looks good to me. Let's please clarify the behavior expected with cmdDel above.
LGTM
if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
		string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE), err, npr)
	return errors.New("add cmd: failed to setup network policy")
With the new changes I have, which always attach probes at pod start @jayanthvn @M00nF1sh
-> pod creation will fail for SGP strict mode pods due to the vlan prefix
Two options we have:
-> pod creation should not fail if NP is not able to attach probes. We will wait for the reconcile request to retry attaching probes. The downside is we might run into the same issue of return packet drops, since the pod will not have probes attached at the beginning
-> Check if the veth prefix is "vlanxx" and if so don't make a call to NP to attach probes at pod creation. We can update the docs that we don't support it today and work on this separately (see the sketch below)
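A minimal sketch of that check; the helper name is hypothetical, and the "vlan" host-interface prefix for SGP strict mode pods is taken from the discussion above:

package cni

import "strings"

// shouldAttachProbesAtPodStart returns false for SGP strict mode pods, whose
// host interface uses a "vlan" prefix, so the CNI can skip the NP agent call
// at pod creation until vlan support is handled separately.
func shouldAttachProbesAtPodStart(hostIfName string) bool {
	return !strings.HasPrefix(hostIfName, "vlan")
}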
We will have to fix the vlan prefix since we try to attach in strict mode
We should have a check here for clusters where network policy is disabled and skip applying the default allow policy, since the GRPC endpoint won't be available on the NP agent.
The NP agent gRPC endpoint will be available even if NP is disabled. The enable-network-policy flag lives in the NP agent container and is not visible to the CNI. I removed returning an error when we are not able to connect to the NP agent gRPC endpoint, since customers might have removed the NP agent container when they are not using network policies. If the NP agent container is available and the gRPC endpoint is not available due to some issue, the NP agent will be in a crash loop and we will catch the issue there.
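Roughly, the connection handling then looks like the sketch below; the endpoint address, timeout, and helper name are assumptions, not the exact plugin code:

package cni

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialNPAgent tries to reach the NP agent gRPC endpoint. A connection failure
// is treated as non-fatal, since customers who do not use network policies may
// have removed the NP agent container entirely. If the agent is present but
// its endpoint is broken, the agent itself will crash-loop and surface there.
func dialNPAgent(addr string) *grpc.ClientConn {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock())
	if err != nil {
		log.Printf("could not connect to NP agent at %s, skipping network policy setup: %v", addr, err)
		return nil
	}
	return conn
}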
--> Check if the veth prefix is "vlanxx" and if so don't make a call to NP to attach probes at pod creation. We can update the docs that we don't support it today and work on this separately
--> We will have to fix the vlan prefix since we try to attach in strict mode.
Did we verify and fix this? I don't see the corresponding change.
log.Debugf("Network Policy agent returned Success : %v", npr.Success)
// No need to cleanup IP and network, kubelet will send delete.
if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
Is this for default or strict mode?
I meant the default policy, which will be allow for standard mode and deny for strict mode.
We can close this PR, right @Pavani-Panakanti?
What type of PR is this?
bug
Which issue does this PR fix?:
aws/aws-network-policy-agent#345
What does this PR do / Why do we need it?:
Changes the CNI to call the NP agent to attach probes at pod start. Companion NP agent PR to handle these calls: Fix standard mode return packet drop at pod startup (aws-network-policy-agent#361)
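Roughly, the add path this PR introduces behaves like the sketch below; the request/reply types and helper name are illustrative stand-ins for the actual rpc definitions:

package cni

import (
	"context"
	"errors"
	"log"
)

// Illustrative stand-ins for the real CNI / NP agent rpc types.
type EnforceNpToPodRequest struct {
	K8sPodName        string
	K8sPodNamespace   string
	NetworkPolicyMode string
}

type EnforceNpToPodReply struct {
	Success bool
}

type npEnforcer interface {
	EnforceNpToPod(ctx context.Context, req *EnforceNpToPodRequest) (*EnforceNpToPodReply, error)
}

// attachDefaultPolicyAtPodStart asks the NP agent to attach probes with the
// default policy (allow for standard mode, deny for strict mode) before the
// pod starts serving traffic, so return packets are not dropped at startup.
// Unlike the delete path, a failure here fails the pod add.
func attachDefaultPolicyAtPodStart(ctx context.Context, c npEnforcer, podName, podNamespace, mode string) error {
	reply, err := c.EnforceNpToPod(ctx, &EnforceNpToPodRequest{
		K8sPodName:        podName,
		K8sPodNamespace:   podNamespace,
		NetworkPolicyMode: mode,
	})
	if err != nil || reply == nil || !reply.Success {
		log.Printf("EnforceNpToPod failed for %s/%s: err=%v reply=%+v",
			podNamespace, podName, err, reply)
		return errors.New("add cmd: failed to setup network policy")
	}
	return nil
}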
Testing done on this change:
Setup a generic ipv4 cluster with 3 nodes
EnforceNpToPod Testing - NP enabled
Created a pod and verified the EnforceNpToPod calls are made and probes are being attached to pods as expected in default allow. Verified pod connectivity in the above case
Created multiple pods (10) at once and verified locking is happening in NP as expected and probes are being attached correctly
Did a large scale up and verified calls are being made as expected
EnforceNpToPod Testing - NP disabled
Created pods and verified that NP is returning success immediately
DeletePodNp - NP enabled
Deleted a pod and verified caches are being cleaned up and bpf programs are being deleted as expected
Deleted multiple pods at once and verified cleanup of caches and programs is working as expected
Verified locking on the NP side when multiple calls from the CNI to NP happen at the same time
Did a huge scale down and verified the functionality
DeletePodNp Testing - NP disabled
Deleted pods and verified that NP is returning success immediately
Tested the below upgrade/downgrade functionality
If a pod is created before the upgrade and deleted after the upgrade, we will call DeletePodNp. If the pod has policies applied, there will be probes attached and we will clean up the caches and programs. If the pod does not have any policies applied, we will not find any entries related to this pod in the NP internal caches and we will just return success
If a pod is created after the upgrade and then deleted after a downgrade, no call will be made from the CNI to NP after the downgrade. If the pod has policies applied, probes will be cleaned up as part of the reconcile code flow (this is how it worked before these changes). If no policies are applied, there will be pod info left in the internal caches and the bpf program will not be deleted. Probes will be deleted anyway as part of the pod delete, so this should not be a concern. Internal caches will be reset on NP agent restart
Tested NP enable -> disable and disable -> enable functionality as well
Tested all of the above with ipv6 cluster
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Tested upgrades and downgrades. It will not break any upgrades or downgrades. CNI and NP versions have to be compatible after this change
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
No
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.