Changes to attach probes at pod start #3188

Open
wants to merge 6 commits into base: master

Conversation

@Pavani-Panakanti (Contributor) commented Feb 3, 2025

What type of PR is this?
bug

Which issue does this PR fix?:
aws/aws-network-policy-agent#345

What does this PR do / Why do we need it?:

  • On every pod add, make a call to the NP agent to configure eBPF probes. If NP is enabled, the NP agent configures probes for the pod accordingly; if NP is disabled, it returns success. Strict mode already behaved this way; we are making the same change for standard mode to fix the GitHub issue above (see the sketch after this list).
  • On every pod delete, make a call to the NP agent to clean up eBPF probes. If NP is enabled, the NP agent cleans up the caches and probes associated with the pod; if not, it returns success. Pod delete will never throw an error from NP.
    NP agent PR to handle these calls - Fix standard mode return packet drop at pod startup aws-network-policy-agent#361
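
For illustration only, a minimal sketch of the add-path call described above. The npflow package, npBackend, npReply, and setupPodNetworkPolicy names are hypothetical stand-ins for the generated NP backend gRPC client and the CNI plumbing; only the EnforceNpToPod name and the err/Success handling pattern come from this PR.

package npflow

import (
	"context"
	"errors"
	"log"
)

// npReply is a hypothetical stand-in for the NP agent's gRPC response.
type npReply struct {
	Success bool
}

// npBackend is a hypothetical stand-in for the generated NP backend gRPC client.
type npBackend interface {
	// Called once per pod add; the agent attaches default probes, or returns
	// success immediately when network policy is disabled.
	EnforceNpToPod(ctx context.Context, podName, podNamespace string) (*npReply, error)
}

// setupPodNetworkPolicy mirrors the add-path handling quoted later in this PR:
// a gRPC error or a non-Success reply fails the pod add, so a pod never starts
// without its probes attached.
func setupPodNetworkPolicy(ctx context.Context, np npBackend, podName, podNamespace string) error {
	npr, err := np.EnforceNpToPod(ctx, podName, podNamespace)
	if err != nil || !npr.Success {
		log.Printf("Failed to setup default network policy for Pod %s/%s: GRPC returned - %v, agent returned - %+v",
			podNamespace, podName, err, npr)
		return errors.New("add cmd: failed to setup network policy")
	}
	return nil
}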

Testing done on this change:

  • Added unit tests
  • Manual tests of pod creation and deletion with NP disabled and NP enabled.

Set up a generic IPv4 cluster with 3 nodes.

EnforceNpToPod Testing - NP enabled
Created a pod and verified that the EnforceNpToPod call is made and probes are attached to the pod as expected in default allow. Verified pod connectivity in this case.
Created multiple pods (10) at once and verified that locking happens in NP as expected and probes are attached correctly.
Did a large scale-up and verified calls are made as expected.

EnforceNpToPod Testing - NP disabled
Created pods and verified that NP returns success immediately.

DeletePodNp Testing - NP enabled
Deleted a pod and verified that caches are cleaned up and BPF programs are deleted as expected.
Deleted multiple pods at once and verified that cleanup of caches and programs works as expected.
Verified locking on the NP side when multiple calls from the CNI to NP happen at the same time.
Did a large scale-down and verified the functionality.

DeletePodNp Testing - NP disabled
Deleted pods and verified that NP returns success immediately.

Tested the following upgrade/downgrade scenarios:
If a pod is created before upgrade and deleted after upgrade, we call DeletePodNp. If the pod has policies applied, probes are attached and we clean up the caches and programs. If the pod has no policies applied, we find no entries for the pod in NP's internal caches and simply return success.
If a pod is created after upgrade and then deleted after downgrade, no call is made from the CNI to NP after the downgrade. If the pod has policies applied, probes are cleaned up as part of the reconcile code flow (this is how it worked before these changes). If no policies are applied, pod info is left in the internal caches and the BPF program is not deleted; the probes are deleted anyway as part of pod delete, so this should not be a concern. Internal caches are reset on NP agent restart.

Tested NP enable -> disable and disable -> enable functionality as well.
Tested all of the above with an IPv6 cluster.

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Tested upgrades and downgrades; this change will not break them. The CNI and NP agent versions have to be compatible after this change.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@orsenthil (Member)

--> Manual tests of pod creation and deletion with NP disabled and NP enabled.

Could you add the details on these to the PR description: the cluster setup, and how/what was verified?

K8S_POD_NAMESPACE: string(k8sArgs.K8S_POD_NAMESPACE),
})

// NP agent will never return an error if it is not able to delete ebpf probes
Member

This comment slightly offsets the code below.

  • Under what other conditions would err not be nil or npr not be Success?
  • We do seem to error out based on behavior from the NP agent on Del IP, which is a change from the previous behavior.

Contributor Author

Today in the NP agent we don't return an error on delete-probe or BPF program deletion failures (https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go). Probes attached to the veth are deleted anyway when the pod and its veth are deleted; the rest of the cleanup is to avoid memory leaks. Erroring out on that cleanup is a rare case (an eBPF SDK failure), so a very small memory leak is acceptable if cleanup fails. When the NP agent restarts (on any upgrade or aws-node restart), this memory leak goes away too.

--> Under what other conditions would err not be nil or npr not be Success?

The NP agent does not return any error for the DeletePodNp request; it logs failures and returns success. I just added the check here for any future changes. It seemed safe to have this check even though we don't throw any errors today.

--> We do seem to error out based on behavior from the NP agent on Del IP, which is a change from the previous behavior.

Even today we don't error out in NP when we are not able to clean up BPF programs (https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go). Other than the memory leak, this does not affect any behavior. Throwing an error for that and having kubelet retry the delete is not required.
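
For illustration, a minimal sketch of the delete-path contract described in this thread, reusing the hypothetical npReply shape from the earlier sketch; npDeleter and cleanupPodNetworkPolicy are assumptions, and only the DeletePodNp name and the log-and-continue behavior come from this discussion.

// npDeleter is a hypothetical stand-in for the delete side of the NP backend client.
type npDeleter interface {
	DeletePodNp(ctx context.Context, podName, podNamespace string) (*npReply, error)
}

// cleanupPodNetworkPolicy never blocks pod deletion: failures are logged only.
func cleanupPodNetworkPolicy(ctx context.Context, np npDeleter, podName, podNamespace string) {
	npr, err := np.DeletePodNp(ctx, podName, podNamespace)
	if err != nil || !npr.Success {
		// Defensive check only: today the agent logs its own cleanup failures
		// and still replies Success; any small leak is cleared when the NP
		// agent restarts.
		log.Printf("DeletePodNp for Pod %s/%s did not succeed (err=%v, reply=%+v); continuing with delete",
			podNamespace, podName, err, npr)
	}
}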

Member

Removing this error (41aa1a1) didn't require a mock change?
Hope we covered the rest.

@orsenthil (Member)

The testing done in the description looks good to me. Let's please clarify the behavior expected with cmdDel above.

orsenthil previously approved these changes Feb 6, 2025

@orsenthil (Member) left a comment

LGTM

if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
		string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE), err, npr)
	return errors.New("add cmd: failed to setup network policy")
Contributor Author

With the new changes I have, which always attach probes at pod start @jayanthvn @M00nF1sh:
-> pod creation will fail for SGP strict mode pods due to the vlan prefix

We have two options:
-> Pod creation should not fail if NP is not able to attach probes; we wait for the reconcile request to retry attaching them. The downside is that we might run into the same return-packet-drop issue, since the pod will not have probes attached at the beginning.
-> Check if the veth prefix is "vlanxx" and, if so, don't make a call to NP to attach probes at pod creation (see the sketch after this list). We can update the docs that we don't support this today and work on it separately.
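
For illustration only, a rough fragment of what the second option could look like. hostVethName, npClient, ctx, and the reuse of the setupPodNetworkPolicy sketch above are all assumptions, not the actual CNI code; only the "vlan" prefix convention and the k8sArgs fields come from this conversation, and the fragment assumes the standard "strings" import.

// Hypothetical option 2: skip the NP add-path call for SGP strict-mode pods,
// whose host-side interface name carries the vlan prefix.
if strings.HasPrefix(hostVethName, "vlan") {
	log.Printf("Pod %s/%s uses a vlan interface (SGP); skipping probe attach at pod creation",
		string(k8sArgs.K8S_POD_NAMESPACE), string(k8sArgs.K8S_POD_NAME))
} else if err := setupPodNetworkPolicy(ctx, npClient, string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE)); err != nil {
	return err
}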

Contributor

We will have to fix the vlan prefix case since we try to attach in strict mode.

Contributor

We should have a check here for clusters where network policy is disabled and skip applying the default allow policy, since the gRPC endpoint won't be available on the NP agent.

Contributor Author

The NP agent gRPC endpoint is available even if NP is disabled; the enable-network-policy flag lives in the NP agent container and is not visible to the CNI. I removed returning an error when we cannot connect to the NP agent gRPC endpoint, because customers might have removed the NP agent container if they are not using network policies. If the NP agent container is present and its gRPC endpoint is unavailable due to some issue, the NP agent will be in a crash loop and we will catch the issue there.
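
A minimal sketch of the connection handling described here, assuming a standard google.golang.org/grpc dial with a blocking timeout; npAgentAddr and the surrounding function are placeholders, and the actual CNI wiring may differ.

// Sketch only: dial the NP agent with a short blocking timeout; if the
// endpoint is unreachable (for example the NP agent container was removed
// because network policies are unused), log and continue instead of failing
// pod setup. npAgentAddr is a placeholder address, not the real CNI config.
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
conn, err := grpc.DialContext(ctx, npAgentAddr,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithBlock())
if err != nil {
	log.Printf("NP agent gRPC endpoint %s not reachable, skipping network policy setup: %v", npAgentAddr, err)
	return nil
}
defer conn.Close()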

Member

--> Check if the veth prefix is "vlanxx" and, if so, don't make a call to NP to attach probes at pod creation. We can update the docs that we don't support this today and work on it separately.

--> We will have to fix the vlan prefix case since we try to attach in strict mode.

Did we verify and fix this? I don't see the corresponding change.


log.Debugf("Network Policy agent returned Success : %v", npr.Success)
// No need to cleanup IP and network, kubelet will send delete.
if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
		string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE), err, npr)
Contributor

This is default or strict mode, right?

Contributor Author

I meant the default policy, which is allow for standard mode and deny for strict mode.

jayanthvn previously approved these changes Feb 17, 2025

@jayanthvn (Contributor)

We can close this PR, right @Pavani-Panakanti?
