Changes to attach probes at pod start #3188

Open
wants to merge 6 commits into base: master

Conversation

@Pavani-Panakanti (Contributor) commented Feb 3, 2025

What type of PR is this?
bug

Which issue does this PR fix?:
aws/aws-network-policy-agent#345

What does this PR do / Why do we need it?:

  • On every pod add, make a call to the NP agent to configure eBPF probes. If NP is enabled, the NP agent configures probes for the pod accordingly; if NP is disabled, it returns success. Strict mode already behaved this way; we are making the same change for standard mode to fix the GitHub issue above (see the sketch after this list).
  • On every pod delete, make a call to the NP agent to clean up eBPF probes. If NP is enabled, the NP agent cleans up the caches and probes associated with the pod; if not, it returns success. Pod delete will never throw an error from NP.
    NP agent PR to handle these calls - Fix standard mode return packet drop at pod startup aws-network-policy-agent#361
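
For illustration only, a minimal sketch of the add-path call described above. The npflow package, npBackend, npReply, and setupPodNetworkPolicy names are hypothetical stand-ins for the generated NP backend gRPC client and the CNI plumbing; only the EnforceNpToPod name and the err/Success handling pattern come from this PR.

package npflow

import (
	"context"
	"errors"
	"log"
)

// npReply is a hypothetical stand-in for the NP agent's gRPC response.
type npReply struct {
	Success bool
}

// npBackend is a hypothetical stand-in for the generated NP backend gRPC client.
type npBackend interface {
	// Called once per pod add; the agent attaches default probes, or returns
	// success immediately when network policy is disabled.
	EnforceNpToPod(ctx context.Context, podName, podNamespace string) (*npReply, error)
}

// setupPodNetworkPolicy mirrors the add-path handling quoted later in this PR:
// a gRPC error or a non-Success reply fails the pod add, so a pod never starts
// without its probes attached.
func setupPodNetworkPolicy(ctx context.Context, np npBackend, podName, podNamespace string) error {
	npr, err := np.EnforceNpToPod(ctx, podName, podNamespace)
	if err != nil || !npr.Success {
		log.Printf("Failed to setup default network policy for Pod %s/%s: GRPC returned - %v, agent returned - %+v",
			podNamespace, podName, err, npr)
		return errors.New("add cmd: failed to setup network policy")
	}
	return nil
}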

Testing done on this change:

  • Added unit tests
  • Manual tests of pod creation and deletion with NP disabled and NP enabled.

Set up a generic IPv4 cluster with 3 nodes.

EnforceNpToPod Testing - NP enabled
Created a pod and verified that the EnforceNpToPod call is made and probes are attached to the pod as expected in default allow. Verified pod connectivity in this case.
Created multiple pods (10) at once and verified that locking happens in NP as expected and probes are attached correctly.
Did a large scale-up and verified calls are made as expected.

EnforceNpToPod Testing - NP disabled
Created pods and verified that NP returns success immediately.

DeletePodNp Testing - NP enabled
Deleted a pod and verified that caches are cleaned up and BPF programs are deleted as expected.
Deleted multiple pods at once and verified that cleanup of caches and programs works as expected.
Verified locking on the NP side when multiple calls from the CNI to NP happen at the same time.
Did a large scale-down and verified the functionality.

DeletePodNp Testing - NP disabled
Deleted pods and verified that NP returns success immediately.

Tested the following upgrade/downgrade scenarios:
If a pod is created before upgrade and deleted after upgrade, we call DeletePodNp. If the pod has policies applied, probes are attached and we clean up the caches and programs. If the pod has no policies applied, we find no entries for the pod in NP's internal caches and simply return success.
If a pod is created after upgrade and then deleted after downgrade, no call is made from the CNI to NP after the downgrade. If the pod has policies applied, probes are cleaned up as part of the reconcile code flow (this is how it worked before these changes). If no policies are applied, pod info is left in the internal caches and the BPF program is not deleted; the probes are deleted anyway as part of pod delete, so this should not be a concern. Internal caches are reset on NP agent restart.

Tested NP enable -> disable and disable -> enable functionality as well.
Tested all of the above with an IPv6 cluster.

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Tested upgrades and downgrades; this change will not break them. The CNI and NP agent versions have to be compatible after this change.

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@orsenthil (Member)

--> Manual tests of pod creation and deletion with NP disabled and NP enabled.

Could you add the details on these to the PR description: the cluster setup, and how/what was verified?

K8S_POD_NAMESPACE: string(k8sArgs.K8S_POD_NAMESPACE),
})

// NP agent will never return an error if it is not able to delete ebpf probes
Member

This comment slightly offsets the code below.

  • Under what other conditions would err not be nil or npr not be Success?
  • We do seem to error out based on behavior from the NP agent on Del IP, which is a change from the previous behavior.

Contributor Author

Today in the NP agent we don't return an error on delete-probe or BPF program deletion failures (https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go). Probes attached to the veth are deleted anyway when the pod and its veth are deleted; the rest of the cleanup is to avoid memory leaks. Erroring out on that cleanup is a rare case (an eBPF SDK failure), so a very small memory leak is acceptable if cleanup fails. When the NP agent restarts (on any upgrade or aws-node restart), this memory leak goes away too.

--> Under what other conditions would err not be nil or npr not be Success?

The NP agent does not return any error for the DeletePodNp request; it logs failures and returns success. I just added the check here for any future changes. It seemed safe to have this check even though we don't throw any errors today.

--> We do seem to error out based on behavior from the NP agent on Del IP, which is a change from the previous behavior.

Even today we don't error out in NP when we are not able to clean up BPF programs (https://github.com/aws/aws-network-policy-agent/blob/main/pkg/ebpf/bpf_client.go). Other than the memory leak, this does not affect any behavior. Throwing an error for that and having kubelet retry the delete is not required.
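
For illustration, a minimal sketch of the delete-path contract described in this thread, reusing the hypothetical npReply shape from the earlier sketch; npDeleter and cleanupPodNetworkPolicy are assumptions, and only the DeletePodNp name and the log-and-continue behavior come from this discussion.

// npDeleter is a hypothetical stand-in for the delete side of the NP backend client.
type npDeleter interface {
	DeletePodNp(ctx context.Context, podName, podNamespace string) (*npReply, error)
}

// cleanupPodNetworkPolicy never blocks pod deletion: failures are logged only.
func cleanupPodNetworkPolicy(ctx context.Context, np npDeleter, podName, podNamespace string) {
	npr, err := np.DeletePodNp(ctx, podName, podNamespace)
	if err != nil || !npr.Success {
		// Defensive check only: today the agent logs its own cleanup failures
		// and still replies Success; any small leak is cleared when the NP
		// agent restarts.
		log.Printf("DeletePodNp for Pod %s/%s did not succeed (err=%v, reply=%+v); continuing with delete",
			podNamespace, podName, err, npr)
	}
}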

Member

Removing this error (41aa1a1) didn't require a mock change?
Hope we covered the rest.

@orsenthil (Member)

The testing done in the description looks good to me. Let's please clarify the behavior expected with cmdDel above.

orsenthil previously approved these changes Feb 6, 2025

@orsenthil (Member) left a comment

LGTM

if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
		string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE), err, npr)
	return errors.New("add cmd: failed to setup network policy")
Contributor Author

With the new changes I have, which always attach probes at pod start @jayanthvn @M00nF1sh:
-> pod creation will fail for SGP strict mode pods due to the vlan prefix

We have two options:
-> Pod creation should not fail if NP is not able to attach probes; we wait for the reconcile request to retry attaching them. The downside is that we might run into the same return-packet-drop issue, since the pod will not have probes attached at the beginning.
-> Check if the veth prefix is "vlanxx" and, if so, don't make a call to NP to attach probes at pod creation (see the sketch after this list). We can update the docs that we don't support this today and work on it separately.
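
For illustration only, a rough fragment of what the second option could look like. hostVethName, npClient, ctx, and the reuse of the setupPodNetworkPolicy sketch above are all assumptions, not the actual CNI code; only the "vlan" prefix convention and the k8sArgs fields come from this conversation, and the fragment assumes the standard "strings" import.

// Hypothetical option 2: skip the NP add-path call for SGP strict-mode pods,
// whose host-side interface name carries the vlan prefix.
if strings.HasPrefix(hostVethName, "vlan") {
	log.Printf("Pod %s/%s uses a vlan interface (SGP); skipping probe attach at pod creation",
		string(k8sArgs.K8S_POD_NAMESPACE), string(k8sArgs.K8S_POD_NAME))
} else if err := setupPodNetworkPolicy(ctx, npClient, string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE)); err != nil {
	return err
}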

Contributor

We will have to fix the vlan prefix case since we try to attach in strict mode.

Contributor

We should have a check here for clusters where network policy is disabled and skip applying the default allow policy, since the gRPC endpoint won't be available on the NP agent.

Contributor Author

The NP agent gRPC endpoint is available even if NP is disabled; the enable-network-policy flag lives in the NP agent container and is not visible to the CNI. I removed returning an error when we cannot connect to the NP agent gRPC endpoint, because customers might have removed the NP agent container if they are not using network policies. If the NP agent container is present and its gRPC endpoint is unavailable due to some issue, the NP agent will be in a crash loop and we will catch the issue there.
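
A minimal sketch of the connection handling described here, assuming a standard google.golang.org/grpc dial with a blocking timeout; npAgentAddr and the surrounding function are placeholders, and the actual CNI wiring may differ.

// Sketch only: dial the NP agent with a short blocking timeout; if the
// endpoint is unreachable (for example the NP agent container was removed
// because network policies are unused), log and continue instead of failing
// pod setup. npAgentAddr is a placeholder address, not the real CNI config.
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
conn, err := grpc.DialContext(ctx, npAgentAddr,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithBlock())
if err != nil {
	log.Printf("NP agent gRPC endpoint %s not reachable, skipping network policy setup: %v", npAgentAddr, err)
	return nil
}
defer conn.Close()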

Member

--> Check if the veth prefix is "vlanxx" and, if so, don't make a call to NP to attach probes at pod creation. We can update the docs that we don't support this today and work on it separately.

--> We will have to fix the vlan prefix case since we try to attach in strict mode.

Did we verify and fix this? I don't see the corresponding change.


log.Debugf("Network Policy agent returned Success : %v", npr.Success)
// No need to cleanup IP and network, kubelet will send delete.
if err != nil || !npr.Success {
	log.Errorf("Failed to setup default network policy for Pod Name %s and NameSpace %s: GRPC returned - %v Network policy agent returned - %v",
		string(k8sArgs.K8S_POD_NAME), string(k8sArgs.K8S_POD_NAMESPACE), err, npr)
Contributor

This is default or strict mode, right?

Contributor Author

I meant the default policy, which is allow for standard mode and deny for strict mode.

jayanthvn previously approved these changes Feb 17, 2025

@jayanthvn (Contributor)

We can close this PR, right @Pavani-Panakanti?
