Changes to attach probes at pod start #3206

Merged
merged 5 commits into aws:master from np_changes on Feb 18, 2025

Conversation

Contributor

@haouc haouc commented Feb 18, 2025

Note: This PR is updated from #3188

What type of PR is this?
bug

Which issue does this PR fix?:
aws/aws-network-policy-agent#345

What does this PR do / Why do we need it?:

On every pod add, the CNI makes a call to the NP agent to configure eBPF probes. If NP is enabled, the NP agent configures probes for the pod accordingly; if NP is disabled, the NP agent returns success. This was already the behavior for strict mode; we are making the same change for standard mode to fix the GitHub issue above.
On every pod delete, the CNI makes a call to the NP agent to clean up eBPF probes. If NP is enabled, the NP agent cleans up the caches and probes associated with the pod; if not, it returns success. Pod delete will not surface any errors from NP. (A rough sketch of this call pattern is included after the NP agent PR link below.)
NP agent PR to handle these calls: https://github.com/aws/aws-network-policy-agent/pull/361
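
To make the call pattern concrete, here is a minimal sketch, assuming a gRPC client generated from the NP agent's RPC definitions. The `podInfo` struct, the `npClient` interface, and the 5-second timeout are illustrative assumptions; only the method names (EnforceNpToPod, DeletePodNp) come from this PR, and the real generated signatures differ.

```go
// Hypothetical sketch of the flow described above. The npClient interface and
// podInfo struct stand in for the gRPC client generated from the NP agent's
// RPC definitions; the method names mirror the calls named in this PR
// (EnforceNpToPod, DeletePodNp), but signatures and fields are simplified
// assumptions, not the actual generated API.
package npsketch

import (
	"context"
	"fmt"
	"log"
	"time"
)

// podInfo is a simplified stand-in for the pod identifiers the CNI already
// has at ADD/DEL time.
type podInfo struct {
	Name      string
	Namespace string
}

// npClient is an assumed interface over the NP agent's gRPC service.
type npClient interface {
	EnforceNpToPod(ctx context.Context, pod podInfo) error
	DeletePodNp(ctx context.Context, pod podInfo) error
}

// onPodAdd is called on every pod add: the NP agent decides internally whether
// to attach probes (NP enabled) or return success immediately (NP disabled).
func onPodAdd(ctx context.Context, np npClient, pod podInfo) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := np.EnforceNpToPod(ctx, pod); err != nil {
		return fmt.Errorf("configuring eBPF probes for %s/%s: %w", pod.Namespace, pod.Name, err)
	}
	return nil
}

// onPodDel is called on every pod delete: the NP agent cleans up caches and
// probes if NP is enabled; errors are logged rather than failing the delete.
func onPodDel(ctx context.Context, np npClient, pod podInfo) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := np.DeletePodNp(ctx, pod); err != nil {
		log.Printf("warning: DeletePodNp for %s/%s failed: %v", pod.Namespace, pod.Name, err)
	}
}
```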

Testing done on this change:

Added unit tests
Manual tests of pod creation and deletion with NP disabled and NP enabled.

Set up a generic IPv4 cluster with 3 nodes

EnforceNpToPod Testing - NP enabled
Created a pod and verified the EnforceNpToPod calls are made and probes are attached to the pod as expected in default allow. Verified pod connectivity in this case
Created multiple pods (10) at once and verified locking happens in NP as expected and probes are attached correctly
Did a large scale-up and verified calls are made as expected

EnforceNpToPod Testing - NP disabled
Created pods and verified that NP is returning success immediately

DeletePodNp Testing - NP enabled
Deleted a pod and verified caches are cleaned up and BPF programs are deleted as expected
Deleted multiple pods at once and verified cleanup of caches and programs works as expected
Verified locking on the NP side when multiple calls from the CNI to NP happen at the same time
Did a large scale-down and verified the functionality

DeletePodNp Testing - NP disabled
Deleted pods and verified that NP is returning success immediately

Tested the upgrade/downgrade functionality below:
If a pod is created before the upgrade and deleted after the upgrade, we will call DeletePodNp. If the pod has policies applied, probes will be attached and we will clean up the caches and programs. If the pod has no policies applied, we will not find any entries related to this pod in NP's internal caches and will just return success.
If a pod is created after the upgrade and then deleted after a downgrade, no call will be made from the CNI to the NP agent after the downgrade. If the pod has policies applied, probes will be cleaned up as part of the reconcile code flow (this is how it worked before these changes). If no policies are applied, pod info will be left in the internal caches and the BPF program will not be deleted; the probes themselves are deleted as part of pod delete anyway, so this is not a concern. Internal caches are reset on NP agent restart.

Tested NP enable -> disable and disable -> enable functionality as well
Tested all of the above with an IPv6 cluster

Integration tests successful - https://github.com/aws/amazon-vpc-cni-k8s/actions/runs/13122307097

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Tested upgrades and downgrades. This change will not break any upgrades or downgrades. The CNI and NP agent versions have to be compatible after this change

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haouc haouc requested a review from a team as a code owner February 18, 2025 20:18
@haouc haouc force-pushed the np_changes branch 2 times, most recently from fa711f8 to b467bfd on February 18, 2025 at 21:28
jayanthvn previously approved these changes Feb 18, 2025
Contributor

@jayanthvn jayanthvn left a comment


lgtm, this is the same PR as - #3188

add new lines to satisfy format check

update unit tests for DialContext
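
One of the commits above updates unit tests for DialContext. As a rough illustration of the general technique (assumed here, not taken from the PR's actual tests), a gRPC DialContext call can be exercised against an in-memory bufconn listener instead of a real NP agent endpoint:

```go
package npsketch_test

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/test/bufconn"
)

func TestDialContextOverBufconn(t *testing.T) {
	// In-memory listener standing in for the NP agent's local endpoint.
	lis := bufconn.Listen(1024 * 1024)
	srv := grpc.NewServer()
	go srv.Serve(lis)
	defer srv.Stop()

	// Route the client dial through the bufconn pipe instead of a real socket.
	dialer := func(ctx context.Context, _ string) (net.Conn, error) {
		return lis.Dial()
	}
	conn, err := grpc.DialContext(context.Background(), "bufnet",
		grpc.WithContextDialer(dialer),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatalf("DialContext failed: %v", err)
	}
	defer conn.Close()
	// A real test would build the generated NP client from conn and assert on
	// EnforceNpToPod / DeletePodNp behavior against a fake server.
}
```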
@haouc haouc merged commit af32e99 into aws:master Feb 18, 2025
5 of 6 checks passed
@janavenkat

janavenkat commented Mar 6, 2025

@haouc I still have this issue even after updating to version 1.19.3 (my k8s version is 1.31); it is already explained in #3203.

I can reproduce:

  1. A script that connects to RabbitMQ and other sites.
  2. I've created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
  3. However, I'm experiencing connection errors as described in the issue.

@m00lecule

m00lecule commented Mar 6, 2025

> @haouc I still have this issue even after updating to version 1.19.3 (my k8s version is 1.31); it is already explained in #3203.
>
> I can reproduce:
>
> 1. A script that connects to RabbitMQ and other sites.
> 2. I've created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
> 3. However, I'm experiencing connection errors as described in the issue.

+1, upgrading to 1.19.3 didn't resolve connectivity issues during pod startup

@Pavani-Panakanti
Contributor

@janavenkat @m00lecule Can you let us know the flow in which you are seeing the issue, or repro steps that would help us reproduce it? Could you also share the logs of the issue to this email: "[email protected]"? Thanks

@janavenkat

> @janavenkat @m00lecule Can you let us know the flow in which you are seeing the issue, or repro steps that would help us reproduce it? Could you also share the logs of the issue to this email: "[email protected]"? Thanks

Steps to reproduce:

  1. A script that connects to RabbitMQ and other sites.
  2. I have created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
  3. However, I am experiencing connection errors as described in the issue.

@janavenkat

@Pavani-Panakanti sample logs are in #3203
