Changes to attach probes at pod start #3206

Merged
merged 5 commits into aws:master from np_changes on Feb 18, 2025

Conversation

Contributor

@haouc haouc commented Feb 18, 2025

Note: This PR is updated from #3188

What type of PR is this?
bug

Which issue does this PR fix?:
aws/aws-network-policy-agent#345

What does this PR do / Why do we need it?:

On every pod add, the CNI makes a call to the NP agent to configure eBPF probes. If NP is enabled, the NP agent configures probes for the pod accordingly; if NP is disabled, the NP agent returns success. This was already the behavior for strict mode; we are making the same change for standard mode to fix the GitHub issue above.
On every pod delete, the CNI makes a call to the NP agent to clean up eBPF probes. If NP is enabled, the NP agent cleans up the caches and probes associated with the pod; if not, it returns success. Pod delete will not surface any errors from NP. (A rough sketch of this call pattern is included after the NP agent PR link below.)
NP agent PR to handle these calls: https://github.com/aws/aws-network-policy-agent/pull/361
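
To make the call pattern concrete, here is a minimal sketch, assuming a gRPC client generated from the NP agent's RPC definitions. The `podInfo` struct, the `npClient` interface, and the 5-second timeout are illustrative assumptions; only the method names (EnforceNpToPod, DeletePodNp) come from this PR, and the real generated signatures differ.

```go
// Hypothetical sketch of the flow described above. The npClient interface and
// podInfo struct stand in for the gRPC client generated from the NP agent's
// RPC definitions; the method names mirror the calls named in this PR
// (EnforceNpToPod, DeletePodNp), but signatures and fields are simplified
// assumptions, not the actual generated API.
package npsketch

import (
	"context"
	"fmt"
	"log"
	"time"
)

// podInfo is a simplified stand-in for the pod identifiers the CNI already
// has at ADD/DEL time.
type podInfo struct {
	Name      string
	Namespace string
}

// npClient is an assumed interface over the NP agent's gRPC service.
type npClient interface {
	EnforceNpToPod(ctx context.Context, pod podInfo) error
	DeletePodNp(ctx context.Context, pod podInfo) error
}

// onPodAdd is called on every pod add: the NP agent decides internally whether
// to attach probes (NP enabled) or return success immediately (NP disabled).
func onPodAdd(ctx context.Context, np npClient, pod podInfo) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := np.EnforceNpToPod(ctx, pod); err != nil {
		return fmt.Errorf("configuring eBPF probes for %s/%s: %w", pod.Namespace, pod.Name, err)
	}
	return nil
}

// onPodDel is called on every pod delete: the NP agent cleans up caches and
// probes if NP is enabled; errors are logged rather than failing the delete.
func onPodDel(ctx context.Context, np npClient, pod podInfo) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	if err := np.DeletePodNp(ctx, pod); err != nil {
		log.Printf("warning: DeletePodNp for %s/%s failed: %v", pod.Namespace, pod.Name, err)
	}
}
```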

Testing done on this change:

Added unit tests
Manual tests of pod creation and deletion with NP disabled and NP enabled.

Set up a generic IPv4 cluster with 3 nodes

EnforceNpToPod Testing - NP enabled
Created a pod and verified the EnforceNpToPod calls are made and probes are attached to the pod as expected in default allow. Verified pod connectivity in this case
Created multiple pods (10) at once and verified locking happens in NP as expected and probes are attached correctly
Did a large scale-up and verified calls are made as expected

EnforceNpToPod Testing - NP disabled
Created pods and verified that NP is returning success immediately

DeletePodNp Testing - NP enabled
Deleted a pod and verified caches are cleaned up and BPF programs are deleted as expected
Deleted multiple pods at once and verified cleanup of caches and programs works as expected
Verified locking on the NP side when multiple calls from the CNI to NP happen at the same time
Did a large scale-down and verified the functionality

DeletePodNp Testing - NP disabled
Deleted pods and verified that NP is returning success immediately

Tested the upgrade/downgrade functionality below:
If a pod is created before the upgrade and deleted after the upgrade, we will call DeletePodNp. If the pod has policies applied, probes will be attached and we will clean up the caches and programs. If the pod has no policies applied, we will not find any entries related to this pod in NP's internal caches and will just return success.
If a pod is created after the upgrade and then deleted after a downgrade, no call will be made from the CNI to the NP agent after the downgrade. If the pod has policies applied, probes will be cleaned up as part of the reconcile code flow (this is how it worked before these changes). If no policies are applied, pod info will be left in the internal caches and the BPF program will not be deleted; the probes themselves are deleted as part of pod delete anyway, so this is not a concern. Internal caches are reset on NP agent restart.

Tested NP enable -> disable and disable -> enable functionality as well
Tested all of the above with an IPv6 cluster

Integration tests successful - https://github.com/aws/amazon-vpc-cni-k8s/actions/runs/13122307097

Will this PR introduce any new dependencies?:
No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
Tested upgrades and downgrades. This change will not break any upgrades or downgrades. The CNI and NP agent versions have to be compatible after this change

Does this change require updates to the CNI daemonset config files to work?:
No

Does this PR introduce any user-facing change?:
No


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haouc haouc requested a review from a team as a code owner February 18, 2025 20:18
@haouc haouc force-pushed the np_changes branch 2 times, most recently from fa711f8 to b467bfd on February 18, 2025 at 21:28
jayanthvn previously approved these changes Feb 18, 2025
Contributor

@jayanthvn jayanthvn left a comment


lgtm, this is the same PR as - #3188

add new lines to satisfy format check

update unit tests for DialContext
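
One of the commits above updates unit tests for DialContext. As a rough illustration of the general technique (assumed here, not taken from the PR's actual tests), a gRPC DialContext call can be exercised against an in-memory bufconn listener instead of a real NP agent endpoint:

```go
package npsketch_test

import (
	"context"
	"net"
	"testing"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/test/bufconn"
)

func TestDialContextOverBufconn(t *testing.T) {
	// In-memory listener standing in for the NP agent's local endpoint.
	lis := bufconn.Listen(1024 * 1024)
	srv := grpc.NewServer()
	go srv.Serve(lis)
	defer srv.Stop()

	// Route the client dial through the bufconn pipe instead of a real socket.
	dialer := func(ctx context.Context, _ string) (net.Conn, error) {
		return lis.Dial()
	}
	conn, err := grpc.DialContext(context.Background(), "bufnet",
		grpc.WithContextDialer(dialer),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatalf("DialContext failed: %v", err)
	}
	defer conn.Close()
	// A real test would build the generated NP client from conn and assert on
	// EnforceNpToPod / DeletePodNp behavior against a fake server.
}
```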
@haouc haouc merged commit af32e99 into aws:master Feb 18, 2025
5 of 6 checks passed
@janavenkat

janavenkat commented Mar 6, 2025

@haouc I still have this issue even after updating to version 1.19.3 (my k8s version is 1.31); it is already explained in #3203.

I can reproduce:

  1. A script that connects to RabbitMQ and other sites.
  2. I've created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
  3. However, I'm experiencing connection errors as described in the issue.

@m00lecule

m00lecule commented Mar 6, 2025

> @haouc I still have this issue even after updating to version 1.19.3 (my k8s version is 1.31); it is already explained in #3203.
>
> I can reproduce:
>
> 1. A script that connects to RabbitMQ and other sites.
> 2. I've created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
> 3. However, I'm experiencing connection errors as described in the issue.

+1, upgrading to 1.19.3 didn't resolve connectivity issues during pod startup

@Pavani-Panakanti
Contributor

@janavenkat @m00lecule Can you let us know the flow in which you are seeing the issue, or repro steps that would help us reproduce it? Could you also share the logs of the issue to this email: "[email protected]"? Thanks

@janavenkat

> @janavenkat @m00lecule Can you let us know the flow in which you are seeing the issue, or repro steps that would help us reproduce it? Could you also share the logs of the issue to this email: "[email protected]"? Thanks

Steps to reproduce:

  1. A script that connects to RabbitMQ and other sites.
  2. I have created a cron job that creates multiple pods (40), which also creates new nodes, running every 5 minutes.
  3. However, I am experiencing connection errors as described in the issue.

@janavenkat

@Pavani-Panakanti sample logs are in #3203
