VPC RC taking upwards of 40 minutes to start up in big account #451

Open
GnatorX opened this issue Aug 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments

Contributor

GnatorX commented Aug 5, 2024

Describe the Bug

Observed Behavior:
On accounts with a large VPC (100k+ ENIs) and a large number of nodes with trunk ENIs (~2600 nodes), VPC RC takes more than 40 minutes to start up after leader election picks a new leader. During this time, VPC RC is unable to do any work, which causes anything from significant slowdowns in pod creation to outright pod creation failures.

Screenshot 2024-08-05 at 1 57 48 PM

Expected Behavior:
VPC RC startup should take under a minute, or at worst low single-digit minutes.

How to reproduce it (as minimally and precisely as possible):
Have a large VPC account with 100k+ ENIs and ~2000 nodes with trunk ENIs. Kill the current VPC RC leader and you should see the new leader take 40+ minutes to start up. Startup time scales with the number of nodes: with 1k nodes it took around 15-20 minutes.
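As a rough consistency check on the numbers above, the 1k-node measurement can be extrapolated to the ~2600-node cluster. This is a minimal sketch that assumes startup time grows linearly with node count, which the issue suggests but does not prove; the function name is made up for illustration.

```python
def extrapolate_minutes(observed_nodes, observed_minutes, target_nodes):
    """Scale an observed startup time to another node count.

    Assumption: startup time is proportional to the number of nodes.
    """
    return observed_minutes * target_nodes / observed_nodes


# Reported in the issue: about 15-20 minutes with 1k nodes.
low = extrapolate_minutes(1000, 15, 2600)
high = extrapolate_minutes(1000, 20, 2600)
print(f"estimated startup at 2600 nodes: {low:.0f}-{high:.0f} minutes")
```

Under that assumption the estimate is 39-52 minutes, which is in line with the 40+ minutes observed on the large cluster.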

How we fixed it:
We increased the CPU to 3 cores and the memory to 3 GB (this is likely overkill; however, I haven't had time to dial it in).
As you can see here, CPU and memory usage are relatively low and there is only a low level of CPU throttling.

Screenshot 2024-08-05 at 1 47 32 PM

We increased the number of workers and the QPS against both the EC2 API and the Kubernetes API.
Screenshot 2024-07-31 at 6 21 25 PM
These are likely also overkill; however, they are difficult to tune.

We also removed the paginated DescribeNetworkInterfaces call, relying instead on the improved tag-based filtering that EC2 recently released for that API.
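The issue does not show the code change itself. One plausible shape, sketched here in Python rather than the controller's own language, is to push the tag match into the request's `Filters` so that EC2 returns only the relevant ENIs instead of the caller paging through every ENI in the VPC. The tag key and value, the helper names, and the fake client are all hypothetical; only the `describe_network_interfaces` call shape and the `vpc-id` and `tag:<key>` filter names follow the EC2 API as exposed by boto3.

```python
def build_eni_tag_filters(vpc_id, tag_key, tag_value):
    """Build server-side filters for DescribeNetworkInterfaces."""
    return [
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": f"tag:{tag_key}", "Values": [tag_value]},
    ]


def list_tagged_eni_ids(ec2_client, vpc_id, tag_key, tag_value):
    """Issue one filtered request and return the matching ENI IDs.

    ec2_client is any object with a describe_network_interfaces method,
    for example a boto3 EC2 client.
    """
    response = ec2_client.describe_network_interfaces(
        Filters=build_eni_tag_filters(vpc_id, tag_key, tag_value)
    )
    return [eni["NetworkInterfaceId"] for eni in response["NetworkInterfaces"]]


class FakeEC2Client:
    """Stand-in client so the sketch runs without AWS credentials."""

    def __init__(self, enis):
        self._enis = enis
        self.calls = 0

    def describe_network_interfaces(self, Filters):
        self.calls += 1
        wanted = {f["Name"]: set(f["Values"]) for f in Filters}
        matches = []
        for eni in self._enis:
            tags = {t["Key"]: t["Value"] for t in eni.get("TagSet", [])}
            ok = eni["VpcId"] in wanted.get("vpc-id", {eni["VpcId"]})
            for name, values in wanted.items():
                if name.startswith("tag:"):
                    ok = ok and tags.get(name[4:]) in values
            if ok:
                matches.append(eni)
        return {"NetworkInterfaces": matches}


# Hypothetical tag used only for this illustration.
TAG_KEY, TAG_VALUE = "example.com/managed-by", "vpc-rc"
fake = FakeEC2Client([
    {"NetworkInterfaceId": "eni-aaa", "VpcId": "vpc-1",
     "TagSet": [{"Key": TAG_KEY, "Value": TAG_VALUE}]},
    {"NetworkInterfaceId": "eni-bbb", "VpcId": "vpc-1", "TagSet": []},
    {"NetworkInterfaceId": "eni-ccc", "VpcId": "vpc-2",
     "TagSet": [{"Key": TAG_KEY, "Value": TAG_VALUE}]},
])
ids = list_tagged_eni_ids(fake, "vpc-1", TAG_KEY, TAG_VALUE)
print(ids, "requests made:", fake.calls)
```

With a real client, whether a single unpaginated request is safe depends on how many ENIs match the filter, so this is worth measuring before relying on it.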

Lastly, we increased the node worker count from the original 10 to 100.

With these changes, startup time dropped to roughly 1/9th of the original, from 45+ minutes to 5 minutes on the same cluster.
Screenshot 2024-08-05 at 2 06 13 PM
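To put the improvement in context, the measured result can be compared with what the node worker change alone would give in the ideal case. This is a back-of-the-envelope sketch that assumes startup is dominated by per-node work that parallelizes perfectly across node workers; the other changes (CPU, memory, QPS, the DescribeNetworkInterfaces call) are ignored, and the function name is made up for illustration.

```python
def ideal_startup_minutes(baseline_minutes, baseline_workers, new_workers):
    """Startup time if per-node work scaled perfectly with worker count.

    Assumption: no other bottleneck (API rate limits, CPU) gets in the way.
    """
    return baseline_minutes * baseline_workers / new_workers


# Reported in the issue: 45+ minutes with 10 node workers, 5 minutes with 100.
ideal = ideal_startup_minutes(45, 10, 100)
observed = 5
print(f"ideal: {ideal} min, observed: {observed} min, "
      f"parallel efficiency: {ideal / observed:.0%}")
```

Under that assumption the ideal is 4.5 minutes, so the observed 5 minutes is close to what a 10x increase in workers could deliver.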

What we want to see

  • We want these parameters to be exposed as configuration, rather than spread between the config loader and hard-coded values in the startup arguments.
  • Similar to "Improving VPC RC's behavior for large accounts" (#411), we need some configuration that allows VPC RC to behave differently when the user knows that the account and cluster are large.
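As an illustration of the first request, the tunables mentioned in this issue could be gathered into one validated configuration object that is populated from command-line flags. This is only a sketch of the idea in Python: the flag names, the `TuningConfig` type, and `parse_tuning` are hypothetical and are not existing VPC RC options. The only default taken from this issue is the original node worker count of 10; the QPS settings default to `None`, meaning "keep the built-in value".

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TuningConfig:
    """Hypothetical single home for the knobs discussed in this issue."""
    node_workers: int
    ec2_qps: Optional[float]  # None: keep the controller's built-in value
    k8s_qps: Optional[float]  # None: keep the controller's built-in value


def parse_tuning(argv):
    """Parse and validate the (hypothetical) tuning flags."""
    parser = argparse.ArgumentParser(prog="vpc-rc-tuning-sketch")
    parser.add_argument("--node-workers", type=int, default=10)
    parser.add_argument("--ec2-qps", type=float, default=None)
    parser.add_argument("--k8s-qps", type=float, default=None)
    args = parser.parse_args(argv)

    if args.node_workers < 1:
        raise ValueError("--node-workers must be at least 1")
    for flag, value in (("--ec2-qps", args.ec2_qps), ("--k8s-qps", args.k8s_qps)):
        if value is not None and value <= 0:
            raise ValueError(f"{flag} must be positive")

    return TuningConfig(args.node_workers, args.ec2_qps, args.k8s_qps)


# A large-account operator raises only the node worker count.
cfg = parse_tuning(["--node-workers", "100"])
print(cfg)
```

Keeping every knob in one object makes it straightforward to add a "large account" preset later, which is the direction the second request points in.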
GnatorX added the bug label on Aug 5, 2024