VPC RC taking upwards of 40 minutes to start up in big account #451

GnatorX · 2024-08-05T21:12:04Z

Describe the Bug

Observed Behavior:
On accounts with a large VPC (100k+ ENIs) and a large number of nodes with trunk ENIs (~2600 nodes), VPC RC takes more than 40 minutes to start up after a new leader is elected. During this time, VPC RC is unable to do any work, causing anything from a significant slowdown in pod creation to pod creation failures.

Expected Behavior:
VPC RC start-up should take under a minute, or at worst low single-digit minutes.

How to reproduce it (as minimally and precisely as possible):
Use an account with a large VPC (100k+ ENIs) and ~2000 nodes with trunk ENIs. Kill the current VPC RC leader, and the new leader should take 40+ minutes to start up. The start-up time scales with the number of nodes: with 1k nodes it took around 15-20 minutes.

How we fixed it:
We increased the CPU to 3 cores and the memory to 3 GB (this is likely overkill, but I haven't had the time to dial it in).
With these settings, CPU and memory usage are relatively low and we see a low level of CPU throttling.

We increased the number of workers and the QPS against both the EC2 API and the Kubernetes API.

These values are likely also overkill; however, they are difficult to tune.

We also removed the paginated DescribeNetworkInterfaces call, which the tag-based filtering improvement EC2 recently released for that API made unnecessary.

Lastly, we raised the node worker count from the original 10 to 100.

With these changes, start-up time on the same cluster dropped to roughly 1/9th of the original, from 45+ minutes to 5 minutes.

What we want to see

We want these parameters to be exposed as configuration rather than spread across the config loader and/or hard-coded as part of the start-up arguments.
Similar to "Improving VPC RC's behavior for large accounts" (#411), we need configuration that allows VPC RC to behave differently when the user knows that the account and cluster are large.
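
As a rough sketch of what exposing these knobs could look like, the Go snippet below collects them in one struct and parses them from command-line flags with the standard `flag` package. Everything here is an assumption for illustration: `TuningConfig`, `parseTuning`, and the flag names `-node-workers`, `-ec2-qps`, and `-k8s-qps` do not exist in VPC RC today. The only value taken from this issue is the original node worker count of 10; a QPS of 0 is used to mean "keep the controller's built-in value", since the current defaults are not stated here.

```go
package main

import (
	"flag"
	"fmt"
)

// TuningConfig groups the knobs discussed in this issue.
// All field and flag names are hypothetical, not existing VPC RC options.
type TuningConfig struct {
	NodeWorkers int // concurrent node workers
	EC2QPS      int // 0 means "use the built-in value"
	K8sQPS      int // 0 means "use the built-in value"
}

// parseTuning reads the hypothetical tuning flags from args and validates them.
func parseTuning(args []string) (TuningConfig, error) {
	var c TuningConfig
	fs := flag.NewFlagSet("vpc-rc-tuning", flag.ContinueOnError)
	fs.IntVar(&c.NodeWorkers, "node-workers", 10, "number of concurrent node workers")
	fs.IntVar(&c.EC2QPS, "ec2-qps", 0, "QPS limit for EC2 API calls (0 = built-in value)")
	fs.IntVar(&c.K8sQPS, "k8s-qps", 0, "QPS limit for Kubernetes API calls (0 = built-in value)")
	if err := fs.Parse(args); err != nil {
		return TuningConfig{}, err
	}
	if c.NodeWorkers < 1 {
		return TuningConfig{}, fmt.Errorf("node-workers must be >= 1, got %d", c.NodeWorkers)
	}
	if c.EC2QPS < 0 || c.K8sQPS < 0 {
		return TuningConfig{}, fmt.Errorf("QPS values must not be negative")
	}
	return c, nil
}

func main() {
	// Example: a large-account profile that only overrides the worker count.
	cfg, err := parseTuning([]string{"-node-workers=100"})
	if err != nil {
		fmt.Println("invalid tuning flags:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the knobs in one place like this would let operators of large accounts override them together, instead of patching values scattered across the config loader and start-up arguments.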

GnatorX added the "bug" (Something isn't working) label on Aug 5, 2024
