Describe the Bug
Observed Behavior:
On accounts with a large VPC (100k+ ENIs) and a large number of nodes with trunk ENIs (~2600 nodes), VPC RC takes more than 40 minutes to start up after a new leader is elected. During this time, VPC RC is unable to do any work, which causes anything from a significant slowdown in pod creation to pod creation failures.
Expected Behavior:
VPC RC start-up should take under a minute, or at worst a low single-digit number of minutes.
How to reproduce it (as minimally and precisely as possible):
Have a large VPC account with 100k+ ENIs and ~2000 nodes with trunk ENIs. Kill the current VPC RC leader and you should see the new leader take 40+ minutes to start up. Start-up time scales with the number of nodes: with 1k nodes it took around 15-20 minutes.
How we fixed it:
We increased the CPU to 3 cores and the memory to 3 GB (this is likely overkill; however, I haven't had the time to dial this in).
As you can see here, CPU and memory usage are relatively low and we have a low level of CPU throttling.
We increased the number of workers and the QPS against both the EC2 API and the K8s API.
These are likely also overkill; however, it's difficult to tune.
We also removed the paginated DescribeNetworkInterfaces call, thanks to the improvement in tag-based filtering that EC2 recently released for that API.
Lastly, we updated the node worker count to 100 from the original 10.
With these changes, start-up time dropped to roughly 1/9th of what it was, from 45+ minutes to 5 minutes on the same cluster.
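To illustrate why the node worker count matters so much at this scale, here is a minimal Go sketch of a bounded worker pool draining a queue of nodes. It is not the VPC RC implementation: `initNodes`, `estimateStartup`, and the simulated per-node delay are hypothetical names and values chosen for illustration, and the sketch ignores API rate limits and throttling entirely.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// initNodes runs initNode once per node ID using a fixed number of workers
// and returns how many nodes were processed. Hypothetical helper, not part
// of the VPC resource controller.
func initNodes(nodeIDs []int, workers int, initNode func(id int)) int {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan int)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls node IDs until the queue is closed.
			for id := range queue {
				initNode(id)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}
	for _, id := range nodeIDs {
		queue <- id
	}
	close(queue)
	wg.Wait()
	return done
}

// estimateStartup is an idealized estimate: nodes are split evenly across
// workers and every node costs the same perNode wall-clock time.
func estimateStartup(nodes, workers int, perNode time.Duration) time.Duration {
	if workers < 1 {
		workers = 1
	}
	batches := (nodes + workers - 1) / workers // ceiling division
	return time.Duration(batches) * perNode
}

func main() {
	nodes := make([]int, 200)
	for i := range nodes {
		nodes[i] = i
	}
	// Stand-in for the EC2 and K8s API calls made per node (assumed latency).
	simulated := func(int) { time.Sleep(5 * time.Millisecond) }

	for _, workers := range []int{10, 100} {
		start := time.Now()
		n := initNodes(nodes, workers, simulated)
		fmt.Printf("workers=%d nodes=%d elapsed=%v\n",
			workers, n, time.Since(start).Round(time.Millisecond))
	}
}
```

In this idealized model, going from 10 to 100 workers cuts wall-clock time by about 10x, which is in the same ballpark as the roughly 9x improvement reported above. In practice the gain also depends on the EC2 and K8s QPS limits, which is presumably why those had to be raised alongside the worker count.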
What we want to see