Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exponential backoff #54

Merged
merged 10 commits into from
Feb 24, 2025
Merged

Add exponential backoff #54

merged 10 commits into from
Feb 24, 2025

Conversation

sylane
Copy link
Contributor

@sylane sylane commented Feb 19, 2025

I don't really know how to add a test for this, any idea ?

@sylane sylane requested review from maehjam and matlaj February 19, 2025 08:21
@matlaj
Copy link
Member

matlaj commented Feb 19, 2025

Maybe by connecting to some non-existent server?

@sylane
Copy link
Contributor Author

sylane commented Feb 19, 2025

Maybe by connecting to some non-existent server?

I already kind of run this code when testing the bad protocol version, what I don't know is how to validate that it is actually doing exponential backoff...

@matlaj
Copy link
Member

matlaj commented Feb 19, 2025

Can you observe the state of the process, for example by calling sys:get_state/1?

@sylane
Copy link
Contributor Author

sylane commented Feb 19, 2025

Can you observe the state of the process, for example by calling sys:get_state/1?

I could, but I am wondering if this wouldn't be a bad idea. And to really test it properly I would have to try first without the server, kind of check it retries with a timing that kind of match expbackoff (it is random), then start the server, wait it connects, then shut down the server and check the delay reset... Seems like a lot of trouble...

Base automatically changed from sylane/add-proto-version to main February 20, 2025 10:45
@sylane sylane force-pushed the sylane/add-exp-backoff branch from 4bca5c8 to ab2ab04 Compare February 20, 2025 11:03
@ziopio ziopio self-requested a review February 21, 2025 08:23
rebar.lock Outdated
@@ -1,28 +1,28 @@
{"1.2.0",
[{<<"certifi">>,{pkg,<<"certifi">>,<<"2.13.0">>},0},
{<<"cowlib">>,{pkg,<<"cowlib">>,<<"2.13.0">>},2},
{<<"grisp">>,{pkg,<<"grisp">>,<<"2.7.0">>},0},
{<<"grisp">>,{pkg,<<"grisp">>,<<"2.8.0">>},0},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bump in grisp version should appear in the commit history

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ziopio I added it to the changeling, but do you want me to remove it from the PR altogether ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine just needs to appear in the commit history, once on main

Copy link
Member

@ziopio ziopio Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically what I am asking is just that the commit that increases the version tells that is doing it. Ideally, this should be a commit by itself.

@ziopio
Copy link
Member

ziopio commented Feb 21, 2025

@GwendalLaurent bsl removes the need of rounding the result of math:pow

@sylane
Copy link
Contributor Author

sylane commented Feb 21, 2025

@ziopio You are right, the RetryCount -1 is wrong, I didn't really validated it yet.

@ziopio ziopio self-requested a review February 24, 2025 08:18
Comment on lines 180 to 183
%% Calculate the connection delay with exponential backoff.
%% The delay is selected randomly between `1000' and
%% `2 ^ RETRY_COUNT - 1000' with a maximum value of `64000'.
Delay = 1000 + rand:uniform(min(64000, (1 bsl RetryCount) * 1000) - 1000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just add the link to the Amazon blog post and mention that we decided to go with the FullJitter algorithm. Like that anyone coming back to that code in a couple of weeks/months/years will know where the calculation comes from

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why first add 1000 and then subtract it again? With the subtraction one always has to double check that the calculation never can lead to a negative delay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

Delay = 1000 * rand:uniform(min(64, (1 bsl RetryCount))) 

It's less expensive for rand:uniform/1, since it's a smaller interval. About code readability: it's clearly a positive value. If you want to start at 1s instead of 2s (assuming RetryCount >= 1; I haven't checked the code), you can also do 1 bsl (RetryCount - 1).

Copy link
Contributor Author

@sylane sylane Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ir RetryCount is always > 0, (1 bsl RetryCount) * 1000 is always >= 2000` so - 1000 cannot be negative. The 1000 + random() is to have always a minimum delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want the value to be between 1000 and (2^RetryCount * 1000)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind the timer stuff. First one was of cause much slower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at this times: You need nearly half a second just to have a random number in steps of microseconds your way. That does not make sense at all.

YES it is bad. The whole point of exponential backoff, is to distribute the load of the client reconnecting to the server. The jitter done because without it the reconnection append in fixed time and overload the server every N seconds (see your own link for more details). Are we actually trying to optimise a call to random:uniform/1 happening at most every seconds on the grisp board while not connected at the cost of the goal of the exponential backoff itself ????

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new version of the formula looks good.

Copy link
Contributor Author

@sylane sylane Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point is to spread the delay over the whole time zone, if you do the random like this they will all be clustered at multiples of 1000, this wouldn't make sense.

Why? What is so bad of doing retries in steps of seconds?

BTW:

1> timer:tc(random, uniform, [64000]).
{407,28390}
2> timer:tc(random, uniform, [64]).
{3,47}

Not sure where you get that from. On the board:

Eshell V15.2.2 (press Ctrl+G to abort, type help(). for help)
(grisp_demo@wasp)1> timer:tc(random, uniform, [64000]).
{33,28390}
(grisp_demo@wasp)2> timer:tc(random, uniform, [64000]).
{31,46275}
(grisp_demo@wasp)3> timer:tc(random, uniform, [64000]).
{31,60533}
(grisp_demo@wasp)4> timer:tc(random, uniform, [64000]).
{32,32096}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said:

Never mind the timer stuff.

Copy link
Member

@peerst peerst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the comments and anywhere else there are no units for time delays mentioned. Is it ms, µs or something else? I recommend always being clear about units everywhere (including possibly variable name suffix and function suffix). but certainly in all values that have a unit in comments. Otherwise I have to assume its count of potatoes ;-)

@GwendalLaurent GwendalLaurent self-requested a review February 24, 2025 16:44
@sylane sylane force-pushed the sylane/add-exp-backoff branch from ba0f06f to ee158fc Compare February 24, 2025 16:49
@sylane sylane merged commit b4e28c8 into main Feb 24, 2025
1 check passed
@sylane sylane deleted the sylane/add-exp-backoff branch February 24, 2025 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants