-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add exponential backoff #54
Conversation
Maybe by connecting to some non-existent server? |
I already kind of run this code when testing the bad protocol version, what I don't know is how to validate that it is actually doing exponential backoff... |
Can you observe the state of the process, for example by calling sys:get_state/1? |
I could, but I am wondering if this wouldn't be a bad idea. And to really test it properly I would have to try first without the server, kind of check it retries with a timing that kind of match expbackoff (it is random), then start the server, wait it connects, then shut down the server and check the delay reset... Seems like a lot of trouble... |
4bca5c8
to
ab2ab04
Compare
rebar.lock
Outdated
@@ -1,28 +1,28 @@ | |||
{"1.2.0", | |||
[{<<"certifi">>,{pkg,<<"certifi">>,<<"2.13.0">>},0}, | |||
{<<"cowlib">>,{pkg,<<"cowlib">>,<<"2.13.0">>},2}, | |||
{<<"grisp">>,{pkg,<<"grisp">>,<<"2.7.0">>},0}, | |||
{<<"grisp">>,{pkg,<<"grisp">>,<<"2.8.0">>},0}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bump in grisp version should appear in the commit history
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ziopio I added it to the changeling, but do you want me to remove it from the PR altogether ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine just needs to appear in the commit history, once on main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically what I am asking is just that the commit that increases the version tells that is doing it. Ideally, this should be a commit by itself.
@GwendalLaurent |
@ziopio You are right, the |
src/grisp_connect_client.erl
Outdated
%% Calculate the connection delay with exponential backoff. | ||
%% The delay is selected randomly between `1000' and | ||
%% `2 ^ RETRY_COUNT - 1000' with a maximum value of `64000'. | ||
Delay = 1000 + rand:uniform(min(64000, (1 bsl RetryCount) * 1000) - 1000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just add the link to the Amazon blog post and mention that we decided to go with the FullJitter algorithm. Like that anyone coming back to that code in a couple of weeks/months/years will know where the calculation comes from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why first add 1000 and then subtract it again? With the subtraction one always has to double check that the calculation never can lead to a negative delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
Delay = 1000 * rand:uniform(min(64, (1 bsl RetryCount)))
It's less expensive for rand:uniform/1
, since it's a smaller interval. About code readability: it's clearly a positive value. If you want to start at 1s instead of 2s (assuming RetryCount >= 1
; I haven't checked the code), you can also do 1 bsl (RetryCount - 1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ir RetryCount is always > 0, (1 bsl RetryCount) * 1000
is always >= 2000` so - 1000 cannot be negative. The 1000 + random() is to have always a minimum delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want the value to be between 1000 and (2^RetryCount * 1000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind the timer stuff. First one was of cause much slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So looking at this times: You need nearly half a second just to have a random number in steps of microseconds your way. That does not make sense at all.
YES it is bad. The whole point of exponential backoff, is to distribute the load of the client reconnecting to the server. The jitter done because without it the reconnection append in fixed time and overload the server every N seconds (see your own link for more details). Are we actually trying to optimise a call to random:uniform/1
happening at most every seconds on the grisp board while not connected at the cost of the goal of the exponential backoff itself ????
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new version of the formula looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point is to spread the delay over the whole time zone, if you do the random like this they will all be clustered at multiples of 1000, this wouldn't make sense.
Why? What is so bad of doing retries in steps of seconds?
BTW:
1> timer:tc(random, uniform, [64000]). {407,28390} 2> timer:tc(random, uniform, [64]). {3,47}
Not sure where you get that from. On the board:
Eshell V15.2.2 (press Ctrl+G to abort, type help(). for help)
(grisp_demo@wasp)1> timer:tc(random, uniform, [64000]).
{33,28390}
(grisp_demo@wasp)2> timer:tc(random, uniform, [64000]).
{31,46275}
(grisp_demo@wasp)3> timer:tc(random, uniform, [64000]).
{31,60533}
(grisp_demo@wasp)4> timer:tc(random, uniform, [64000]).
{32,32096}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said:
Never mind the timer stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the comments and anywhere else there are no units for time delays mentioned. Is it ms, µs or something else? I recommend always being clear about units everywhere (including possibly variable name suffix and function suffix). but certainly in all values that have a unit in comments. Otherwise I have to assume its count of potatoes ;-)
ba0f06f
to
ee158fc
Compare
I don't really know how to add a test for this, any idea ?