Fix race condition crash by removing unnecessary retain/release #75
+2 −11
### What did you change and why?

Removed an unnecessary retain/release in `NTPConnection`, which was crashing after being released too many times in some race conditions. The retain/release appears to date back to when `HostResolver` didn't retain itself for resolving connections when the initial attempt timed out; from what I can tell, it was just added here to be safe. @msanders, if you're still active and somehow remember, feel free to correct me if I'm wrong!

### Potential risks introduced?

- If `NTPConnection` is used elsewhere in your app, it's possible that a socket callback could cause a crash, since the socket no longer retains the connection.
- If `query(addresses: [SocketAddress], host: String)` is called while `connections` is not cleared, that could cause the same crash as above. I cannot find a case where this happens, and it's in a private extension.

### What tests were performed (include steps)?
### Checklist

### Additional Information
**Why don't we need to increment the retain count of `NTPConnection` for the socket's callback?**

`NTPConnection`s are used by `NTPClient`, and `NTPClient` is used by the singleton `TrueTimeClient`. `TrueTimeClient` maintains a strong reference to `NTPClient`, and `NTPClient` maintains a strong reference to `[NTPConnection]`. When removing the connections, each connection is synchronously closed (which removes it from the socket callback), then the array is cleared. At no point is it necessary to increment the retain count from the sockets, as long as we're guaranteed that `close` will be called before we release our strong reference, which it is.

#### History of this bug
The retain/release was originally added alongside `HostResolver`; at the time, the `NTPConnection` was not getting released on a timeout, only on success.

### How do you reproduce the new crash?
1. Create a `TrueTimeClient` instance and call `fetchIfNeeded`.
2. Set the timeout for the `NTPConnection` objects to be really small (ex: 0.0001) in `NTPConnection.swift` at line 28. If you felt more ambitious, you could fire up the Network Link Conditioner and set it to 8 seconds or more.
3. Force the `NTPConnection` to complete after `release()` is called. I just added `Thread.sleep(forTimeInterval: 0.5)` on line 207 so that `timeoutError` is hit. See `NTPConnection.swift` if you want to try for yourself.
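The steps above can be modeled in isolation. This is a toy sketch (not TrueTime code): a manual counter stands in for the extra retain that `Unmanaged.passRetained` added, and the enqueued callback plays the role of the delayed socket completion. All names here are illustrative assumptions.

```swift
import Dispatch

// Toy model of the double-release race: one retain, two racing releases.
final class ToyConnection {
    var retainCount = 1          // the +1 from passRetained
    var overReleased = false

    func release() {
        retainCount -= 1
        if retainCount < 0 { overReleased = true }  // would crash under ARC
    }
}

let connection = ToyConnection()
let callbackQueue = DispatchQueue(label: "toy.callback.queue")
let done = DispatchSemaphore(value: 0)

// The timeout path fires first and releases...
connection.release()

// ...but the socket's data callback was already enqueued and still runs,
// releasing a second time.
callbackQueue.async {
    connection.release()
    done.signal()
}
done.wait()

print(connection.overReleased)  // true
```

With ARC doing the real counting, that second release is the over-release that crashes.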
### Okay but why does this happen?

There are actually two stack traces: one in `NTPClient` and one in `NTPConnection`. The latter is the more prevalent one by about 10x, so we'll start with it. Note: this is probably pretty hard to follow, and I'm not 100% sure about the correctness. All of the logic below is made moot by this PR, though, and these race conditions no longer exist.

Assume that we are running on a slow internet connection where the roundtrip time of the requests is about 8 seconds, conveniently the same as the default timeout :P
1. The `NTPConnection`s are created. The socket takes a reference to each connection via `passRetained`. The socket completion callback (`dataCallback`) is then enqueued to be executed on the main thread.
2. The timeout timer fires (`timeoutError`), and `complete(.failure(error))` is called. This timer callback happens on the lock queue (instacart ntp connection queue).
3. `complete()` calls `close()`, but notice that in the implementation of `close`, all of the code is executed asynchronously on the `lockQueue`. Calling `lockQueue.async` when you're already on `lockQueue` will append that code to the end of the current run loop. This is important because it means that the socket callback isn't actually removed yet.
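The `lockQueue.async` behavior just described can be demonstrated in isolation. This is a standalone sketch, not TrueTime code (the queue label and block names are made up): on a serial queue, `async`-ing from a block already running on that queue defers the new block until the current one finishes, so a `close()` scheduled this way has not run yet when `complete()` moves on.

```swift
import Dispatch

let lockQueue = DispatchQueue(label: "demo.lock.queue")  // serial queue
var order = [String]()
let finished = DispatchSemaphore(value: 0)

lockQueue.async {
    order.append("complete() begins")
    lockQueue.async {
        // Models close(): enqueued behind the current block, so the
        // socket callback would still be registered at this point.
        order.append("close() body runs")
        finished.signal()
    }
    order.append("complete() continues past close()")
}

finished.wait()
print(order)
// ["complete() begins", "complete() continues past close()", "close() body runs"]
```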
4. Back in `complete()`, this is specifically a timeout error, so `onComplete` is called on the `callbackQueue` (instacart ntp client queue; this is a different thread).
5. `release()` is called if `callbackPending` is `true` (which it is here).
6. Meanwhile, the `NTPConnection` completion handler is still active, because it's been appended to the end of the current run loop. It fires (`dataCallback`).
7. `release()` is then called again. Not good. We've now accidentally released this twice.
8. Over in `NTPClient`, the closure passed as `onComplete` on line 35 still captures the connection. Since there's both the reference to `connections` in `NTPClient` and the `connection` here, the ref count for the `NTPConnection` is still one. We can call `progress(connection, result)` without any issue.
9. Once `throttleConnections` is called and the extra reference held by that closure is released, `$0.canRetry` now causes a crash because the reference count is 0.

I'm less sure about the `NTPClient` crash. I think it might be the same as above, except that in that race condition the `NTPConnection` has managed to be released twice (maybe if it was also cancelled?) and the reference count is `0` for the `progress` callback. We'd expect it to crash then on line 186 of `NTPClient`.

**tl;dr**: we have several different threads with several connections in flight at the same time, each enqueueing different work, with logic that retains and then releases at different times depending on state (and potentially on different threads). This is both hard to follow and leads to potential race conditions. I do not see a reason for an extra retain and release here at all, since `connections` are retained by `NTPClient` and only released after synchronously closing each connection and removing the callback.
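The invariant in the tl;dr — the client is the sole strong owner, and each connection is closed synchronously before the strong references are dropped — can be sketched with simplified stand-in types (`Client`/`Connection` here are illustrative, not the real `NTPClient`/`NTPConnection`):

```swift
import Dispatch

final class Connection {
    private let lockQueue = DispatchQueue(label: "connection.lock")
    private(set) var callbackRemoved = false

    func close() {
        // Synchronous teardown: once close() returns, the socket can no
        // longer call back into this object.
        lockQueue.sync { callbackRemoved = true }
    }
}

final class Client {
    private(set) var connections: [Connection] = []  // sole strong owner

    func start(count: Int) {
        connections = (0..<count).map { _ in Connection() }
    }

    func finish() {
        // Close every connection *before* releasing the only strong
        // references to them, so no socket-side retain is ever needed.
        connections.forEach { $0.close() }
        connections.removeAll()
    }
}

let client = Client()
client.start(count: 3)
let probe = client.connections[0]
client.finish()
print(probe.callbackRemoved, client.connections.isEmpty)  // true true
```

As long as `finish()` upholds the close-then-release ordering, no callback can fire against a deallocated connection, which is exactly why the extra retain/release was removable.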