pillar/devicenetwork: fix goroutine leak #4409
I prefer just to believe it works... I tried and failed to understand the async and channel magic in the function. We use one channel as a semaphore, another for signalling an IP, and a third for checking whether to quit...
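For readers who also find the three channels hard to follow, here is a minimal sketch of the general pattern, not the code from dns.go. The names `firstResult`, `lookup`, `results` and `done` are hypothetical; only the roles (semaphore, result signalling, quit) come from the comment above.

```go
package main

import "sync"

// firstResult runs lookup for every source with at most maxParallel
// lookups in flight and returns the first non-empty answer.
// Hypothetical illustration only; it does not mirror dns.go line by line.
func firstResult(sources []string, maxParallel int, lookup func(string) string) string {
	work := make(chan struct{}, maxParallel) // semaphore: one slot per running lookup
	results := make(chan string)             // carries an answer to the caller
	quit := make(chan struct{})              // closed when the caller stops listening
	done := make(chan struct{})              // closed once every worker has finished

	var wg sync.WaitGroup
	wg.Add(len(sources))
	for _, src := range sources {
		go func(src string) {
			defer wg.Done()
			select {
			case work <- struct{}{}: // acquire a slot
			case <-quit:
				return
			}
			defer func() { <-work }() // release the slot
			answer := lookup(src)
			if answer == "" {
				return
			}
			select {
			case results <- answer:
			case <-quit: // nobody is reading any more, so do not block forever
			}
		}(src)
	}
	go func() {
		wg.Wait()
		close(done)
	}()
	defer func() {
		close(quit) // unblock every pending send or slot request
		<-done      // wait until all workers have exited
	}()

	select {
	case answer := <-results:
		return answer
	case <-done: // all workers finished without an answer
		return ""
	}
}
```

Closing `quit` in a deferred function means every return path releases the workers, and waiting on `done` afterwards ensures no goroutine outlives the call.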
@christoph-zededa, I see you moved the PR into the draft state. Any concerns there?
a4f6777 to efc0671
I heard you want more: https://github.com/lf-edge/eve/pull/4409/files#diff-73f9b01c33a31700345aa9b1f29289dfdb697a5593a9585cf68443155aa12e50R169
There were some more leaks ...
efc0671 to 841bd3c
pkg/pillar/devicenetwork/dns.go
Outdated
	go func() {
		wg.Wait()
		close(wgChan)
	}()

	defer close(quit)
Did you see this leak in the sigusr1 log file we investigated earlier?
I see it now.
There are goroutines stuck in line 164 waiting in wg.Wait(): because quit was not closed in all cases, not every wg.Done() was called.
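The failure mode described here can be reproduced in a few lines. This is a rough sketch under assumptions: `leaky`, `fixed` and `leaked` are hypothetical helpers, and the goroutine count from the standard library stands in for the leak checker used in the PR.

```go
package main

import (
	"runtime"
	"time"
)

// leaky returns early without closing quit, so the worker stays blocked
// on its send forever. Hypothetical code that only mirrors the bug.
func leaky(fail bool) {
	out := make(chan int)
	quit := make(chan struct{})
	go func() {
		select {
		case out <- 1:
		case <-quit:
		}
	}()
	if fail {
		return // bug: quit is never closed on this path
	}
	<-out
	close(quit)
}

// fixed closes quit on every return path via defer.
func fixed(fail bool) {
	out := make(chan int)
	quit := make(chan struct{})
	defer close(quit)
	go func() {
		select {
		case out <- 1:
		case <-quit:
		}
	}()
	if fail {
		return
	}
	<-out
}

// leaked reports how many extra goroutines are still alive after f returns.
// It polls for up to half a second to give workers time to exit.
func leaked(f func()) int {
	before := runtime.NumGoroutine()
	f()
	deadline := time.Now().Add(500 * time.Millisecond)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(5 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}
```

Calling `leaked(func() { leaky(true) })` should report one goroutine left behind, while the same call with `fixed` should report none. A goroutine count is a much coarser check than a dedicated leak detector, so treat it as a quick sanity test only.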
Ok, I see. 19170 occurrences.
😨😨😨
Can we use the same leak-detecting lib for the other components as well?
pkg/pillar/devicenetwork/dns.go
Outdated
@@ -166,12 +166,17 @@ func ResolveWithPortsLambda(domain string,

	wgChan := make(chan struct{})

	wgWaitDone := make(chan struct{})
Is it a fix for a leak or a prerequisite for a test? Are you sure it's in the right commit?
It is needed for the test: otherwise the following goroutine would still be running when the test has already finished, and the leak detector would flag it.
@christoph-zededa, if you find and fix more leaks, could you please drop here information about which versions are affected by them? We need this information to understand where to backport (or just inform users who use these versions). Also, it may make sense to make a commit per fix to make it easier to track the backports.
pkg/pillar/devicenetwork/dns.go
Outdated
	}()

	defer func() {
		close(quit)
		<-wgWaitDone
Why not just use <-wgChan here?
No reason - it is simply better to use <-wgChan here :-) Thx.
They were all introduced with 8573372.
Got it, thanks!
841bd3c to cb967ba
LGTM.
Did the added test logic detect the failure in the old code?
(Running that test on the old code would be a useful check.)
Yes, that's how I found all the additional problems; the original fix was only:
No idea what problems eden has, is it because of Halloween? 👻 Locally I could start eve and onboard it to the commercial controller.
All the tests show the same error:
Some glitch, I hope. I am restarting the tests.
Restarting the tests does not help =\
@christoph-zededa @OhmSpectator @eriknordmark I think we should review how frequently we invoke this functionality to update the DNS resolution for the Ethernet interfaces' IP addresses. In my opinion, the DNS names are unlikely to change often, probably only when the IP addresses of the interfaces themselves change. Therefore, we might not need to perform updates so frequently. We could also do that only when the IP address of an Ethernet interface changes. What are your thoughts?
It is already cached here: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/nim/resolvercache.go#L108 I guess we do not cache if resolving the name did not succeed and then it gets called again and again.
Yes, so calls are not that often
So, are we waiting for a new suggestion from @uncleDecart ?
IMO we don't have to wait; we can always create another PR.
I'd wait, since it's about this PR :D Edit: giving it a second thought, it might take me a bit to write a fix for it. I'm fine with both options; if we merge it, I'll create a separate PR and we can talk about it there.
	go func() {
		wg.Wait()
		close(wgChan)
	}()

	defer func() {
resolvedIPsChan and work channels are not closed at all. We could have a defer func to close them too.
They get gc-ed, but I added it explicitly.
	defer func() {
		wg.Done()
		<-work
	}()
I think that the following lines
	select {
	case work <- struct{}{}:
		// if writable, means less than dnsMaxParallelRequests goroutines are currently running
	}
could be simplified to
	work <- struct{}{}
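To make the equivalence concrete: a `select` with a single case and no `default` blocks exactly like the plain channel operation it wraps. The helper names below are hypothetical and only serve the comparison.

```go
package main

// acquireWithSelect blocks until a slot is free, using a single-case select.
func acquireWithSelect(work chan struct{}) {
	select {
	case work <- struct{}{}:
	}
}

// acquire does the same thing with a plain send.
func acquire(work chan struct{}) {
	work <- struct{}{}
}

// release frees one slot again.
func release(work chan struct{}) {
	<-work
}
```

With a buffered channel of capacity N used as a semaphore, both forms let at most N callers proceed and block the next one until `release` is called.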
done
@christoph-zededa how do you like this? I think that's easier to read. If it's good, I'll come up with some tests for it and we can include this solution.
@uncleDecart I agree it makes it easier to read. Thx.
Do you want me to create another PR, add a commit to your PR, or do you want to incorporate those changes yourself? I'm fine with any of these. The one thing I'm not sure about is the timeout: we can add a timeout context in the high-level function and pass it to resolveDNS, but then we have to deal with cancelling things. My concern, however, is whether or not we should have it in the first place. We usually have eventual consistency with no time constraints, i.e. it might take a while to resolve DNS, but we'll get there one way or another. Setting a timeout means that no matter what hardware or configuration we're running on, we should be able to resolve it within this particular timeframe. Can we guarantee that? Well, in theory yes, but then we have to remember this particular part of the system and we should create an Eden test for it, so that at least we will be able to catch a regression on virtualised hardware, which is not the slowest at handling things, but close.
Just create a PR into my PR.
That's a new feature and not a bug fix, isn't it?
So that you could review my PR while I review yours? Sure. I'll think about the tests and then open a PR
Agreed, I wouldn't go down that path, let's fix the bug
Mmm, guys, I would say, first of all, we need to fix the very specific bug in the scope of this PR. Time still matters in this case. Refactoring and readability improvements can be done in dedicated PRs.
cb967ba to 688d6bc
Then let's merge this PR @OhmSpectator and then I'll open mine
688d6bc to acff711
Trying to force the workflows to go further...
acff711 to 61ebeb7
Test failed, because it also needs a:
introduced in 8573372
without this fix, the goroutine is blocked by trying to send
into a channel that has no receiver
Signed-off-by: Christoph Ostarek <[email protected]>
use fancy uber leak checker to check for leaks which have been fixed in the commit before
Signed-off-by: Christoph Ostarek <[email protected]>
61ebeb7 to 6bf74f5
@milan-zededa If we know that, we should try to reproduce the issue by having DNS lookups fail for some of the names which we resolve (while still keeping the device able to talk to the controller).
LGTM - re-run tests
It is only used for the controller domain name.
I think we should merge now and backport into 13.4-stable, 12.0-stable and 11.0-stable...