
pillar/devicenetwork: fix goroutine leak #4409

Merged

Conversation

christoph-zededa
Contributor

introduced in 8573372

without this fix, the goroutine is blocked trying to send
into a channel that has no receiver
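For illustration, a minimal self-contained sketch of the leak pattern (hypothetical names, not the actual pillar code): a worker sends its result on an unbuffered channel, and once the caller has returned there is no receiver, so the goroutine blocks on the send forever.

package main

import "time"

// lookup starts a worker that sends one result on an unbuffered channel.
func lookup() chan int {
	results := make(chan int) // unbuffered: a send needs a ready receiver
	go func() {
		time.Sleep(time.Second) // stand-in for a slow DNS query
		results <- 42           // blocks forever if the caller gave up
	}()
	return results
}

func main() {
	ch := lookup()
	select {
	case <-ch:
		// got a result in time
	case <-time.After(100 * time.Millisecond):
		// caller times out and moves on; the worker goroutine leaks
	}
}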

@christoph-zededa christoph-zededa added bug Something isn't working stable Should be backported to stable release(s) go Pull requests that update Go code labels Oct 31, 2024
Member

@OhmSpectator OhmSpectator left a comment

I prefer just to believe it works... I tried and failed to understand the async and channel magic in the function. We use one channel as a semaphore, another for signalling an IP, and a third for checking whether to quit...
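For readers in the same position, here is a minimal sketch of the three-channel pattern described above; the names and structure are assumptions, and the real ResolveWithPortsLambda differs in detail.

package main

import (
	"fmt"
	"net"
	"sync"
)

// resolve stands in for a DNS query against one server.
func resolve(server string) []net.IP {
	return []net.IP{net.ParseIP("192.0.2.1")}
}

// resolveFirst returns the first answer from any server.
func resolveFirst(servers []string) []net.IP {
	work := make(chan struct{}, 2)     // semaphore: caps concurrent queries
	resolved := make(chan []net.IP, 1) // signals the first answer
	quit := make(chan struct{})        // closed to tell the workers to stop
	var wg sync.WaitGroup

	for _, server := range servers {
		work <- struct{}{} // acquire a slot
		wg.Add(1)
		go func(server string) {
			defer wg.Done()
			defer func() { <-work }() // release the slot
			select {
			case resolved <- resolve(server):
			case <-quit: // someone else already answered
			}
		}(server)
	}

	ips := <-resolved
	close(quit)
	wg.Wait()
	return ips
}

func main() {
	fmt.Println(resolveFirst([]string{"8.8.8.8", "1.1.1.1", "9.9.9.9"}))
}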

@christoph-zededa christoph-zededa marked this pull request as draft October 31, 2024 12:44
@OhmSpectator
Member

@christoph-zededa, I see you moved the PR into the draft state. Any concerns there?

@christoph-zededa
Contributor Author

I prefer just to believe it works... I tried and failed to understand the async and channel magic in the function. We use one channel as a semaphore, another for signalling an IP, and a third for checking whether to quit...

I heard you want more: https://github.com/lf-edge/eve/pull/4409/files#diff-73f9b01c33a31700345aa9b1f29289dfdb697a5593a9585cf68443155aa12e50R169

@christoph-zededa
Contributor Author

@christoph-zededa, I see you moved the PR into the draft state. Any concerns there?

There were some more leaks ...

go func() {
	wg.Wait()
	close(wgChan)
}()

defer close(quit)
Member

Did you see this leak in the sigusr1 log file we investigated earlier?

Contributor Author

I see it now.
There are goroutines stuck at line 164 waiting for wg.Wait(), and because quit was not closed in all cases, not all wg.Done() calls were made.
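A self-contained sketch of the failure mode just described, under assumed names: the workers only reach wg.Done() after quit is closed, so any return path that skips close(quit) leaves both them and the wg.Wait() goroutine stuck forever.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	quit := make(chan struct{})

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-quit // blocks until quit is closed
		}()
	}

	// The pattern from the fragment above: wg.Wait() is wrapped in a
	// goroutine that closes a channel, so completion can be observed in
	// a select together with other events.
	wgChan := make(chan struct{})
	go func() {
		wg.Wait()
		close(wgChan)
	}()

	close(quit) // comment this out and the select below times out

	select {
	case <-wgChan:
		fmt.Println("all workers finished")
	case <-time.After(time.Second):
		fmt.Println("leak: wg.Done() never called, wg.Wait() never returns")
	}
}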

Member

Ok, I see. 19170 occurrences.

@OhmSpectator
Member

I prefer just to believe it works... I tried and failed to understand the async and channel magic in the function. We use one channel as a semaphore, another for signalling an IP, and a third for checking whether to quit...

I heard you want more: https://github.com/lf-edge/eve/pull/4409/files#diff-73f9b01c33a31700345aa9b1f29289dfdb697a5593a9585cf68443155aa12e50R169

😨😨😨

Member

@OhmSpectator OhmSpectator left a comment

Can we use the same leak-detecting lib for the other components as well?
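The library in question is uber-go/goleak (the "fancy uber leak checker" mentioned in a commit message further down). A sketch of how it is typically wired in, which may differ from the exact usage in this PR:

package devicenetwork_test

import (
	"testing"

	"go.uber.org/goleak"
)

// Per-test check: fails the test if goroutines it started are still
// running when it returns.
func TestResolveNoLeak(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... exercise ResolveWithPortsLambda here ...
}

// Package-wide alternative: verify after every test in the package.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}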

@@ -166,12 +166,17 @@ func ResolveWithPortsLambda(domain string,

wgChan := make(chan struct{})

wgWaitDone := make(chan struct{})
Member

Is it a fix for a leak or a prerequisite for a test? Are you sure it's in the right commit?

Contributor Author

It is needed for the test: otherwise the following goroutine would still persist after the test had finished, and it would then get detected by the leak detector.

@OhmSpectator
Member

@christoph-zededa, if you find and fix more leaks, could you please drop here information about which versions are affected by them? We need this information to understand where to backport (or just inform users who use these versions). Also, it may make sense to make a commit per fix to make it easier to track the backports.

}()

defer func() {
	close(quit)
	<-wgWaitDone
Contributor

Why not just use <-wgChan here?

Contributor Author

@christoph-zededa christoph-zededa Oct 31, 2024

No reason - it is simply better to use <-wgChan here :-) Thx.

@christoph-zededa
Contributor Author

@christoph-zededa, if you find and fix more leaks, could you please drop here information about which versions are affected by them? We need this information to understand where to backport (or just inform users who use these versions). Also, it may make sense to make a commit per fix to make it easier to track the backports.

They were all introduced with 8573372.

@OhmSpectator
Member

@christoph-zededa, if you find and fix more leaks, could you please drop here information about which versions are affected by them? We need this information to understand where to backport (or just inform users who use these versions). Also, it may make sense to make a commit per fix to make it easier to track the backports.

They were all introduced with 8573372.

Got it, thanks!

Contributor

@eriknordmark eriknordmark left a comment

LGTM.

Did the added test logic detect the failure in the old code?
(Running that test on the old code would be a useful check.)

@christoph-zededa
Contributor Author

LGTM.

Did the added test logic detect the failure in the old code? (Running that test on the old code would be a useful check.)

Yes, that's how I found all the additional problems; the original fix was only:

diff --git a/pkg/pillar/devicenetwork/dns.go b/pkg/pillar/devicenetwork/dns.go
index 7f31eb11e..80f49be55 100644
--- a/pkg/pillar/devicenetwork/dns.go
+++ b/pkg/pillar/devicenetwork/dns.go
@@ -150,7 +150,10 @@ func ResolveWithPortsLambda(domain string,
                                                errs = append(errs, err)
                                        }
                                        if response != nil {
-                                               resolvedIPsChan <- response
+                                               select {
+                                               case <-resolvedIPsChan:
+                                               case resolvedIPsChan <- response:
+                                               }
                                        }
                                        <-work
                                        wg.Done()
@@ -183,6 +186,7 @@ func ResolveWithPortsLambda(domain string,
                }
                return responses, errs
        case ip := <-resolvedIPsChan:
+               close(resolvedIPsChan)
                close(quit)
                return ip, nil
        }
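One reading of this fix (an interpretation, not stated in the PR): instead of an unconditional send that blocks forever once the caller has returned, each worker either hands off its own response or drains one that another worker already delivered, so stragglers can unblock each other.

select {
case <-resolvedIPsChan: // drain a response another worker left behind
case resolvedIPsChan <- response: // or deliver our own
}

As the discussion below shows, a lone last worker can still block here on an unbuffered channel, which is why resolvedIPsChan later also gained a buffer of one.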

@christoph-zededa christoph-zededa marked this pull request as ready for review October 31, 2024 16:46
@christoph-zededa
Contributor Author

christoph-zededa commented Oct 31, 2024

No idea what problems eden has, is it because of Halloween? 👻

Locally I could start eve and onboard it to the commercial controller.

@OhmSpectator
Member

No idea what problems eden has, is it because of Halloween? 👻

Locally I could start eve and onboard it to the commercial controller.

All the tests show the same error:

time="2024-10-31T20:25:08Z" level=info msg="Adam waiting for EVE registration (14) of (20)"
time="2024-10-31T20:25:28Z" level=info msg="Adam waiting for EVE registration (15) of (20)"
time="2024-10-31T20:25:48Z" level=info msg="Adam waiting for EVE registration (16) of (20)"
time="2024-10-31T20:26:08Z" level=info msg="Adam waiting for EVE registration (17) of (20)"
time="2024-10-31T20:26:28Z" level=info msg="Adam waiting for EVE registration (18) of (20)"
time="2024-10-31T20:26:48Z" level=info msg="Adam waiting for EVE registration (19) of (20)"
time="2024-10-31T20:27:08Z" level=fatal msg="Eve onboard failed: error onboarding onboarding timeout. You may try to run 'eden eve onboard' command again in several minutes. If not successful see logs of adam/eve"

Some glitch, I hope. I am restarting the tests.

@OhmSpectator
Member

Restarting the tests does not help =\

@cperakis

cperakis commented Nov 1, 2024

@christoph-zededa @OhmSpectator @eriknordmark I think we should review how frequently we invoke this functionality to update the DNS resolution for the Ethernet interfaces' IP addresses.

In my opinion, the DNS names are unlikely to change often—probably only when the IP addresses of the interfaces themselves change. Therefore, we might not need to perform updates so frequently.

Also, we could do that only when the IP addresses of the Ethernet interfaces change.

What are your thoughts?

@christoph-zededa
Contributor Author

CC @milan-zededa

@christoph-zededa @OhmSpectator @eriknordmark I think we should review how frequently we invoke this functionality to update the DNS resolution for the Ethernet interfaces' IP addresses.

In my opinion, the DNS names are unlikely to change often—probably only when the IP addresses of the interfaces themselves change. Therefore, we might not need to perform updates so frequently.

Also, we could do that only when the IP addresses of the Ethernet interfaces change.

What are your thoughts?

It is already cached here: https://github.com/lf-edge/eve/blob/master/pkg/pillar/cmd/nim/resolvercache.go#L108

I guess we do not cache when resolving the name did not succeed, and then it gets called again and again.

@uncleDecart
Member

I guess we do not cache when resolving the name did not succeed, and then it gets called again and again.

Yes, so calls are not that often

@OhmSpectator
Member

So, are we waiting for a new suggestion from @uncleDecart ?

@christoph-zededa
Contributor Author

So, are we waiting for a new suggestion from @uncleDecart ?

IMO we don't have to wait; we can always create another PR.

@uncleDecart
Member

uncleDecart commented Nov 1, 2024

I'd wait, since it's about this PR :D

Edit: giving it a second thought, it might take me a bit to write a fix for it. I'm fine with both options; if we merge it, I'll create a separate PR and we can talk about it there.

go func() {
	wg.Wait()
	close(wgChan)
}()

defer func() {


The resolvedIPsChan and work channels are not closed at all. We could have a defer func to close them too.

Contributor Author

They get gc-ed, but I added it explicitly.

defer func() {
	wg.Done()
	<-work
}()


I think that the following lines

select {
case work <- struct{}{}:
	// if writable, means less than dnsMaxParallelRequests goroutines are currently running
}

could be simplified to

work <- struct{}{}
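That simplification is behavior-preserving: a select with a single communication case and no default branch blocks exactly like the bare channel operation, so the select wrapper adds nothing here.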

Contributor Author

done

@uncleDecart
Member

@christoph-zededa how do you like this? I think that's easier to read. If it's good, I'll come up with some tests for it and we can include this solution.

@christoph-zededa
Contributor Author

@christoph-zededa how do you like this? I think that's easier to read. If it's good, I'll come up with some tests for it and we can include this solution.

@uncleDecart I agree it makes it easier to read. Thx.

@uncleDecart
Member

Do you want me to create another PR, add a commit to your PR, or do you want to incorporate those changes yourself? I'm fine with any of these.

The one thing I'm not sure about is the timeout. We can add a timeout context in the high-level function and pass it to resolveDNS, but then we have to deal with cancelling things. My concern, however, is whether or not we should have it in the first place. We usually have eventual consistency with no time constraints, i.e. it might take a while to resolve DNS, but we'll get there one way or another. Setting a timeout means that no matter what hardware or configuration we're running this thing on, we should be able to resolve it within that particular timeframe. Can we guarantee that? Well, in theory yes, but then we have to remember about this particular part of the system, and we should create an Eden test for it, so that we can at least catch a regression on virtualised hardware (not the slowest to handle things, but close).
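A hypothetical sketch of the timeout idea being weighed here (not part of this PR; the names and the 30-second deadline are assumptions): the caller derives a context with a deadline, and cancellation propagates into the lookup.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveWithTimeout illustrates the proposal: derive a context with a
// deadline and give up cleanly when it expires.
func resolveWithTimeout(parent context.Context, domain string) ([]net.IP, error) {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()

	// net.Resolver takes a context, so cancellation propagates into the query.
	var r net.Resolver
	return r.LookupIP(ctx, "ip", domain) // returns context.DeadlineExceeded on timeout
}

func main() {
	ips, err := resolveWithTimeout(context.Background(), "example.com")
	fmt.Println(ips, err)
}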

@christoph-zededa
Contributor Author

Do you want me to create another PR, add a commit to your PR, or do you want to incorporate those changes yourself? I'm fine with any of these.

Just create a PR into my PR.

The one thing I'm not sure about is the timeout. We can add a timeout context in the high-level function and pass it to resolveDNS

That's a new feature and not a bug fix, isn't it?

@uncleDecart
Member

Just create a PR into my PR.

So that you could review my PR while I review yours? Sure. I'll think about the tests and then open a PR

That's a new feature and not a bug fix, isn't it?

Agreed, I wouldn't go down that path, let's fix the bug

@OhmSpectator
Member

Mmm, guys, I would say, first of all, we need to fix the very specific bug in the scope of this PR. Time still matters in this case. Refactoring and readability improvements we can do in dedicated PRs.

@uncleDecart
Member

Mmm, guys, I would say, first of all, we need to fix the very specific bug in the scope of this PR. Time still matters in this case. Refactoring and readability improvements we can do in dedicated PRs.

Then let's merge this PR @OhmSpectator and then I'll open mine

Member

@OhmSpectator OhmSpectator left a comment

Trying to force the workflows to go further...

@christoph-zededa
Contributor Author

The test failed because it also needs:

-       resolvedIPsChan := make(chan []DNSResponse)
+       resolvedIPsChan := make(chan []DNSResponse, 1)
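Why the buffer of one matters, as I read this thread: with an unbuffered channel, the select-based send in the worker can only complete while a receiver or a draining worker is present, so a lone last worker still blocks forever. With capacity one, its send lands in the buffer and the goroutine can exit.

resolvedIPsChan := make(chan []DNSResponse, 1) // buffered: the final send never blocks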

introduced in 8573372

without this fix, the goroutine is blocked trying to send
into a channel that has no receiver

Signed-off-by: Christoph Ostarek <[email protected]>
use fancy uber leak checker to check for leaks
which have been fixed in the commit before

Signed-off-by: Christoph Ostarek <[email protected]>
@eriknordmark
Contributor

I guess we do not cache when resolving the name did not succeed, and then it gets called again and again.

@milan-zededa
Do we know the names we try to resolve using this code? Is it only the controller domain name? Or also things used by diag (like www.google.com)?

If we know that, we should try to reproduce the issue by having DNS lookups fail for some of the names which we resolve (but still want the device to be able to talk to the controller).

Contributor

@eriknordmark eriknordmark left a comment

LGTM - re-run tests

@milan-zededa
Contributor

I guess we do not cache when resolving the name did not succeed, and then it gets called again and again.

@milan-zededa Do we know the names we try to resolve using this code? Is it only the controller domain name? Or also things used by diag (like www.google.com)?

If we know that, we should try to reproduce the issue by having DNS lookups fail for some of the names which we resolve (but still want the device to be able to talk to the controller).

It is only used for the controller domain name.
Maybe we could reproduce this issue by having multiple DNS servers published by the DHCP server, where one can resolve the controller name while the others fail (or possibly are not reachable/responsive).

Member

@OhmSpectator OhmSpectator left a comment

I think we should merge now and backport into 13.4-stable, 12.0-stable and 11.0-stable...

@OhmSpectator OhmSpectator merged commit 145e98e into lf-edge:master Nov 6, 2024
45 of 67 checks passed