
Add driver killing capabilities #297

Closed
r0mainK opened this issue Jun 28, 2019 · 8 comments · Fixed by #306

r0mainK commented Jun 28, 2019

I have run into a problem when trying to parse large numbers of files (~5k). After parsing many files, the C++ driver stops responding completely, and the Parse request that triggered the hang, along with all subsequent requests, times out.

After some testing, I realized that killing the running driver instance and relaunching it resolves the problem, so I would like:

  1. This to happen automatically whenever there is a user or parse timeout, possibly controlled by a flag when launching the Docker image or by an environment variable.
  2. To be able to kill specific instances through bblfshctl instances, for example with bblfshctl instances --kill cpp or even bblfshctl instances --kill-all.
creachadair (Contributor) commented

Having the ability to restart drivers explicitly isn't a bad idea. However, I think to really address the underlying issue of drivers getting stuck, we probably need to make it happen automatically as you suggested in (1). I'm a little reluctant to implement (2) until it's demonstrated to be really necessary, because it encourages people to bake hacks into external scripts that will make it hard for us to debug issues in production.

I'm not sure how the libcontainer plumbing affects this, but it should be possible to kill the driver process with a signal when the context times out. For an ordinary subprocess this is already implemented in the standard library; but we might need to do something extra to work around the container management.
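
For reference, here is a minimal sketch of the standard-library behaviour mentioned above, assuming a plain subprocess rather than a libcontainer-managed one (the sleep command simply stands in for a stuck driver process):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Give the (stand-in) driver two seconds to respond.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// exec.CommandContext kills the child process once ctx expires.
	cmd := exec.CommandContext(ctx, "sleep", "10")
	err := cmd.Run()

	// After ~2s, Run returns an error and ctx.Err() reports the deadline.
	fmt.Println(err, ctx.Err())
}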

dennwc (Member) commented Jun 29, 2019

Totally agree. As discussed on Slack, we could kill the driver after userTimeout + parseTimeout; this shouldn't affect the pool too much.

vmarkovtsev (Contributor) commented

I think this is related to #303

kuba-- self-assigned this Sep 11, 2019
kuba-- (Member) commented Sep 11, 2019

I assume we all agreed that new functionality (a --timeout argument) should be added to each driver, so that bblfshctl can pass it.

creachadair (Contributor) commented

I don't think a flag is necessarily required: The key behaviour is that when the context is cancelled, the daemon needs to terminate the driver container serving that request. Once we have that, we could of course add a flag to set the timeout—but the client can also set its own timeout without any software changes on our side. The missing feature right now is the termination step.
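
As a rough illustration of the client-side half, assuming a placeholder Parser interface rather than the real bblfsh protocol: the caller attaches a deadline to the request context, gRPC propagates it to the daemon, and the only missing piece on our side is terminating the driver once that deadline fires.

package client

import (
	"context"
	"time"
)

// Parser stands in for whatever gRPC client the caller uses; the method
// name and signature are illustrative only.
type Parser interface {
	Parse(ctx context.Context, source string) (string, error)
}

// parseWithTimeout bounds a request purely on the client side: the
// deadline travels with the gRPC call, and once it expires the server
// observes the cancellation, which is the point where the daemon would
// need to terminate the stuck driver container.
func parseWithTimeout(p Parser, source string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return p.Parse(ctx, source)
}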

dennwc (Member) commented Sep 12, 2019

I agree with @creachadair: the cancellation is already propagated to the Go driver server, so we only need to kill the native driver after receiving the cancellation signal.

To be specific, here is the relevant piece of code:
https://github.com/bblfsh/sdk/blob/master/driver/native/native.go#L255
We already stop waiting for the response if we see a timeout, but we also need to terminate the native driver to avoid further processing.
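
A rough sketch of what that extra step could look like, with all names (nativeDriver, cmd, respCh, call) made up for illustration rather than taken from native.go:

package native

import (
	"context"
	"os/exec"
)

// nativeDriver is an illustrative stand-in for the wrapper around the
// spawned native driver process.
type nativeDriver struct {
	cmd    *exec.Cmd     // the running native driver process
	respCh chan response // responses decoded from the driver's stdout
}

type response struct {
	data []byte
	err  error
}

// call waits for a response, but on timeout it also kills the native
// process instead of only abandoning the wait, so a stuck driver does
// not keep processing in the background.
func (d *nativeDriver) call(ctx context.Context) ([]byte, error) {
	select {
	case r := <-d.respCh:
		return r.data, r.err
	case <-ctx.Done():
		_ = d.cmd.Process.Kill() // terminate the stuck native driver
		return nil, ctx.Err()
	}
}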

kuba-- (Member) commented Sep 12, 2019

I've started working on this, but so far I have mainly focused on adding something like the following here:
https://github.com/bblfsh/bblfshd/blob/master/daemon/service.go#L75

select {
case <-ctx.Done():
	pool.killDriver(drv)
	pool.scale()
	return nil, ctx.Err()
// ...
}

This assumes that the ctx passed here, and on to the Parser, will (or may) have a timeout, which we could eventually take from a command-line flag. WDYT?

In native.go we are inside the driver, but we need to catch the timeout outside the driver to be able to restart it.

dennwc (Member) commented Sep 12, 2019

@kuba-- I think it's worth adding it on both sides, actually. In bblfshd we can notice the parseTimeout, wait a bit more (userTimeout), and if the request is still blocked, kill the driver completely.

but we need to catch the timeout outside the driver

The driver will notice the timeout through gRPC just fine, so we can try killing it at the native level first.
