Your current environment

The output of `python collect_env.py`
(vllm code copied from this PR (@84789334a) was used: #8574)

Model Input Dumps

No response

🐛 Describe the bug

I was benchmarking the performance of guided decoding using the lm-format-enforcer backend. Here's the artillery snippet:

Where payloads.csv has some random short texts (<100 input tokens each).
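(The actual artillery config isn't reproduced above; purely as an illustration, each request in the load test looks roughly like the following. The model name, prompt, and JSON schema are placeholders, and the guided-decoding fields are the extra parameters accepted by vLLM's OpenAI-compatible completions endpoint.)

import requests

# Hypothetical single request of the kind the load test fires in bulk.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-model",  # placeholder model name
        "prompt": "Give me a JSON object describing a cat.",  # short text from payloads.csv
        "max_tokens": 128,
        # Guided decoding: constrain the output to a (placeholder) JSON schema
        # and ask for the lm-format-enforcer backend for this request.
        "guided_json": {"type": "object", "properties": {"name": {"type": "string"}}},
        "guided_decoding_backend": "lm-format-enforcer",
    },
    timeout=60,
)
print(resp.status_code, resp.json())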
The server eventually crashes due to the health check timing out:
ERROR 09-18 21:04:43 client.py:261] TimeoutError("MQLLMEngine didn't reply within 10000ms")
ERROR 09-18 21:04:43 client.py:261] Traceback (most recent call last):
ERROR 09-18 21:04:43 client.py:261] File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 157, in run_check_health_loop
ERROR 09-18 21:04:43 client.py:261] await self._await_ack(error_message="Health check failed.",
ERROR 09-18 21:04:43 client.py:261] File "/workspace/my-vllm/lib64/python3.11/site-packages/vllm/engine/multiprocessing/client.py", line 308, in _await_ack
ERROR 09-18 21:04:43 client.py:261] raise TimeoutError("MQLLMEngine didn't reply within "
ERROR 09-18 21:04:43 client.py:261] TimeoutError: MQLLMEngine didn't reply within 10000ms
CRITICAL 09-18 21:04:44 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.131.3.24:42640 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 21:04:44 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.131.3.24:42640 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 21:04:44 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.131.3.24:42640 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-18 21:04:44 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 10.131.3.24:42640 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for connections to close. (CTRL+C to force quit)
I then added a small print statement (sketched below) to see how long self.engine_step() takes in the MQLLMEngine. Every now and then a step takes multiple seconds, where it's usually sub-second, possibly because of large amounts of prefill happening. Whatever is taking that long prevents the engine from responding to a health check, since inputs from the client are only read after each step completes.
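For reference, a minimal sketch of that kind of instrumentation (the exact print statement isn't shown in the issue; the wrapper and 1-second threshold below are my own):

import functools
import time

def print_if_slow(fn, threshold_s: float = 1.0):
    """Wrap a callable and print a line whenever a single call exceeds threshold_s."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if elapsed > threshold_s:
            # Steps are normally sub-second; multi-second steps delay the
            # health-check handling that only runs between steps.
            print(f"{fn.__name__} took {elapsed:.2f}s")
        return result
    return wrapper

# Hypothetical usage: wrap the method wherever self.engine_step() is set up
# inside MQLLMEngine, e.g.
#   self.engine_step = print_if_slow(self.engine_step)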
I can dig in more, but it seems not great that it's this easy to knock over the server
After looking into this more, it appears that the logits processors coming over the zmq wire are roughly 4 MB. They usually unpack in under a second, but under load the calls to cloudpickle.loads() sometimes take longer, and appear to hold the GIL while doing so. Because the GIL is held, even solutions like #8583 do not fix the problem of the client losing its connection with the engine and exiting.
I don't know why unpickling the logits processor is sometimes slow: maybe reading the bytes from the buffer is slow under load when there's 1GB+ coming into the socket, maybe there's contention with other inference work going on, maybe cloudpickle drops the ball. Unclear so far.
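For context, a rough standalone way to see how long deserializing a payload of that size takes (the payload below is a made-up stand-in, not the actual lm-format-enforcer logits processor, so sizes and timings will differ):

import time
import cloudpickle

# Stand-in for a logits processor that closes over a large allowed-token table,
# sized to be on the order of the ~4 MB payloads seen on the zmq socket.
allowed_tokens = list(range(1_000_000))

def fake_logits_processor(token_ids, logits):
    # Real processors mask logits; this one just returns them unchanged.
    return logits

blob = cloudpickle.dumps((fake_logits_processor, allowed_tokens))
print(f"serialized size: {len(blob) / 1e6:.1f} MB")

start = time.perf_counter()
cloudpickle.loads(blob)
elapsed = time.perf_counter() - start
# loads() runs entirely under the GIL, so nothing else in that process
# (including replying to health checks) makes progress while it runs.
print(f"cloudpickle.loads() took {elapsed:.3f}s")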