I want to use the OpenAI library to do offline batch inference, leveraging Ray (for scaling and scheduling) on top of vLLM.

Context: The plan is to build a FastAPI service that closely mimics OpenAI's Batch API and can process a large number of prompts (tens of thousands) within 24h. There are a few ways to achieve this with vLLM, but each has an important drawback — though maybe I am missing something:
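For reference, OpenAI's Batch API takes a JSONL file where each line is one self-contained request, so a service mimicking it would need to accept and emit the same shape. A minimal sketch of building such an input file (the model name and prompts are placeholders, not anything from vLLM):

```python
import json

# Placeholder prompts; a real batch would hold tens of thousands.
prompts = ["What is vLLM?", "Explain Ray in one sentence."]

lines = []
for i, prompt in enumerate(prompts):
    # Each JSONL line follows OpenAI's batch input shape:
    # a custom_id, an HTTP method, a target endpoint, and the request body.
    request = {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "my-model",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
    }
    lines.append(json.dumps(request))

# This string is what a mimicking service would accept as the batch input file.
batch_input = "\n".join(lines)
```

The output file mirrors this: one JSONL line per request, matched back to the caller via `custom_id`, which is why the format is convenient for sharding work across machines.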
1. There is an existing guide in the docs that uses the `LLM` class with Ray. While the `LLM` class shares OpenAI's sampling parameters, it lacks the important OpenAI prompt templating.
2. The `run_batch.py` entrypoint that was introduced here would be the simplest option, but it does not support Ray out of the box.
3. The third option would be to use the `AsyncLLMEngine` as done here and bundle it with `OpenAIServingChat`, as has been done in `run_batch.py`. But this would entail some (potential) performance degradation from going async, even though async is not really needed for offline batch inference.
4. The fourth option could be to use Ray Serve, as in this example from Ray's docs. But this lacks the OpenAI batch format and is — again — async.
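One pragmatic middle ground between the options above is to split the batch into shards and fan the shards out to Ray workers, each running vLLM independently. A minimal sketch of just the sharding logic, assuming (hypothetically) that each shard would then be handed to a Ray actor wrapping the `LLM` class — the actor itself is only shown as a comment:

```python
def shard(items, num_shards):
    """Split items into num_shards near-equal contiguous chunks."""
    base, extra = divmod(len(items), num_shards)
    chunks, start = [], 0
    for i in range(num_shards):
        # The first `extra` chunks get one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

# Placeholder request ids standing in for batch-file lines.
requests = [f"request-{i}" for i in range(10)]
shards = shard(requests, 3)

# Each shard would then go to a Ray actor (hypothetical sketch):
#   workers = [VLLMWorker.remote(model="my-model") for _ in range(3)]
#   results = ray.get([w.generate.remote(s) for w, s in zip(workers, shards)])
```

Because each shard is a valid stand-alone batch, the per-shard outputs can be concatenated and re-sorted by `custom_id` to reconstruct a single OpenAI-style output file.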
Maybe this helps other people as well. Would be super grateful for some feedback. 🙂
And thanks a ton for this very nice piece of software and the great community!