Significant Inference Time Increase with Multiple Models in OpenVINO Model Server #3136
Comments
@sriram-dsl Duplicating the models is not the recommended method for scalability and multi-concurrency. The nireq parameter does not enable parallel processing either - it is the size of the request queue. When you have 4 completely different models, the recommendations are different, because streams are managed in the scope of a single model. With multiple models you should use just one stream per model (the default setting) and reserve the resources for each model. There are dedicated parameters that reserve and pin individual cores to each model. Consider this config.json:

{
    "model_config_list": [
        {
            "config": {
                "name": "model1",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model2",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model3",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        }
    ]
}
It will assign 2 different CPU cores to each model and ensure that execution of one model does not impact the rest.
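To confirm that every model in such a configuration loaded correctly, the server's config status endpoint can be queried. A minimal sketch, assuming the REST API is exposed on port 8000 (adjust to your --rest_port):

import requests

# Query the OVMS config status endpoint; port 8000 is an assumption.
response = requests.get("http://localhost:8000/v1/config")
response.raise_for_status()

# The response maps each served model name to the status of its versions.
for model_name, status in response.json().items():
    print(model_name, status)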
How can I measure throughput effectively? Is there a tool similar to NVIDIA's perf_analyzer that I can use for this purpose? Additionally, if I deploy a container within a Kubernetes cluster to enable scaling, how can I scale the models running inside each container?
@sriram-dsl you can use perf_analyzer. The KServe API in OVMS is compatible with Triton, so the same benchmarking tool can be used. You can also use this tool: https://github.com/openvinotoolkit/model_server/tree/main/demos/benchmark/python.
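If a quick scripted measurement is preferred over perf_analyzer, throughput can also be estimated directly against the KServe gRPC endpoint with the Triton client library. A minimal sketch, assuming gRPC is exposed on port 9000 and a YOLOv5 input tensor named "images" with shape [1, 3, 640, 640] (check the actual input name and shape in your model's metadata):

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.grpc as grpcclient

URL = "localhost:9000"      # assumed gRPC port
MODEL = "model2"            # model name taken from the server config
REQUESTS = 200
CONCURRENCY = 4

def send_one(_):
    # A client per call keeps the sketch simple; reusing clients lowers overhead.
    client = grpcclient.InferenceServerClient(url=URL)
    data = np.random.rand(1, 3, 640, 640).astype(np.float32)
    inp = grpcclient.InferInput("images", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    client.infer(model_name=MODEL, inputs=[inp])

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(send_one, range(REQUESTS)))
elapsed = time.time() - start
print(f"{REQUESTS / elapsed:.1f} requests/sec at concurrency {CONCURRENCY}")

perf_analyzer additionally provides concurrency sweeps and latency percentiles out of the box, so it remains the better choice for systematic benchmarking.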
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server
Environment
Server: openvino/model_server:latest (Docker container)
Models: YOLOv5, OpenVINO IR format (.xml and .bin), FP32 precision
Issue
I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration. This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to maintain closer to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.
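For reference, a parallel client of the kind described above might look like the following minimal sketch (not the actual client script); it assumes gRPC on port 9000 and a YOLOv5 input tensor named "images" with shape [1, 3, 640, 640]:

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

# Assumptions: gRPC on localhost:9000 and an input tensor named "images";
# adjust both to match the deployed models.
client = make_grpc_client("localhost:9000")
models = ["model1", "model2", "model3", "model4"]

def infer(model_name):
    data = np.random.rand(1, 3, 640, 640).astype(np.float32)
    start = time.time()
    client.predict(inputs={"images": data}, model_name=model_name)
    return model_name, (time.time() - start) * 1000.0

# One request per model, issued in parallel, timing each request individually.
with ThreadPoolExecutor(max_workers=len(models)) as pool:
    for name, latency_ms in pool.map(infer, models):
        print(f"{name}: {latency_ms:.1f} ms")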
Logs
Single Model (model2)
Multi-Model (4 models loaded)
Configuration
Steps to Reproduce
1. Start the OpenVINO Model Server container with a config.json containing only model2.
2. Send requests and record the per-request inference times (and the server metrics at http://localhost:811/metrics).
3. Reload the server with a config.json containing all 4 models (model1, model2, model3, model4).
4. Repeat the measurements and compare.
Expected Behavior
With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.
Actual Behavior
Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).
Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.
Thanks in advance!