
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server #3136

Open
sriram-dsl opened this issue Mar 20, 2025 · 3 comments


@sriram-dsl


Environment

  • Operating System: Ubuntu 24.04
  • OpenVINO Version: openvino/model_server:latest (Docker container)
  • Hardware: 12th Gen Intel(R) Core(TM) i3-1220P
  • Models: YOLOv5 models converted to OpenVINO IR format (.xml and .bin), FP32 precision
  • Deployment: Docker container with OpenVINO Model Server

Issue

I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration.

This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to maintain closer to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.

Logs

Single Model (model2)

[2025-03-20 17:30:41.135] Prediction duration in model model2, version 1, nireq 0: 15.680 ms
[2025-03-20 17:30:41.135] Total gRPC request processing time: 15.861 ms
[2025-03-20 17:30:41.266] Prediction duration in model model2, version 1, nireq 0: 24.077 ms
[2025-03-20 17:30:41.266] Total gRPC request processing time: 24.306 ms
[2025-03-20 17:30:41.383] Prediction duration in model model2, version 1, nireq 0: 15.227 ms
[2025-03-20 17:30:41.383] Total gRPC request processing time: 15.452 ms

Multi-Model (4 models loaded)

[2025-03-20 18:17:15.523] Prediction duration in model model1, version 1, nireq 0: 42.076 ms
[2025-03-20 18:17:15.523] Total gRPC request processing time: 42.317 ms
[2025-03-20 18:17:15.530] Prediction duration in model model2, version 1, nireq 0: 46.367 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 46.606 ms
[2025-03-20 18:17:15.530] Prediction duration in model model3, version 1, nireq 0: 45.479 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 45.68 ms
[2025-03-20 18:17:15.514] Prediction duration in model model4, version 1, nireq 0: 27.955 ms
[2025-03-20 18:17:15.514] Total gRPC request processing time: 28.175 ms

Configuration

  • Docker Command:
    sudo docker run -d --shm-size=23g --ulimit memlock=-1 --ulimit stack=67108864 --name openvino_model_server -v /home/ubuntu/models:/models -p 900:9000 -p 811:8000 openvino/model_server:latest --config_path /models/config.json --port 9000 --rest_port 8000 --metrics_enable --log_level DEBUG
    
    
  • config.json:
    {
      "model_config_list": [
        {"name": "model1", "base_path": "/models/model1", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model2", "base_path": "/models/model2", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model3", "base_path": "/models/model3", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model4", "base_path": "/models/model4", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}
      ]
    }


Steps to Reproduce

  1. Deploy OpenVINO Model Server with a single YOLOv5 model (FP32) using the above command and a config.json containing only model2.
  2. Send gRPC inference requests (e.g., via ovmsclient) and measure latency from logs or metrics endpoint (http://localhost:811/metrics).
  3. Update config.json to include 4 YOLOv5 models (model1, model2, model3, model4).
  4. Restart the container and send parallel gRPC requests for all 4 models using a Python script with ThreadPoolExecutor (a sketch of such a client follows this list).
  5. Compare inference times from logs.
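
For reference, a minimal client along those lines might look like the sketch below. It assumes the gRPC port mapping from the docker command above (host port 900 -> container port 9000), a YOLOv5 input tensor named "images" with shape 1x3x640x640 (FP32), and the ovmsclient package; adjust these to match your actual export.

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

# gRPC endpoint as published by the docker command above (assumption: host 900 -> container 9000)
client = make_grpc_client("localhost:900")

# Dummy FP32 input; "images" and 1x3x640x640 are typical for YOLOv5 IR exports (assumption)
dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)

def infer(model_name):
    # Send one request and return the client-side latency in milliseconds
    start = time.perf_counter()
    client.predict(inputs={"images": dummy_input}, model_name=model_name)
    return model_name, (time.perf_counter() - start) * 1000

# One request per model, sent in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    for name, latency_ms in pool.map(infer, ["model1", "model2", "model3", "model4"]):
        print(f"{name}: {latency_ms:.1f} ms")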

Expected Behavior

With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.

Actual Behavior

Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).

Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.

Thanks in advance!

@dtrawins
Collaborator

@sriram-dsl Duplicating the models is not the recommended method for scalability and concurrency. The nireq parameter also does not enable parallel processing by itself - it is the size of the request queue.
If you have just one model and want to optimize it for processing 4 concurrent requests, you should increase the number of streams in the model and use just that one model. Set the parameter NUM_STREAMS. If you have 4 clients, two streams will probably be optimal. It depends on how long each client spends processing the response and sending the next request; a single synchronous client will likely not be able to fully utilize one stream. Try setting four streams to confirm. Nireq is by default equal to the number of streams, which is usually sufficient. You can increase it, but it shouldn't be lower. Note that increasing the number of streams improves throughput for most models but also impacts latency, because each stream gets only a share of the compute resources.
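
For illustration, a single-model configuration with an explicit stream count could look like the sketch below. The model name and base path are carried over from the report above, and the value of 4 is only a starting point to tune for your hardware:

{
    "model_config_list": [
        {
            "config": {
                "name": "model2",
                "base_path": "/models/model2",
                "plugin_config": {
                    "NUM_STREAMS": "4"
                }
            }
        }
    ]
}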

In case you have 4 completely different models, the recommendations are different, because streams are managed within the scope of a single model. With multiple models you should use just one stream per model (the default setting) and reserve resources for each model. There are dedicated parameters which can reserve and pin individual CPU cores to each model. Consider this config.json:

{
    "model_config_list": [
        {
            "config": {
                "name": "model1",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model2",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model3",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        }
    ]
}

It will assign 2 dedicated CPU cores to each model and ensure that execution on one model does not impact the rest.

@sriram-dsl
Author

sriram-dsl commented Mar 26, 2025

How can I measure throughput effectively? Is there a tool similar to NVIDIA’s perf_analyzer that I can use for this purpose? Additionally, if I deploy a container within a Kubernetes cluster to enable scaling, how can I scale the models running inside each container?

@dtrawins
Collaborator

@sriram-dsl you can use perf_analyzer. The KServe API in OVMS is compatible with Triton, so the same benchmarking tool can be used. You can also use this tool: https://github.com/openvinotoolkit/model_server/tree/main/demos/benchmark/python.
In Kubernetes you can enable horizontal autoscaling to adjust the number of replicas depending on the load. If you prefer to deploy just a single container, you can scale the container vertically by adding more resources. If you want to optimize for best throughput, add "PERFORMANCE_HINT": "THROUGHPUT" to the plugin_config; if you want to optimize for latency, set "PERFORMANCE_HINT": "LATENCY".
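
If you just want a quick ad-hoc number before setting up one of those tools, a simple client-side loop can also give a rough throughput estimate. The sketch below reuses the assumptions from earlier in the thread (ovmsclient, gRPC on localhost:900, a YOLOv5 input named "images"); it is only an approximation, since client-side overhead is included in the measurement.

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:900")
data = np.random.rand(1, 3, 640, 640).astype(np.float32)

DURATION_S = 10      # how long each worker keeps sending requests
WORKERS = 4          # number of concurrent clients

def worker(_):
    # Send requests back-to-back until the deadline and count how many completed
    count = 0
    deadline = time.perf_counter() + DURATION_S
    while time.perf_counter() < deadline:
        client.predict(inputs={"images": data}, model_name="model2")
        count += 1
    return count

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(worker, range(WORKERS)))

print(f"~{total / DURATION_S:.1f} requests/second with {WORKERS} concurrent clients")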
