
Significant Inference Time Increase with Multiple Models in OpenVINO Model Server #3136

Open
sriram-dsl opened this issue Mar 20, 2025 · 3 comments


@sriram-dsl


Environment

  • Operating System: Ubuntu 24.04
  • OpenVINO Version: openvino/model_server:latest (Docker container)
  • Hardware: 12th Gen Intel(R) Core(TM) i3-1220P
  • Models: YOLOv5 models converted to OpenVINO IR format (.xml and .bin), FP32 precision
  • Deployment: Docker container with OpenVINO Model Server

Issue

I deployed the OpenVINO Model Server container with a single YOLOv5 model (FP32 precision) and observed inference times of 8-20 milliseconds per request, which is acceptable. However, when I load 4 YOLOv5 models on the same server, the inference time spikes to 30-100 milliseconds per model request. This significant increase in latency occurs despite using parallelism in my client script (via ThreadPoolExecutor) and setting "nireq": 4 per model in the server configuration.

This spike leads to higher hardware resource usage (e.g., CPU/GPU contention) and impacts real-time performance. I expected multi-model inference to maintain closer to single-model latency with proper resource allocation, especially given OpenVINO's support for parallel inference.

Logs

Single Model (model2)

[2025-03-20 17:30:41.135] Prediction duration in model model2, version 1, nireq 0: 15.680 ms
[2025-03-20 17:30:41.135] Total gRPC request processing time: 15.861 ms
[2025-03-20 17:30:41.266] Prediction duration in model model2, version 1, nireq 0: 24.077 ms
[2025-03-20 17:30:41.266] Total gRPC request processing time: 24.306 ms
[2025-03-20 17:30:41.383] Prediction duration in model model2, version 1, nireq 0: 15.227 ms
[2025-03-20 17:30:41.383] Total gRPC request processing time: 15.452 ms

Multi-Model (4 models loaded)

[2025-03-20 18:17:15.523] Prediction duration in model model1, version 1, nireq 0: 42.076 ms
[2025-03-20 18:17:15.523] Total gRPC request processing time: 42.317 ms
[2025-03-20 18:17:15.530] Prediction duration in model model2, version 1, nireq 0: 46.367 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 46.606 ms
[2025-03-20 18:17:15.530] Prediction duration in model model3, version 1, nireq 0: 45.479 ms
[2025-03-20 18:17:15.530] Total gRPC request processing time: 45.68 ms
[2025-03-20 18:17:15.514] Prediction duration in model model4, version 1, nireq 0: 27.955 ms
[2025-03-20 18:17:15.514] Total gRPC request processing time: 28.175 ms

Configuration

  • Docker Command:
    sudo docker run -d --shm-size=23g --ulimit memlock=-1 --ulimit stack=67108864 --name openvino_model_server -v /home/ubuntu/models:/models -p 900:9000 -p 811:8000 openvino/model_server:latest --config_path /models/config.json --port 9000 --rest_port 8000 --metrics_enable --log_level DEBUG
    
    
  • config.json:
    {
      "model_config_list": [
        {"name": "model1", "base_path": "/models/model1", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model2", "base_path": "/models/model2", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model3", "base_path": "/models/model3", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}},
        {"name": "model4", "base_path": "/models/model4", "nireq": 4, "plugin_config": {"PERFORMANCE_HINT": "LATENCY"}}
      ]
    }


Steps to Reproduce

  1. Deploy OpenVINO Model Server with a single YOLOv5 model (FP32) using the above command and a config.json containing only model2.
  2. Send gRPC inference requests (e.g., via ovmsclient) and measure latency from logs or metrics endpoint (http://localhost:811/metrics).
  3. Update config.json to include 4 YOLOv5 models (model1, model2, model3, model4).
  4. Restart the container and send parallel gRPC requests for all 4 models using a Python script with ThreadPoolExecutor (a sketch of such a client follows this list).
  5. Compare inference times from logs.
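
For reference, a minimal client along those lines might look like the sketch below. It assumes the gRPC port mapping from the docker command above (host port 900 -> container port 9000), a YOLOv5 input tensor named "images" with shape 1x3x640x640 (FP32), and the ovmsclient package; adjust these to match your actual export.

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

# gRPC endpoint as published by the docker command above (assumption: host 900 -> container 9000)
client = make_grpc_client("localhost:900")

# Dummy FP32 input; "images" and 1x3x640x640 are typical for YOLOv5 IR exports (assumption)
dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)

def infer(model_name):
    # Send one request and return the client-side latency in milliseconds
    start = time.perf_counter()
    client.predict(inputs={"images": dummy_input}, model_name=model_name)
    return model_name, (time.perf_counter() - start) * 1000

# One request per model, sent in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    for name, latency_ms in pool.map(infer, ["model1", "model2", "model3", "model4"]):
        print(f"{name}: {latency_ms:.1f} ms")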

Expected Behavior

With 4 models loaded and parallel inference enabled (nireq=4), I expect inference times to remain close to single-model performance (e.g., 20-30 ms total latency across all models), leveraging OpenVINO's multi-stream capabilities and parallel execution.

Actual Behavior

Inference time per model increases significantly (30-100 ms per request), indicating resource contention or inefficient multi-model handling. For example, model2 jumps from 15-24 ms (single model) to 46.367 ms (multi-model).

Suggestions for optimizing resource allocation or server configuration to maintain low latency with multiple YOLOv5 models would be greatly appreciated.

Thanks in advance!

@dtrawins
Collaborator

@sriram-dsl Duplicating the models is not the recommended method for scalability and concurrency. The nireq parameter also does not enable parallel processing by itself - it is the size of the request queue.
If you have just one model and want to optimize it for processing 4 concurrent requests, you should increase the number of streams in the model and use just that one model. Set the parameter NUM_STREAMS. If you have 4 clients, two streams will probably be optimal. It depends on how long each client spends processing the response and sending the next request; a single synchronous client will likely not be able to fully utilize one stream. Try setting four streams to confirm. Nireq is by default equal to the number of streams, which is usually sufficient. You can increase it, but it shouldn't be lower. Note that increasing the number of streams improves throughput for most models but also impacts latency, because each stream gets only a share of the compute resources.
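
For illustration, a single-model configuration with an explicit stream count could look like the sketch below. The model name and base path are carried over from the report above, and the value of 4 is only a starting point to tune for your hardware:

{
    "model_config_list": [
        {
            "config": {
                "name": "model2",
                "base_path": "/models/model2",
                "plugin_config": {
                    "NUM_STREAMS": "4"
                }
            }
        }
    ]
}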

In case you have 4 completely different models, the recommendations are different, because streams are managed within the scope of a single model. With multiple models you should use just one stream per model (the default setting) and reserve resources for each model. There are dedicated parameters which can reserve and pin individual CPU cores to each model. Consider this config.json:

{
    "model_config_list": [
        {
            "config": {
                "name": "model1",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model2",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        },
        {
            "config": {
                "name": "model3",
                "base_path": "/model/resnet-50-tf",
                "plugin_config": {
                    "INFERENCE_NUM_THREADS": "2",
                    "ENABLE_CPU_RESERVATION": "true",
                    "ENABLE_CPU_PINNING": "true"
                }
            }
        }
    ]
}

It will assign 2 dedicated CPU cores to each model and ensure that execution on one model does not impact the rest.

@sriram-dsl
Author

sriram-dsl commented Mar 26, 2025

How can I measure throughput effectively? Is there a tool similar to NVIDIA’s perf_analyzer that I can use for this purpose? Additionally, if I deploy a container within a Kubernetes cluster to enable scaling, how can I scale the models running inside each container?

@dtrawins
Collaborator

@sriram-dsl you can use perf_analyzer. The KServe API in OVMS is compatible with Triton, so the same benchmarking tool can be used. You can also use this tool: https://github.com/openvinotoolkit/model_server/tree/main/demos/benchmark/python.
In Kubernetes you can enable horizontal autoscaling to adjust the number of replicas depending on the load. If you prefer to deploy just a single container, you can scale the container vertically by adding more resources. If you want to optimize for best throughput, add "PERFORMANCE_HINT": "THROUGHPUT" to the plugin_config; if you want to optimize for latency, set "PERFORMANCE_HINT": "LATENCY".
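
If you just want a quick ad-hoc number before setting up one of those tools, a simple client-side loop can also give a rough throughput estimate. The sketch below reuses the assumptions from earlier in the thread (ovmsclient, gRPC on localhost:900, a YOLOv5 input named "images"); it is only an approximation, since client-side overhead is included in the measurement.

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:900")
data = np.random.rand(1, 3, 640, 640).astype(np.float32)

DURATION_S = 10      # how long each worker keeps sending requests
WORKERS = 4          # number of concurrent clients

def worker(_):
    # Send requests back-to-back until the deadline and count how many completed
    count = 0
    deadline = time.perf_counter() + DURATION_S
    while time.perf_counter() < deadline:
        client.predict(inputs={"images": data}, model_name="model2")
        count += 1
    return count

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(worker, range(WORKERS)))

print(f"~{total / DURATION_S:.1f} requests/second with {WORKERS} concurrent clients")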
