
Improve cuda decoding performance by ~2x using decoder caching #258

Merged: 41 commits from the cuda4 branch into pytorch:main on Oct 11, 2024

Conversation

@ahmadsharif1 (Contributor) commented Oct 10, 2024

Creating a GPU decoder is very expensive. To reduce the time spent creating decoders, this PR reuses them where possible:

  1. When we are done with a GPU decoder, instead of deallocating it we add it to a per-GPU cache.
  2. When we are asked to create a new decoder, we first try to take one from the cache.
  3. Only if the cache is empty do we create a new GPU decoder.

This doesn't touch the CPU code path.
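
A minimal sketch of this caching scheme, for illustration only; the names (CachedDecoderState, acquireDecoder, releaseDecoder, kMaxGpus) are assumptions and not the actual torchcodec implementation:

#include <memory>
#include <mutex>
#include <vector>

// Stand-in for the expensive GPU decoder state (in practice this would hold
// FFmpeg/NVDEC resources such as the hardware device context).
struct CachedDecoderState {};

// Hypothetical expensive creation path; the real one initializes CUDA/NVDEC.
std::unique_ptr<CachedDecoderState> createNewGpuDecoder(int /*deviceIndex*/) {
  return std::make_unique<CachedDecoderState>();
}

constexpr int kMaxGpus = 64;
// One cache (and one lock) per GPU, so decoders are only reused on the
// device they were created for.
std::mutex gCacheMutex[kMaxGpus];
std::vector<std::unique_ptr<CachedDecoderState>> gCache[kMaxGpus];

// Step 1: when a decoder is no longer needed, park its state in the per-GPU
// cache instead of deallocating it.
void releaseDecoder(int deviceIndex, std::unique_ptr<CachedDecoderState> d) {
  std::scoped_lock lock(gCacheMutex[deviceIndex]);
  gCache[deviceIndex].push_back(std::move(d));
}

// Steps 2 and 3: prefer a cached decoder; only create a new one if the cache
// for this GPU is empty.
std::unique_ptr<CachedDecoderState> acquireDecoder(int deviceIndex) {
  {
    std::scoped_lock lock(gCacheMutex[deviceIndex]);
    auto& cache = gCache[deviceIndex];
    if (!cache.empty()) {
      auto d = std::move(cache.back());
      cache.pop_back();
      return d;
    }
  }
  return createNewGpuDecoder(deviceIndex);
}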

Results show a ~2x improvement in the benchmark:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/853.mp4 --num_videos 20 --num_threads=5
[--------------------- Decode+Resize Time ---------------------]
                              |  threads=5 work=20 video=853.mp4
1 threads: -----------------------------------------------------
      D=cuda R=none T=5 W=20  |                19.5             

Times are in seconds (s).

Key: D=Decode device, R=Resize device, T=threads, W=work (number of videos to decode)
Native resize is done as part of the decode step.
A resize of "none" means there is no resize step, native or otherwise.

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/853.mp4 --num_videos 20 --num_threads=5
[--------------------- Decode+Resize Time ---------------------]
                              |  threads=5 work=20 video=853.mp4
1 threads: -----------------------------------------------------
      D=cuda R=none T=5 W=20  |                9.8              

Times are in seconds (s).

It also improves single-threaded GPU decoding:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none
[----------- Decode+Resize Time -----------]
                     |  video=nasa_13013.mp4
1 threads: ---------------------------------
      D=cuda R=none  |         795.5        

Times are in milliseconds (ms).

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none
[----------- Decode+Resize Time -----------]
                     |  video=nasa_13013.mp4
1 threads: ---------------------------------
      D=cuda R=none  |         540.3        

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Oct 10, 2024
@ahmadsharif1 changed the title from "Improve cuda decoding performance by caching decoders" to "Improve cuda decoding performance by caching decoders by about 2x" on Oct 10, 2024
@ahmadsharif1 changed the title from "Improve cuda decoding performance by caching decoders by about 2x" to "Improve cuda decoding performance by ~2x using decoder caching" on Oct 10, 2024
@ahmadsharif1 marked this pull request as ready for review on October 10, 2024 at 20:07
@NicolasHug (Member) left a comment:

Thanks @ahmadsharif1

I think a bit of documentation (as comments in the code) could be useful. Some questions that could be addressed there are:

  • What is cached
  • When it is stored in the cache
  • What the "hashing function" of the cache is (since it's different from the "what is cached" question)
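
For illustration, a hedged sketch of the kind of comment that could answer these questions. The wording is hypothetical and assumes the cache stores the hardware device context keyed by CUDA device index, as the surrounding diff suggests; it is not the text actually added in the PR:

// Hypothetical documentation sketch -- not the actual comment from the PR.
// What is cached: the expensive per-GPU decoder state (e.g. the hardware
// device context), not the whole per-video decoder.
// When it is stored: when a decoder that owns it is destroyed, the state is
// pushed into the cache instead of being freed.
// Cache key: the CUDA device index, so cached state is only reused on the
// same GPU it was created on.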

@NicolasHug (Member) commented on the following code:

const torch::Device& device,
AVCodecContext* codecContext) {
  throwErrorIfNonCudaDevice(device);
  AVBufferRef* hw_device_ctx = codecContext->hw_device_ctx;

I don't see hw_device_ctx being used. If this line is still necessary, can you add a comment to explain why?

@ahmadsharif1 (Contributor, PR author) replied:

Good catch. It was dead code.

@NicolasHug (Member) left a comment:

Thank you @ahmadsharif1

@facebook-github-bot (Contributor) commented:

@ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ahmadsharif1 merged commit 5a674cf into pytorch:main on Oct 11, 2024 (22 of 24 checks passed).
@ahmadsharif1 deleted the cuda4 branch on October 11, 2024 at 19:53.