
Improve cuda decoding performance by ~2x using decoder caching #258

Merged: 41 commits from the cuda4 branch into pytorch:main on Oct 11, 2024

Conversation

@ahmadsharif1 (Contributor) commented Oct 10, 2024

Creating a GPU decoder is very expensive. To reduce the time spent creating decoders, this PR reuses them where possible:

  1. When we are done with a GPU decoder, instead of deallocating it we add it to a per-GPU cache.
  2. When we are asked to create a new decoder, we first try to take one from the cache.
  3. Only if the cache is empty do we create a new GPU decoder.

This doesn't touch the CPU code path.
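
A minimal sketch of this caching scheme, for illustration only; the names (CachedDecoderState, acquireDecoder, releaseDecoder, kMaxGpus) are assumptions and not the actual torchcodec implementation:

#include <memory>
#include <mutex>
#include <vector>

// Stand-in for the expensive GPU decoder state (in practice this would hold
// FFmpeg/NVDEC resources such as the hardware device context).
struct CachedDecoderState {};

// Hypothetical expensive creation path; the real one initializes CUDA/NVDEC.
std::unique_ptr<CachedDecoderState> createNewGpuDecoder(int /*deviceIndex*/) {
  return std::make_unique<CachedDecoderState>();
}

constexpr int kMaxGpus = 64;
// One cache (and one lock) per GPU, so decoders are only reused on the
// device they were created for.
std::mutex gCacheMutex[kMaxGpus];
std::vector<std::unique_ptr<CachedDecoderState>> gCache[kMaxGpus];

// Step 1: when a decoder is no longer needed, park its state in the per-GPU
// cache instead of deallocating it.
void releaseDecoder(int deviceIndex, std::unique_ptr<CachedDecoderState> d) {
  std::scoped_lock lock(gCacheMutex[deviceIndex]);
  gCache[deviceIndex].push_back(std::move(d));
}

// Steps 2 and 3: prefer a cached decoder; only create a new one if the cache
// for this GPU is empty.
std::unique_ptr<CachedDecoderState> acquireDecoder(int deviceIndex) {
  {
    std::scoped_lock lock(gCacheMutex[deviceIndex]);
    auto& cache = gCache[deviceIndex];
    if (!cache.empty()) {
      auto d = std::move(cache.back());
      cache.pop_back();
      return d;
    }
  }
  return createNewGpuDecoder(deviceIndex);
}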

Results show a ~2x improvement in the benchmark:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/853.mp4 --num_videos 20 --num_threads=5
[--------------------- Decode+Resize Time ---------------------]
                              |  threads=5 work=20 video=853.mp4
1 threads: -----------------------------------------------------
      D=cuda R=none T=5 W=20  |                19.5             

Times are in seconds (s).

Key: D=Decode device, R=Resize device, T=threads, W=work (number of videos to decode)
Native resize is done as part of the decode step.
A resize of "none" means there is no resize step, native or otherwise.

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none --video ~/jupyter/853.mp4 --num_videos 20 --num_threads=5
[--------------------- Decode+Resize Time ---------------------]
                              |  threads=5 work=20 video=853.mp4
1 threads: -----------------------------------------------------
      D=cuda R=none T=5 W=20  |                9.8              

Times are in seconds (s).

It also improves single-threaded GPU decoding:

Before:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none
[----------- Decode+Resize Time -----------]
                     |  video=nasa_13013.mp4
1 threads: ---------------------------------
      D=cuda R=none  |         795.5        

Times are in milliseconds (ms).

After:

python benchmarks/decoders/gpu_benchmark.py --devices=cuda:0 --resize_devices=none
[----------- Decode+Resize Time -----------]
                     |  video=nasa_13013.mp4
1 threads: ---------------------------------
      D=cuda R=none  |         540.3        

@facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Oct 10, 2024
@ahmadsharif1 changed the title from "Improve cuda decoding performance by caching decoders" to "Improve cuda decoding performance by caching decoders by about 2x" on Oct 10, 2024
@ahmadsharif1 changed the title from "Improve cuda decoding performance by caching decoders by about 2x" to "Improve cuda decoding performance by ~2x using decoder caching" on Oct 10, 2024
@ahmadsharif1 marked this pull request as ready for review on October 10, 2024 at 20:07
@NicolasHug (Member) left a comment:

Thanks @ahmadsharif1

I think a bit of documentation (as comments in the code) could be useful. Some questions that could be addressed there are:

  • What is cached
  • When it is stored in the cache
  • What the "hashing function" of the cache is (since it's different from the "what is cached" question)
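
For illustration, a hedged sketch of the kind of comment that could answer these questions. The wording is hypothetical and assumes the cache stores the hardware device context keyed by CUDA device index, as the surrounding diff suggests; it is not the text actually added in the PR:

// Hypothetical documentation sketch -- not the actual comment from the PR.
// What is cached: the expensive per-GPU decoder state (e.g. the hardware
// device context), not the whole per-video decoder.
// When it is stored: when a decoder that owns it is destroyed, the state is
// pushed into the cache instead of being freed.
// Cache key: the CUDA device index, so cached state is only reused on the
// same GPU it was created on.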

@NicolasHug (Member) commented on the following code:

const torch::Device& device,
AVCodecContext* codecContext) {
  throwErrorIfNonCudaDevice(device);
  AVBufferRef* hw_device_ctx = codecContext->hw_device_ctx;

I don't see hw_device_ctx being used. If this line is still necessary, can you add a comment to explain why?

@ahmadsharif1 (Contributor, PR author) replied:

Good catch. It was dead code.

@NicolasHug (Member) left a comment:

Thank you @ahmadsharif1

@facebook-github-bot (Contributor) commented:

@ahmadsharif1 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ahmadsharif1 merged commit 5a674cf into pytorch:main on Oct 11, 2024 (22 of 24 checks passed).
@ahmadsharif1 deleted the cuda4 branch on October 11, 2024 at 19:53.