
[AMD] [ROCm] [Optimum] Add optimum-amd support #443

Open · wants to merge 12 commits into main

Conversation

tjtanaa commented Oct 26, 2024

Description

Add optimum-amd support to infinity_emb.

Note: To use optimum-amd, it is recommended to either:

  1. Build from the docker image: docker build -f libs/infinity_emb/Dockerfile.amd_auto -t ghcr.io/embeddedllm/infinity-rocm:optimum-amd ./libs/infinity_emb
  2. Pull the docker image: docker pull ghcr.io/embeddedllm/infinity-rocm:optimum-amd (this is a docker image of this PR; hopefully a newer version of infinity will ship optimum-amd support at https://hub.docker.com/r/michaelf34/infinity). For now, as a quickstart, get it from EmbeddedLLM.
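
Before launching the container, it can help to confirm that the host actually exposes the ROCm devices the docker run commands below mount (a quick sketch; rocm-smi comes with the host's ROCm installation):

```bash
# List the GPU render nodes passed through via --device below
ls /dev/dri/renderD*

# Show the AMD GPUs detected by the host's ROCm stack
rocm-smi
```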

To launch the docker container:

  • Interactive mode:
    • Launch docker container
      #!/bin/bash
      
      docker run -it \
        --cap-add=SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --device=/dev/kfd \
        --device=/dev/dri/renderD128 \
        --device=/dev/dri/renderD136 \
        --group-add video \
        --network host \
        --entrypoint /bin/bash \
        ghcr.io/embeddedllm/infinity-rocm:optimum-amd \
        -c "source .venv/bin/activate && bash"
    • Launch Embedding Model
      HIP_VISIBLE_DEVICES=0 infinity_emb v2 --port 6909 --model-id BAAI/bge-m3 --model-warmup --device cuda  --engine optimum 
  • Single line:
  #!/bin/bash
  
  docker run -it \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device=/dev/kfd \
    --device=/dev/dri/renderD128 \
    --device=/dev/dri/renderD136 \
    --group-add video \
    --network host \
    --entrypoint /bin/bash \
    ghcr.io/embeddedllm/infinity-rocm:optimum-amd \
    -c "(source .venv/bin/activate) && (HIP_VISIBLE_DEVICES=0 infinity_emb v2 --port 6909 --model-id BAAI/bge-m3 --model-warmup --device cuda  --engine optimum)"

CHANGES

  • libs/infinity_emb/Dockerfile.amd_auto
    • Added installation steps for onnxruntime-rocm and optimum-amd.
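
The gist of that step, expressed as shell commands (an illustrative sketch only; the actual Dockerfile.amd_auto builds onnxruntime-rocm from source against ROCm rather than installing a pre-built wheel, so the package names here are assumptions):

```bash
# Sketch of the added dependencies, run inside the image's virtualenv;
# the real Dockerfile builds onnxruntime-rocm from source for ROCm.
source .venv/bin/activate
pip install optimum-amd       # Hugging Face Optimum support for AMD
pip install onnxruntime-rocm  # ONNX Runtime with the ROCMExecutionProvider
```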

Performance

These are the throughput figures obtained from the warm-up run of the infinity server:

| Configuration | Performance |
| --- | --- |
| Torch only | model warmed up, between 476.69-2408.85 embeddings/sec at batch_size=32 |
| Torch Compile | model warmed up, between 487.85-2789.83 embeddings/sec at batch_size=32 |
| Optimum AMD | model warmed up, between 268.33-4903.16 embeddings/sec at batch_size=32 |

Running benchmark_embed

| Model | Requests # / sec (mean) | Time (seconds) |
| --- | --- | --- |
| infinity (torch + no compile + fa2 disabled) | 2.52 | 3.965 |
| infinity (torch + compile + fa2 disabled) | 0.52 (first run), 2.84 (second run) | 18.612, 3.517 |
| infinity (optimum-amd) | 1.33 | 7.523 |

Torch Only

[Screenshot: torch-only benchmark results]

Torch Compile

[Screenshot: torch-compile benchmark results]

Optimum-AMD

[Screenshots: Optimum-AMD benchmark runs, 2024-10-26]

tjtanaa marked this pull request as ready for review October 27, 2024 01:37
greptile-apps bot (Contributor) left a comment

PR Summary

This PR adds AMD GPU support to the Infinity embedding library through optimum-amd integration and ROCm compatibility.

  • Added ROCm support in device_to_onnx() function to enable AMD GPU execution via ROCMExecutionProvider
  • Added AMD-specific Docker deployment guide for MI200/MI300 GPUs with required device mounts and security configurations
  • Added build process for onnxruntime-rocm from source with ROCm 6.2.3 support in Dockerfile.amd_auto
  • Added performance benchmarks showing optimum-amd achieving higher peak throughput (4903 embeddings/sec) but lower average performance compared to torch-only mode
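
A quick way to verify inside the built image that ONNX Runtime actually exposes the provider device_to_onnx() is expected to select (a sketch; assumes the container's virtualenv is active):

```bash
# Print the execution providers the installed ONNX Runtime offers;
# on a working ROCm build the list should include 'ROCMExecutionProvider'.
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"
```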

3 file(s) reviewed, 6 comment(s)

Review comments were left on docs/docs/deploy.md and libs/infinity_emb/Dockerfile.amd_auto (4 outdated).
tjtanaa (Author) commented Oct 27, 2024

@michaelfeil

| Model | Requests # / sec (mean) | Time (seconds) |
| --- | --- | --- |
| infinity (torch + no compile + fa2 disabled) | 2.52 | 3.965 |
| infinity (torch + compile + fa2 disabled) | 0.52 (first run), 2.84 (second run) | 18.612, 3.517 |
| infinity (optimum-amd) | 1.33 | 7.523 |

Would running the same data samples through the infinity embedding server introduce bias into the benchmark values for the torch compile model? I launched the embedding server with warm-up, so why is there such a huge difference between the two benchmark runs?

Must I address all of the bot's comments for the PR to be merged?

michaelfeil (Owner) commented Oct 28, 2024

@tjtanaa I ignore the bot's style comments, but roughly 1 in 10 comments is useful.

I added some suggestions for improvement, e.g. for the dockerfile. If you don't have the capacity to work on it, I can make these changes in a couple of days. Thanks for the contribution again.

On which hardware did you run the above benchmarks? FYI, the --no-bettertransformer flag just disables torch.nested flash-attention, which is not supported on AMD. AMD should still be using a decent version of sdpa: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
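
(As a quick check of the SDPA toggles on the ROCm build — a sketch, run inside the container's virtualenv; note these report the currently enabled backends, not hardware support:)

```bash
# Ask PyTorch which scaled-dot-product-attention backends are enabled
python -c "import torch; print('flash:', torch.backends.cuda.flash_sdp_enabled(), 'mem_efficient:', torch.backends.cuda.mem_efficient_sdp_enabled(), 'math:', torch.backends.cuda.math_sdp_enabled())"
```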

tjtanaa (Author) commented Oct 28, 2024

> On which hardware did you run the above benchmarks? FYI, the --no-bettertransformer flag just disables torch.nested flash-attention, which is not supported on AMD. AMD should still be using a decent version of sdpa: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

The benchmark was run on an MI300X.

> @tjtanaa I ignore the bot's style comments, but roughly 1 in 10 comments is useful.
>
> I added some suggestions for improvement, e.g. for the dockerfile. If you don't have the capacity to work on it, I can make these changes in a couple of days. Thanks for the contribution again.

Which of these things should I improve on?

michaelfeil (Owner) commented:

@tjtanaa How should we continue from here? My laptop breaks down when building the wheel from scratch, and the pre-built rocm wheel is for Radeon, not for the MI series.

codecov-commenter commented Nov 6, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 52.38095% with 10 lines in your changes missing coverage. Please review.

Project coverage is 78.97%. Comparing base (7328a6e) to head (ae76a0c).
Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...nity_emb/infinity_emb/transformer/utils_optimum.py | 44.44% | 10 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #443      +/-   ##
==========================================
- Coverage   79.18%   78.97%   -0.22%     
==========================================
  Files          41       41              
  Lines        3248     3263      +15     
==========================================
+ Hits         2572     2577       +5     
- Misses        676      686      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

michaelfeil (Owner) commented:

Trying to get it merged soon!
